Re: [webpages-l] sitemap_m.html
My oh my, I never imagined that the program would be attempting such a level of intelligence. The filenames were never a part of the public interface and have often been chosen rather whimsically. Certainly there has been little, if any, attempt made to regularize them. Considering that, it's a wonder that this works at all, much less as well as it does.

I had concluded, naively as it turns out, that the program was using a simpler algorithm, something akin to this: Traverse the tree top down, where a node is a link not previously seen, not a link offsite, and not a link above the current level. Repeat with the new node. If there are no more nodes at this level, pop up a level and continue with the next node. If at top level, quit.

Anyway, given these circumstances, what is your vision for the future of the sitemap file? It appears that it cannot reliably be made in full detail programmatically. One could hand-edit it to make it right, but that calls for continual maintenance whenever a new file is created in the future. And is anyone really ready to sign up for this job?

Rick

At 10:06 PM 8/31/99 -0400, Jim Eggert wrote:
Rick wrote:
Hmmm, I wonder why it sometimes misses files. For example, it finds the German Rheinland-Pfalz history page, but not the English version:
The program merely looks for associated filenames to make the file pairings. It fails here because I didn't know about or anticipate filenames with language codes used as an infix:
/gene/reg/RHE-PFA/rhein-p-his.html   (E)
/gene/reg/RHE-PFA/rhein-p-d-his.html (D)  <-- "-d" in the middle of the filename
The E (English) file isn't found correctly because its name isn't obviously derivable (at least to the program) from the D filename. In fact, the E file isn't found at all because the D file doesn't link to it, and neither does the parent of the D file. (The crawler I wrote only parses one file from a language multiplet; in this case, the German-language parent was parsed, while the English one was not.)
I could make the crawler look for infixed language tags in the filenames. This gives it more chances for false associations, however, so I would prefer that the German filename be changed to the simpler

/gene/reg/RHE-PFA/rhein-p-his-d.html (D)
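The suffix-based pairing can be sketched as follows. This is a Python illustration, not the program's actual MacPerl code; the function name and regex are assumptions, but they show why a trailing "-d" is paired while an infixed "-d" is missed:

```python
import re

def pair_language_variants(filenames):
    """Pair a German file 'name-d.html' with its English twin 'name.html'.

    Only a trailing '-d' before the extension is recognized, so a filename
    with an infixed language tag (e.g. 'rhein-p-d-his.html') is not paired.
    """
    names = set(filenames)
    pairs = []
    for name in filenames:
        m = re.match(r'^(.*)-d\.html$', name)
        if m:
            english = m.group(1) + '.html'
            if english in names:
                pairs.append((english, name))
    return pairs

files = [
    '/gene/reg/RHE-PFA/rhein-p-his.html',    # English
    '/gene/reg/RHE-PFA/rhein-p-d-his.html',  # German, infixed tag: missed
    '/gene/reg/RHE-PFA/rhein-p-his-d.html',  # German, suffixed tag: paired
]
pairs = pair_language_variants(files)
print(pairs)
```

Only the suffixed German file is matched to its English counterpart; the infixed one falls through, which is exactly the failure described above.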
Another case is (mea culpa)

/gene/reg/NSAC/schaumburg-lippe_adel.html     (D)
/gene/reg/NSAC/schaumburg-lippe_nobility.html (E)

and a simple file rename (with accompanying link updates) will fix this too.
A worse problem, in my opinion, is presented by the sometimes poorly-chosen page titles. This can only be cured by careful attention by the page authors.
-- =Jim Eggert EggertJ@LL.mit.edu
Rick wrote:
I had concluded, naively as it turns out, that the program was using a simpler algorithm, something akin to this:
Traverse the tree top down, where a node is a link not previously seen, not a link offsite, and not a link above the current level. Repeat with the new node. If there are no more nodes at this level, pop up a level and continue with the next node. If at top level, quit.
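The simpler depth-first scheme described above can be sketched like this. It is a Python sketch (the real crawler was MacPerl); the `links_of` helper and the site host are assumptions, and "not above the current level" is read strictly as "within the current directory subtree":

```python
from urllib.parse import urlparse
import posixpath

SITE = 'www.example.org'  # hypothetical site host, for the offsite test

def crawl(url, links_of, seen=None, outline=None, depth=0):
    """Depth-first traversal; links_of(url) returns absolute URLs (assumed)."""
    if seen is None:
        seen, outline = set(), []
    seen.add(url)
    outline.append((depth, url))
    here = posixpath.dirname(urlparse(url).path)
    for link in links_of(url):
        parsed = urlparse(link)
        if parsed.netloc != SITE:      # skip offsite links
            continue
        if link in seen:               # skip links previously seen
            continue
        if not posixpath.dirname(parsed.path).startswith(here):
            continue                   # skip links above/outside this level
        crawl(link, links_of, seen, outline, depth + 1)
    return outline

# Tiny fake link graph for illustration.
links = {
    'http://www.example.org/index.html':
        ['http://www.example.org/gene/reg/index.html',
         'http://www.othersite.org/x.html'],          # offsite: skipped
    'http://www.example.org/gene/reg/index.html':
        ['http://www.example.org/index.html',          # above level: skipped
         'http://www.example.org/gene/reg/page.html'],
    'http://www.example.org/gene/reg/page.html': [],
}
outline = crawl('http://www.example.org/index.html', lambda u: links.get(u, []))
print(outline)
```

The returned outline is a list of (depth, url) pairs, one entry per page reached under these rules.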
The trouble is that the desired conceptual tree doesn't really exist. The collection of links (viewed as pointers from one file to another) does not produce a tree. The directory hierarchy is a tree, but it isn't quite the organization one wants, because it is a disk map, not a site map. So I took the middle road, using the links projected onto the disk hierarchy (or the other way around, if that makes more sense to you), the result being a true site map with a structure partially imposed by the disk hierarchy. It's really a cute scheme; perhaps I should patent it.
Anyway, given these circumstances, what is your vision for the future of the sitemap file? It appears that it cannot reliably be made in full detail programmatically. One could hand-edit it to make it right, but then this calls for continual maintenance whenever a new file is created in the future. But is anyone really ready to sign up for this job?
Actually the program I wrote is handling it pretty well. For the most part, people have been sensible about naming language variants of the same file, and the program handles the few exceptions to the rule. It reads hints about the exceptions from the sitemap file itself <!-- encoded as simple HTML comments --> and writes these hints back to its output in the same format. So one can hand-edit hints into the sitemap, and the hints stay in later generations of the sitemap as long as they make sense. (If the files to which the hints refer disappear, the hints go away.) There are about eight hints in the file as I have it currently. This isn't bad at all, especially since they don't have to be maintained any more unless new anomalies are created. Most of them are hints not to parse a file (like the site map itself, which otherwise would become the mother of all files).

In the latest version I've generated here, I exclude image files from the outline (standalone images were included before) and catch some files I had been missing before.

So my vision for this is that I make the program platform-independent (I took a couple of shortcuts in MacPerl) and ship it over to Rainer and Arthur. They install a cron task to run it once a week or so. Then someone should keep an eye on it once in a while to make sure that people are naming files reasonably, and if not, install hints in the sitemap file itself. If that doesn't work, I can always run the cron task here and e-mail the results, though this is less desirable in my view.

One side effect of this exercise is that I have learned that the FAQs listing /gene/faqs/FAQ.html is referenced only once on our server, in one of the SwissGen pages. Somehow this doesn't seem right. I may be able to find other such nearly forgotten pages. If so, I'll make you aware of it.

Of course, all this is really designed so that Rainer can't figure out what the blazes is going on.
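The hint round-trip can be sketched in Python. The exact hint syntax isn't given in the message, so the `sitemap-hint:` comment format below is an invented placeholder; the point is the mechanism: read hints from HTML comments, drop those whose files are gone, and write the survivors back in the same format:

```python
import re

# Hypothetical hint format: <!-- sitemap-hint: KEYWORD /path/to/file.html -->
HINT_RE = re.compile(r'<!--\s*sitemap-hint:\s*(\S+)\s+(\S+)\s*-->')

def surviving_hints(sitemap_html, existing_files):
    """Return (keyword, path) hints whose target file still exists."""
    hints = HINT_RE.findall(sitemap_html)
    return [(kw, path) for kw, path in hints if path in existing_files]

def emit_hints(hints):
    """Write hints back out in the same HTML-comment format."""
    return '\n'.join(f'<!-- sitemap-hint: {kw} {path} -->' for kw, path in hints)

sitemap = '''<!-- sitemap-hint: noparse /sitemap_m.html -->
<!-- sitemap-hint: noparse /gone.html -->'''
existing = {'/sitemap_m.html'}          # /gone.html has disappeared
hints = surviving_hints(sitemap, existing)
print(emit_hints(hints))
```

Because the output uses the same comment syntax the parser reads, a hand-edited hint survives every later regeneration until its target file disappears, which matches the behavior described above.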
:-) Who'da thunk that an Ami would be imposing _more_ order on a German system? (really big :-) -- =Jim Eggert EggertJ@LL.mit.edu
participants (2)
- Jim Eggert
- Richard Heli