My oh my, I never imagined that the program would be attempting such a level of intelligence. The filenames were never a part of the public interface and have often been chosen rather whimsically. Certainly there has been little, if any, attempt made to regularize them. Considering that, it's a wonder that this works at all, much less as well as it does.

I had concluded, naively as it turns out, that the program was using a simpler algorithm, something akin to this:

Traverse the tree top down, where a node is a link that has not been seen before, is not offsite, and is not above the current level. Repeat with the new node. If there are no more nodes at this level, pop up a level and continue with the next node. If at top level, quit.

Anyway, given these circumstances, what is your vision for the future of the sitemap file? It appears that it cannot reliably be made in full detail programmatically. One could hand-edit it to make it right, but that calls for continual maintenance whenever a new file is created. Is anyone really ready to sign up for this job?

Rick

At 10:06 PM 8/31/99 -0400, Jim Eggert wrote:
Rick wrote:
Hmmm, I wonder why it sometimes misses files. For example, it finds the German Rheinland-Pfalz history page, but not the English version:
The program merely looks for associated filenames to make the file pairings. It fails here because I didn't know about or anticipate filenames with language codes used as an infix:
/gene/reg/RHE-PFA/rhein-p-his.html    (E)
/gene/reg/RHE-PFA/rhein-p-d-his.html  (D)
                         ^^ -d in the middle of the filename
The E (English) file isn't found correctly because its name isn't obviously derivable (at least to the program) from the D filename. In fact, the E file isn't found at all because the D file doesn't link to it, and neither does the parent of the D file. (The crawler I wrote only parses one file from a language multiplet; in this case, the German-language parent was parsed, while the English one was not.)
I could make the crawler look for infixed language tags in the filenames. This gives it more chances for false associations, however, so I would prefer that the German file name be changed to a simpler

/gene/reg/RHE-PFA/rhein-p-his-d.html  (D)
Another case is (mea culpa)

/gene/reg/NSAC/schaumburg-lippe_adel.html      (D)
/gene/reg/NSAC/schaumburg-lippe_nobility.html  (E)

and a simple file rename (with accompanying link updates) will fix this too.
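To make the trade-off concrete, here is a rough Python sketch of the two pairing strategies. It is not the crawler's actual code. The function names, the set of language tags, and the vitamin-* filenames in the usage note are all made up for illustration, and it assumes that only the language tag differs between the files of a pair.

```python
import re
from collections import defaultdict

# Assumed language tags: "d" (German) and "e" (English).
LANG_CODES = ("d", "e")

def suffix_key(path):
    """Drop a trailing -d / -e tag that sits just before .html."""
    return re.sub(r"-(?:d|e)(\.html)$", r"\1", path)

def infix_key(path):
    """Drop a -d / -e tag anywhere after the first part of the filename."""
    directory, _, name = path.rpartition("/")
    stem, dot, ext = name.rpartition(".")
    parts = stem.split("-")
    kept = [parts[0]] + [p for p in parts[1:] if p not in LANG_CODES]
    return f"{directory}/{'-'.join(kept)}{dot}{ext}"

def group(paths, key):
    """Group paths whose keys match; each group is one language multiplet."""
    groups = defaultdict(list)
    for p in paths:
        groups[key(p)].append(p)
    return dict(groups)
```

With suffix_key, rhein-p-his.html and rhein-p-d-his.html stay in separate groups, while the renamed rhein-p-his-d.html pairs with the English file. With infix_key the existing names pair up, but so would two unrelated pages such as vitamin-d-facts.html and vitamin-facts.html, which is the kind of false association mentioned above. Neither key helps with the schaumburg-lippe_adel / schaumburg-lippe_nobility pair, since the names differ by a translated word rather than a tag.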
A worse problem, in my opinion, is presented by the sometimes poorly chosen page titles. This can only be cured by careful attention from the page authors.
--
=Jim Eggert
EggertJ@LL.mit.edu