Re: [webpages-l] sitemap_m.html
My oh my, I never imagined that the program would be attempting such a level of intelligence. The filenames were never a part of the public interface and have often been chosen rather whimsically. Certainly there has been little, if any, attempt made to regularize them. Considering that, it's a wonder that this works at all, much less as well as it does.

I had concluded, naively as it turns out, that the program was using a simpler algorithm, something akin to this: Traverse the tree top down, where a node is a link not previously seen, not a link offsite, and not a link above the current level. Repeat with the new node. If there are no more nodes at this level, pop up a level and continue with the next node. If at top level, quit.

Anyway, given these circumstances, what is your vision for the future of the sitemap file? It appears that it cannot reliably be made in full detail programmatically. One could hand-edit it to make it right, but that calls for continual maintenance whenever a new file is created in the future. And is anyone really ready to sign up for this job?

Rick

At 10:06 PM 8/31/99 -0400, Jim Eggert wrote:
Rick wrote:
Hmmm, I wonder why it sometimes misses files. For example, it finds the German Rheinland-Pfalz history page, but not the English version:
The program merely looks for associated filenames to make the file pairings. It fails here because I didn't know about or anticipate filenames with language codes used as an infix:
/gene/reg/RHE-PFA/rhein-p-his.html   (E)
/gene/reg/RHE-PFA/rhein-p-d-his.html (D)  <-- "-d" in the middle of the filename
The E (English) file isn't found correctly because its name isn't obviously derivable (at least to the program) from the D filename. In fact, the E file isn't found at all because the D file doesn't link to it, and neither does the parent of the D file. (The crawler I wrote only parses one file from a language multiplet; in this case, the German-language parent was parsed, while the English one was not.)
I could make the crawler look for infixed language tags in the filenames. This gives it more chances for false associations, however, so I would prefer that the German filename be changed to the simpler

/gene/reg/RHE-PFA/rhein-p-his-d.html (D)
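The suffix-based pairing can be sketched as follows. This is a Python illustration, not the program's actual MacPerl code; the function name and regex are assumptions, but they show why a trailing "-d" is paired while an infixed "-d" is missed:

```python
import re

def pair_language_variants(filenames):
    """Pair a German file 'name-d.html' with its English twin 'name.html'.

    Only a trailing '-d' before the extension is recognized, so a filename
    with an infixed language tag (e.g. 'rhein-p-d-his.html') is not paired.
    """
    names = set(filenames)
    pairs = []
    for name in filenames:
        m = re.match(r'^(.*)-d\.html$', name)
        if m:
            english = m.group(1) + '.html'
            if english in names:
                pairs.append((english, name))
    return pairs

files = [
    '/gene/reg/RHE-PFA/rhein-p-his.html',    # English
    '/gene/reg/RHE-PFA/rhein-p-d-his.html',  # German, infixed tag: missed
    '/gene/reg/RHE-PFA/rhein-p-his-d.html',  # German, suffixed tag: paired
]
pairs = pair_language_variants(files)
print(pairs)
```

Only the suffixed German file is matched to its English counterpart; the infixed one falls through, which is exactly the failure described above.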
Another case is (mea culpa)

/gene/reg/NSAC/schaumburg-lippe_adel.html     (D)
/gene/reg/NSAC/schaumburg-lippe_nobility.html (E)

and a simple file rename (with accompanying link updates) will fix this too.
A worse problem, in my opinion, is presented by the sometimes poorly-chosen page titles. This can only be cured by careful attention by the page authors.
-- =Jim Eggert EggertJ@LL.mit.edu
Rick wrote:
I had concluded, naively as it turns out, that the program was using a simpler algorithm, something akin to this:
Traverse the tree top down, where a node is a link not previously seen, not a link offsite, and not a link above the current level. Repeat with the new node. If there are no more nodes at this level, pop up a level and continue with the next node. If at top level, quit.
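The simpler depth-first scheme described above can be sketched like this. It is a Python sketch (the real crawler was MacPerl); the `links_of` helper and the site host are assumptions, and "not above the current level" is read strictly as "within the current directory subtree":

```python
from urllib.parse import urlparse
import posixpath

SITE = 'www.example.org'  # hypothetical site host, for the offsite test

def crawl(url, links_of, seen=None, outline=None, depth=0):
    """Depth-first traversal; links_of(url) returns absolute URLs (assumed)."""
    if seen is None:
        seen, outline = set(), []
    seen.add(url)
    outline.append((depth, url))
    here = posixpath.dirname(urlparse(url).path)
    for link in links_of(url):
        parsed = urlparse(link)
        if parsed.netloc != SITE:      # skip offsite links
            continue
        if link in seen:               # skip links previously seen
            continue
        if not posixpath.dirname(parsed.path).startswith(here):
            continue                   # skip links above/outside this level
        crawl(link, links_of, seen, outline, depth + 1)
    return outline

# Tiny fake link graph for illustration.
links = {
    'http://www.example.org/index.html':
        ['http://www.example.org/gene/reg/index.html',
         'http://www.othersite.org/x.html'],          # offsite: skipped
    'http://www.example.org/gene/reg/index.html':
        ['http://www.example.org/index.html',          # above level: skipped
         'http://www.example.org/gene/reg/page.html'],
    'http://www.example.org/gene/reg/page.html': [],
}
outline = crawl('http://www.example.org/index.html', lambda u: links.get(u, []))
print(outline)
```

The returned outline is a list of (depth, url) pairs, one entry per page reached under these rules.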
The trouble is that the desired conceptual tree doesn't really exist. The collection of links (viewed as pointers from one file to another) does not produce a tree. The directory hierarchy is a tree, but it isn't quite the organization one wants, because it is a disk map, not a site map. So I took the middle road, using the links projected onto the disk hierarchy (or the other way around, if that makes more sense to you), the result being a true site map with a structure partially imposed by the disk hierarchy. It's really a cute scheme; perhaps I should patent it.
Anyway, given these circumstances, what is your vision for the future of the sitemap file? It appears that it cannot reliably be made in full detail programmatically. One could hand-edit it to make it right, but then this calls for continual maintenance whenever a new file is created in the future. But is anyone really ready to sign up for this job?
Actually the program I wrote is handling it pretty well. For the most part, people have been sensible about naming language variants of the same file, and the program handles the few exceptions to the rule. It reads hints about the exceptions from the sitemap file itself <!-- encoded as simple HTML comments --> and writes these hints back to its output in the same format. So one can hand-edit hints into the sitemap, and the hints stay in later generations of the sitemap as long as they make sense. (If the files to which the hints refer disappear, the hints go away.) There are about eight hints in the file as I have it currently. This isn't bad at all, especially since they don't have to be maintained any more unless new anomalies are created. Most of them are hints not to parse a file (like the site map itself, which otherwise would become the mother of all files).

In the latest version I've generated here, I exclude image files from the outline (standalone images were included before) and catch some files I had been missing before.

So my vision for this is that I make the program platform-independent (I took a couple of shortcuts in MacPerl) and ship it over to Rainer and Arthur. They install a cron task to run it once a week or so. Then someone should keep an eye on it once in a while to make sure that people are naming files reasonably, and if not, install hints in the sitemap file itself. If that doesn't work, I can always run the cron task here and e-mail the results, though this is less desirable in my view.

One side effect of this exercise is that I have learned that the FAQs listing /gene/faqs/FAQ.html is referenced only once on our server, in one of the SwissGen pages. Somehow this doesn't seem right. I may be able to find other such nearly forgotten pages. If so, I'll make you aware of it.

Of course, all this is really designed so that Rainer can't figure out what the blazes is going on.
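The hint round-trip can be sketched in Python. The exact hint syntax isn't given in the message, so the `sitemap-hint:` comment format below is an invented placeholder; the point is the mechanism: read hints from HTML comments, drop those whose files are gone, and write the survivors back in the same format:

```python
import re

# Hypothetical hint format: <!-- sitemap-hint: KEYWORD /path/to/file.html -->
HINT_RE = re.compile(r'<!--\s*sitemap-hint:\s*(\S+)\s+(\S+)\s*-->')

def surviving_hints(sitemap_html, existing_files):
    """Return (keyword, path) hints whose target file still exists."""
    hints = HINT_RE.findall(sitemap_html)
    return [(kw, path) for kw, path in hints if path in existing_files]

def emit_hints(hints):
    """Write hints back out in the same HTML-comment format."""
    return '\n'.join(f'<!-- sitemap-hint: {kw} {path} -->' for kw, path in hints)

sitemap = '''<!-- sitemap-hint: noparse /sitemap_m.html -->
<!-- sitemap-hint: noparse /gone.html -->'''
existing = {'/sitemap_m.html'}          # /gone.html has disappeared
hints = surviving_hints(sitemap, existing)
print(emit_hints(hints))
```

Because the output uses the same comment syntax the parser reads, a hand-edited hint survives every later regeneration until its target file disappears, which matches the behavior described above.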
:-) Who'da thunk that an Ami would be imposing _more_ order on a German system? (really big :-) -- =Jim Eggert EggertJ@LL.mit.edu
participants (2)
- Jim Eggert
- Richard Heli