A Google spiders

In checking the log files for this site [1], I've noticed that Google [2] has finally found it and has spent the past few days spidering through it.

There are a few thousand links for it to follow (out of what? A million potential URLs (Uniform Resource Locators) on this site? I know the Electric King James [3] has over fifteen million URLs [4]). For instance, there are three just for the years, plus 12 for each year (okay, only 11 so far for this year, but close enough), so that's 39 URLs. Each day with an entry gets its own URL too, and while I may have skipped a day or two here and there, let's say there's an average of 300 per year, so that's over 900 there. And if you assume an average of two entries per day (remember, you can retrieve the entire day, or just a single entry), that's another 600 per year, or 1,800 more, so we're now up to nearly 3,000 URLs that Google has to crawl through (with lots of duplication).
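Just to check the math, here's a quick back-of-the-envelope tally (a sketch in Python; the per-year figures are the rough estimates above, not exact counts):

years   = 3
months  = years * 12        # 12 month views per archived year
days    = years * 300       # roughly 300 days with entries per year
entries = days * 2          # assume about two entries per day
print(years + months + days + entries)   # 2739, or "nearly 3,000" URLs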

robots.txt for bible.conman.org [5]

#-----------------------------
# Go away---we don't want you 
# to endlessly spider this
# site.
#-----------------------------

User-agent: *
Disallow: /
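Any well-behaved spider checks that file before fetching anything else. A minimal sketch of what that check looks like (using Python's urllib.robotparser; the user-agent name is just an example):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("http://bible.conman.org/robots.txt")
rp.read()     # fetch and parse the file above
print(rp.can_fetch("Googlebot", "http://bible.conman.org/"))   # False: everything is disallowed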

There's a reason I don't allow web robots/spiders access to the Electric King James [6]: it would take way too long to index the site (if, indeed, the spider in question were even aware of all the possible URLs), and my machine isn't all that powerful to begin with (it being a 33MHz 486 and all). But I feel there's a research problem lurking here that some enterprising Master's or Ph.D. candidate could tackle: how best to spider a site that allows multiple views per document.
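To get a feel for why the URL count explodes, consider a simplified model (an assumption for illustration, not the site's actual URL scheme): if any contiguous range of verses within a chapter can be requested, a single chapter of n verses already yields n(n+1)/2 distinct URLs, before counting ranges that span chapters or whole books.

def ranges_in_chapter(verses):
    # every contiguous verse range: n * (n + 1) / 2 of them
    return verses * (verses + 1) // 2

print(ranges_in_chapter(31))   # Genesis 1 has 31 verses: 496 views of one chapter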

[1] http://boston.conman.org/

[2] http://www.google.com/

[3] http://literature.conman.org/bible/

[4] /boston/2000/08/31.2

[5] http://bible.conman.org/

[6] http://literature.conman.org/bible/
