For the past week or so, I've been playing around with search [1] engine [2] optimizations [3] (that last link is so I know what not to do) and poring through the log files.
The last time I made a major search engine optimization to my site was four years ago [4], and the reason for that optimization was to get rid of the disturbing search requests [5] that were plaguing the log files (and my mind) at the time. It also had the added benefit of reducing the amount of “duplicate content” on my site. A search engine like Google [6] would skip indexing the monthly archives (as well as the front page) but would index the individual entries. The end result: no more disturbing search requests, and better results for people actually looking for stuff.
But it didn't reduce all the duplicate content. There was still the small problem of /2000/1/1.1 having the same content as /2000/01/01.1 (note the leading zeros). Technically, they are two separate pages, each with a unique URL (Uniform Resource Locator), although internally, the leading zero is ignored by my blogging engine [7] and it would happily serve up the page under either location.
Now, that particular duplicate content issue is something I've known about since I started writing mod_blog, and I had code to distinguish between the two requests, but never wrote the code to do anything about it. Until last week. Now, go to /2000/1/1.1 and you'll get a permanent redirect to /2000/01/01.1. This change should further reduce the amount of “duplicate content” on my site, as well as reduce the number of hits from web spiders indexing my site (although the redirection doesn't happen under one very specific condition; fixing that pretty much requires a complete overhaul of some very old code, but it's such a seldom-used bit of code that I'm not terribly worried about it).
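As a rough illustration of that redirect (not mod_blog's actual code, which isn't shown here), the idea can be sketched in a few lines of Python. The names ENTRY_PATH, canonical_path and respond are made up for this sketch, and the pattern assumes entry URLs always look like /year/month/day.entry:

```python
import re

# Hypothetical pattern for entry paths such as /2000/1/1.1
# (year/month/day.entry). This is an assumption for the sketch;
# the real mod_blog request parser is not shown in the post.
ENTRY_PATH = re.compile(r"^/(\d{4})/(\d{1,2})/(\d{1,2})\.(\d+)$")


def canonical_path(path):
    """Return the zero-padded form of an entry path, or None if it doesn't match."""
    m = ENTRY_PATH.match(path)
    if m is None:
        return None
    year, month, day, entry = m.groups()
    # Pad month and day to two digits; leave the entry number alone.
    return f"/{year}/{int(month):02d}/{int(day):02d}.{entry}"


def respond(path):
    """Return (status, location) for a request path.

    301 with the canonical URL if the path needs padding,
    200 if it is already canonical, 404 if it isn't an entry path.
    """
    canon = canonical_path(path)
    if canon is None:
        return (404, None)
    if canon != path:
        return (301, canon)  # permanent redirect, so spiders drop the old URL
    return (200, path)
```

Calling respond("/2000/1/1.1") yields a 301 pointing at /2000/01/01.1, while the padded URL is served directly. A permanent (301) redirect rather than a temporary one is what tells a search engine to keep only the canonical URL.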
I'm a bit concerned about the spiders because of some other information I've pulled out from the log files. My archive of log files (at least, of this blog [8]) goes back to October of 2001 [9] and using some homegrown tools, I generated (with the help of GNUPlot [10]) this graph of the growth of my site over the past six years:
[Graph of traffic growth at The Boston Diaries] [11]
In red, you see the number of raw hits to this site (with the scale along the left-hand side), with some explosive growth in early 2006 and again in just the last few months here. In green you see the actual bytes transferred (with its scale along the right-hand side): pretty steady up until January of 2006 when it goes vertical, and again it goes vertical in just the past few months.
And I'm at a loss to explain the sudden explosion of bandwidth usage on my site. Unless it's a lot of people hot linking [12] to images on this site (and yes, that does happen quite often), or a vast increase in the number of spiders indexing my site (and for the past few months, Yahoo's [13] Slurp [14] has been generating about 40,000 hits a month).
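One way to check either theory is to tally hits and bytes per month, and hits per user agent, straight from the logs. The following is only a sketch, not the homegrown tools mentioned above: it assumes the logs are in Apache's "combined" format (the post doesn't say), and the names LOG_LINE and tally are invented for the example.

```python
import re
from collections import defaultdict

# Assumes Apache "combined" log format; the post does not say which
# format the logs actually use. Lines whose quoted fields contain
# escaped quotes will not match and are skipped.
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[(\d{2})/(\w{3})/(\d{4}):[^\]]+\] '
    r'"[^"]*" \d{3} (\d+|-) "[^"]*" "([^"]*)"'
)


def tally(lines):
    """Summarize log lines.

    Returns (monthly, agents), where monthly maps 'YYYY-Mon' to
    [hits, bytes] and agents maps each user agent string to its hits.
    """
    monthly = defaultdict(lambda: [0, 0])
    agents = defaultdict(int)
    for line in lines:
        m = LOG_LINE.match(line)
        if m is None:
            continue  # not in the assumed format
        _day, month, year, size, agent = m.groups()
        bucket = monthly[f"{year}-{month}"]
        bucket[0] += 1                              # one more raw hit
        bucket[1] += 0 if size == "-" else int(size)  # "-" means no body sent
        agents[agent] += 1
    return monthly, agents
```

Feeding it an open log file (tally(open(path)), with path being wherever the logs live) gives the two series plotted in the graph, plus a per-agent count that would show whether a spider such as Slurp accounts for the growth. Hot linking would need the referrer field as well, which this sketch ignores.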
I may no longer have disturbing search requests, but I now have a disturbing use of bandwidth.
[1] http://seo-theory.com/wordpress/
[5] http://www.disturbingsearchrequests.com/
[7] https://boston.conman.org/about/
[8] https://boston.conman.org/
[11] /boston/2007/08/13/growth.png
[12] http://altlab.com/hotlinking.html