Indexing weblogs

I see that Nick Denton [1] is launching a new venture [2] that seems to be centered around marketing and weblog indexing, which brings up some thoughts about weblog indexing.

I've talked [3] about this a bit before, but if a dedicated search engine wants to successfully scan a weblog, there are a few ways to go about it.

One, grab the RSS (Rich Site Summary) [4] file for the weblog and index the links from that. That lets you populate the search engine with the permanent links for the entries, and it also lets you index each entry on its own. Google [5] does a good job of indexing pages, but a rather poor one of indexing individual entries of a weblog, since it generally views a page as one entity and not as a possible collection of entities. So if I mention, say, “hot dogs” on the first of the month, “wet paper towels” on the fifteenth and “ugly gargoyles at Notre Dame” on the last day of the month, someone looking for “hot wet gargoyles” at Google [6] is going to find the page that archives that month.

Which is probably not what either I or the searcher in question wants.

Well, unless I'm looking for disturbing search request [7] material, but I digress.
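
To make that first step concrete, here is a minimal sketch of pulling the permanent links out of an RSS file. It assumes a plain RSS 0.9x/2.0 style file with un-namespaced <item> and <link> elements; the function name and the sample feed are made up for illustration, and a real crawler would have to cope with other RSS flavors and malformed files.

```python
import xml.etree.ElementTree as ET


def permalinks_from_rss(rss_text):
    """Return the <link> of every <item> in the feed, in document order.

    Assumption: un-namespaced RSS (0.9x/2.0 style). RDF-based RSS 1.0
    puts its elements in a namespace and would need different lookups.
    """
    root = ET.fromstring(rss_text)
    links = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link:
            links.append(link.strip())
    return links


# Hypothetical feed, using the example URLs from this entry.
SAMPLE_RSS = """<rss version="0.91">
  <channel>
    <title>Example weblog</title>
    <link>http://www.example.net/</link>
    <item>
      <title>Hot dogs</title>
      <link>http://www.example.net/200204-index.html#31415926</link>
    </item>
    <item>
      <title>Gargoyles</title>
      <link>http://www.example.net/200204-index.html#27182818</link>
    </item>
  </channel>
</rss>"""

for url in permalinks_from_rss(SAMPLE_RSS):
    print(url)
```

Each URL that comes back is one entry to index, rather than one page.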

Even if the permanent links point to a portion of a page, the link would be something like

http://www.example.net/200204-index.html#31415926

Which points to a part of the page at

http://www.example.net/200204-index.html

And somewhere on that page is an anchor tag with the ID of “31415926”, which is most likely at the top of the entry in question. From there you index until you hit the next named anchor tag that matches another entry in the RSS file.
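
A rough sketch of that "index between anchors" idea, using only the Python standard library. It assumes the anchor is an <a> tag carrying the fragment in either a name or an id attribute; the class and function names are my own, the sample page is invented, and a real indexer would do far more than collect text.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag


class EntrySplitter(HTMLParser):
    """Collect the text that follows each anchor named in entry_ids."""

    def __init__(self, entry_ids):
        super().__init__()
        self.entry_ids = set(entry_ids)
        self.current = None   # id of the entry being collected, if any
        self.entries = {}     # entry id -> list of text chunks

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        anchor = attrs.get("name") or attrs.get("id")
        # Only switch entries at anchors the RSS file told us about.
        if anchor in self.entry_ids:
            self.current = anchor
            self.entries[anchor] = []

    def handle_data(self, data):
        if self.current is not None:
            self.entries[self.current].append(data)


def split_entries(page_html, permalinks):
    """Map each permalink fragment to the text of its entry."""
    ids = [urldefrag(link).fragment for link in permalinks]
    parser = EntrySplitter(i for i in ids if i)
    parser.feed(page_html)
    parser.close()
    # Join the chunks and collapse runs of whitespace.
    return {k: " ".join(" ".join(v).split()) for k, v in parser.entries.items()}


# Hypothetical archive page and the permalinks taken from its RSS file.
PAGE = """<html><body>
<h1>April 2002</h1>
<a name="31415926"></a><p>I had hot dogs for lunch.</p>
<a name="27182818"></a><p>Ugly gargoyles at Notre Dame.</p>
</body></html>"""

LINKS = [
    "http://www.example.net/200204-index.html#31415926",
    "http://www.example.net/200204-index.html#27182818",
]

for entry_id, text in split_entries(PAGE, LINKS).items():
    print(entry_id, "->", text)
```

Anything before the first matching anchor (the page heading here) is skipped, so each entry gets indexed under its own permanent link instead of under the whole month.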

And if you hit a site like mine, the RSS [8] file will have links that bring up individual pages for each entry.

Now, you might still have to contend with a weblog that doesn't have an RSS file, but then you could just fall back to indexing between named anchor points anyway and use heuristics to figure out what may be the permanent links to index under.
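
What those heuristics look like is anybody's guess, but here is one possible sketch: treat a named anchor as a likely permanent link only if some link on the same page points back at it, on the assumption that weblog software tends to emit a "permalink" link next to each entry. The heuristic, the names and the sample page are all my own invention, not a description of how any real search engine does it.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class AnchorCollector(HTMLParser):
    """Record named anchors and the fragments that same-page links target."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.names = []        # named anchors, in page order
        self.targets = set()   # fragments of this page that an href points at

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        name = attrs.get("name") or attrs.get("id")
        if name:
            self.names.append(name)
        href = attrs.get("href")
        if href:
            url, fragment = urldefrag(urljoin(self.page_url, href))
            # Ignore links that lead to some other page.
            if fragment and url == self.page_url:
                self.targets.add(fragment)


def guess_permalinks(page_url, page_html):
    """Guess permanent links: named anchors that the page itself links to."""
    collector = AnchorCollector(page_url)
    collector.feed(page_html)
    collector.close()
    return [
        urljoin(page_url, "#" + name)
        for name in collector.names
        if name in collector.targets
    ]


# Hypothetical archive page with no RSS file to go with it.
PAGE = """<a name="top"></a>
<a name="31415926"></a><p>Hot dogs.</p>
<a href="#31415926">permalink</a>
<a name="27182818"></a><p>Gargoyles.</p>
<a href="200204-index.html#27182818">permalink</a>"""

for url in guess_permalinks("http://www.example.net/200204-index.html", PAGE):
    print(url)
```

The "top" anchor is dropped because nothing links to it, which is the sort of thing you would want from a heuristic like this, though it will certainly guess wrong on some sites.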

I'm sure that people looking for “hot wet gargoyles” will thank you.

[1] http://www.nickdenton.org/

[2] http://www.nickdenton.org/newventure.htm

[3] /boston/2002/03/28.1

[4] http://www.webreference.com/xml/column13/

[5] http://www.google.com/

[6] http://www.google.com/search?hl=en&q=hot+wet+gargoyles

[7] http://searchrequests.weblogs.com/

[8] https://boston.conman.org/bostondiaries.rss
