2007-10-11 Incremental Indexing

I’m sure there’s a better word for it. What I’ve implemented for Oddmuse’s Indexed Search module is the following:

Indexed Search

Until now, indexed search only knew about pages that had been indexed, and indexing is a lengthy process that goes through all pages. That’s why searches never returned pages written in the last few hours: they had not been indexed yet. The new revision of the module adds a second index: whenever a page is saved, it reindexes all the pages changed since the last big reindexing.
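Picking the pages that belong in the small index could look roughly like this. It is a minimal Python sketch, not the module's actual Perl code; the `Page` record, its `modified` timestamp and the function name `pages_changed_since` are all assumptions made for the illustration.

```python
from dataclasses import dataclass


@dataclass
class Page:
    name: str
    text: str
    modified: float  # hypothetical field: last-save time, seconds since the epoch


def pages_changed_since(pages, last_full_reindex):
    """Return the pages saved after the last full reindexing run."""
    return [p for p in pages if p.modified > last_full_reindex]


pages = [
    Page("Diary", "old entry", modified=100.0),
    Page("2007-10-13 Skills", "new entry", modified=250.0),
]
# Only the page saved after the full reindex at t=200 goes into the small index.
new_pages = pages_changed_since(pages, last_full_reindex=200.0)
```

The list grows with every save until the next full reindexing, which is why the cost of reindexing on each save depends on how often the big run happens.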

I was surprised to learn that incremental updating of a full-text index is not “state of the art”. I googled around for a lightweight full-text indexing library and found Sphinx.

Sphinx

There’s a frequent situation when the total dataset is too big to be reindexed from scratch often, but the amount of new records is rather small. Example: a forum with a 1,000,000 archived posts, but only 1,000 new posts per day. In this case, “live” (almost real time) index updates could be implemented using so called “main+delta” scheme. The idea is to set up two sources and two indexes, with one “main” index for the data which only changes rarely (if ever), and one “delta” for the new documents. In the example above, 1,000,000 archived posts would go to the main index, and newly inserted 1,000 posts/day would go to the delta index. Delta index could then be reindexed very frequently, and the documents can be made available to search in a matter of minutes. – Sphinx manual, 3.8. Live index updates

3.8. Live index updates

Same problem, same solution. Since I already know Search::FreeText, I decided to implement the same strategy with it.

Search::FreeText
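To make the two-index idea concrete, here is a rough Python sketch of a "main" index that is rebuilt rarely and a "delta" index that is rebuilt on every save. It does not use Search::FreeText or Sphinx; the toy inverted index, the class `MainDeltaIndex` and its method names are assumptions for the example. Hiding main-index hits for pages that have changed since the full reindexing is also a choice made for this sketch, not something the module is known to do.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Build a toy inverted index: word -> set of page names."""
    index = defaultdict(set)
    for name, text in pages.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(name)
    return index


class MainDeltaIndex:
    def __init__(self):
        self.main = {}     # rebuilt rarely, covers every page at that time
        self.delta = {}    # rebuilt on every save, covers changed pages only
        self.changed = {}  # name -> text of pages saved since the full reindex

    def full_reindex(self, pages):
        """The slow run over all pages; afterwards the delta is empty."""
        self.main = build_index(pages)
        self.changed = {}
        self.delta = {}

    def save(self, name, text):
        """Record the saved page and rebuild the small delta index."""
        self.changed[name] = text
        self.delta = build_index(self.changed)

    def search(self, word):
        word = word.lower()
        # Main-index entries for changed pages may be stale, so the delta wins.
        stale = set(self.changed)
        hits = (self.main.get(word, set()) - stale) | self.delta.get(word, set())
        return sorted(hits)


idx = MainDeltaIndex()
idx.full_reindex({"Diary": "wiki diary", "Old": "indexed search"})
idx.save("New Page", "incremental search")  # new page, found right away
idx.save("Old", "rewritten text")           # old page, stale main entry hidden
```

After these two saves, a search for "search" finds only "New Page": the new page is picked up from the delta, and the outdated main-index entry for "Old" is filtered out.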

For now I have it installed for this site only. There’s still some debugging to do! 😄

And of course I’m curious to see how fast this reindexing after every save is going to be. That might turn out to be a killer… Currently I’ve configured the debug action to show which pages Oddmuse considers to be “new” and will therefore reindex on every edit…

​#Oddmuse

Comments

(Please contact me if you want to remove your comment.)

It seems to work. I’ve just created 2007-10-13 Skills which has been tagged with the RPG tag, and it’s not listed on the Diary front page (which has a journal that excludes all RPG pages). Great!

2007-10-13 Skills

RPG

Diary

– Alex Schroeder 2007-10-14 00:14 UTC

Alex Schroeder