2006-10-08 Full Text Index for Oddmuse

I’m wondering whether I should attempt to write a full-text index for Oddmuse. Something lean and mean. Sure, it’s reinventing the wheel, and there’s Lucene and other tools, and the current Search::FreeText is not too bad (although I’m overriding lots of stuff, so technically its dependencies are wasted CPU cycles and RAM). Maybe just gut Search::FreeText? Or roll my own? BM25 doesn’t look too difficult to reimplement, if I look at the Search::FreeText source. And put everything in a Berkeley DB file. GNU dbm looks perfect if all I have is one or two indexes: it implements a filesystem-based hash table.

Lucene

BM25
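
To get a feeling for how little code BM25 needs, here is a minimal sketch in Python (not Perl, and not taken from the Search::FreeText source). It uses the common textbook form of the formula with a smoothed, non-negative IDF; the defaults `k1=1.2` and `b=0.75` are conventional values, and the three tiny documents are made up for illustration.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score every document in docs (dict of id -> token list) for the query.

    Assumes docs is not empty and at least one document has tokens.
    """
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))
    scores = {}
    for doc_id, tokens in docs.items():
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            tf = term_freq[term]
            if tf == 0:
                continue
            df = doc_freq[term]
            # Smoothed IDF variant that never goes negative.
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
            # Term frequency saturation with document length normalization.
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores[doc_id] = score
    return scores

# Hypothetical mini corpus, already tokenized.
docs = {
    "a": ["wiki", "search", "index"],
    "b": ["wiki", "page"],
    "c": ["search", "search", "engine"],
}
scores = bm25_scores(["search"], docs)
for doc_id in sorted(scores, key=scores.get, reverse=True):
    print(doc_id, round(scores[doc_id], 3))
```

Document "c" mentions the query term twice and ranks above "a", while "b" does not contain it and scores zero.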

Assume two documents: a document called “Calvin’s Favorite” with content “Weirdoes from another planet” and a document called “Hobbes’ Favorite” with content “Humans from planet Earth”.

First, we assign numbers to documents:

Calvin’s Favorite → 1, Hobbes’ Favorite → 2

Then we tokenize the downcased titles and pages (we want to index titles, too) without using stop words (multilingual stop words?) and create the necessary pointers:

calvin → 1, s → 1, favorite → 1 & 2, weirdoes → 1, from → 1 & 2, another → 1, planet → 1 & 2, hobbes → 2, humans → 2, earth → 2
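
The steps so far can be sketched in a few lines of Python (Oddmuse itself is Perl, so this is only an illustration). The names `PAGES`, `tokenize` and `build_index` are made up for the example, and the tokenizer assumes ASCII letters and digits only, which sidesteps the encoding question entirely.

```python
import re
from collections import defaultdict

# The two example pages, keyed by document number: (title, content).
PAGES = {
    1: ("Calvin's Favorite", "Weirdoes from another planet"),
    2: ("Hobbes' Favorite", "Humans from planet Earth"),
}

def tokenize(text):
    """Downcase and split on anything that is not an ASCII letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Map each token to the set of document numbers containing it."""
    index = defaultdict(set)
    for doc_id, (title, body) in pages.items():
        # Titles are indexed together with the page content; no stop words.
        for token in tokenize(title) + tokenize(body):
            index[token].add(doc_id)
    return index

index = build_index(PAGES)
for token in sorted(index):
    print(token, "→", sorted(index[token]))
```

Splitting on the apostrophe is what produces the stray token "s" for document 1.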

Then all I would need is a good implementation of set operations to implement unions (“or”) and intersections (“and”) on the lists returned:

Looking for calvin and earth: (1) ∩ (2) = ∅

Looking for calvin or earth: (1) ∪ (2) = (1, 2)

Looking for hobbes and earth: (2) ∩ (2) = (2)
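
In a language with built-in sets, these lookups are nearly one-liners. Here is a small Python sketch using the postings from the token list above; `INDEX`, `search_and` and `search_or` are names invented for the example.

```python
# Postings copied from the token list in the text.
INDEX = {
    "calvin": {1}, "s": {1}, "favorite": {1, 2}, "weirdoes": {1},
    "from": {1, 2}, "another": {1}, "planet": {1, 2},
    "hobbes": {2}, "humans": {2}, "earth": {2},
}

def search_and(index, *terms):
    """Documents containing every term (intersection of the postings)."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()

def search_or(index, *terms):
    """Documents containing at least one term (union of the postings)."""
    return set().union(*(index.get(term, set()) for term in terms))

print(search_and(INDEX, "calvin", "earth"))
print(search_or(INDEX, "calvin", "earth"))
print(search_and(INDEX, "hobbes", "earth"))
```

An unknown term simply contributes an empty set, so an "and" query with a term that is not in the index returns nothing.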

Maybe I could optimize later and use bitvectors... I wonder where I would read up on this kind of thing...
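
As a rough idea of what the bitvector variant might look like: each posting list becomes one integer in which bit n is set if document n matches, so “and” and “or” turn into the bitwise operators. This Python sketch leans on Python's arbitrary-size integers; `to_bits` and `from_bits` are hypothetical helper names.

```python
def to_bits(doc_ids):
    """Pack document numbers into an int: bit n is set if document n matches."""
    bits = 0
    for doc_id in doc_ids:
        bits |= 1 << doc_id
    return bits

def from_bits(bits):
    """Unpack an int bit vector into a sorted list of document numbers."""
    doc_ids = []
    doc_id = 0
    while bits:
        if bits & 1:
            doc_ids.append(doc_id)
        bits >>= 1
        doc_id += 1
    return doc_ids

calvin = to_bits({1})
earth = to_bits({2})
hobbes = to_bits({2})

print(from_bits(calvin & earth))   # calvin and earth
print(from_bits(calvin | earth))   # calvin or earth
print(from_bits(hobbes & earth))   # hobbes and earth
```

Whether this beats plain sets depends on how sparse the postings are: a rare word in a big wiki still costs one bit per page.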

Anyway, it doesn’t look impossible. I just wonder whether this is going to be time well spent. I could spend time thinking about incremental updates of the database, and so on. Or just use a small database. Gah! 😄
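
To make the incremental update question concrete, here is one possible sketch in Python. It uses the standard library's `dbm.dumb` as a stand-in for GNU dbm or Berkeley DB, and the key layout (`tok:` for postings, `doc:` for the tokens last seen on a page) is an assumption made up for this example. Keeping the per-page token list is what allows an edit to touch only the postings that actually changed.

```python
import dbm.dumb
import os
import tempfile

def _load(db, key):
    """Read a space-separated list stored under key, or an empty set."""
    return set(db[key].decode().split()) if key in db else set()

def _store(db, key, items):
    """Write a set as a space-separated list; delete the key when empty."""
    if items:
        db[key] = " ".join(sorted(items)).encode()
    elif key in db:
        del db[key]

def update_page(db, doc_id, tokens):
    """Re-index one page, touching only the postings that changed."""
    doc_key = b"doc:" + doc_id.encode()
    old = _load(db, doc_key)
    new = set(tokens)
    for token in old - new:  # token vanished from the page
        key = b"tok:" + token.encode()
        _store(db, key, _load(db, key) - {doc_id})
    for token in new - old:  # token is new on the page
        key = b"tok:" + token.encode()
        _store(db, key, _load(db, key) | {doc_id})
    _store(db, doc_key, new)

def lookup(db, token):
    """Return the set of document ids (as strings) containing the token."""
    return _load(db, b"tok:" + token.encode())

with tempfile.TemporaryDirectory() as tmp:
    with dbm.dumb.open(os.path.join(tmp, "index"), "c") as db:
        update_page(db, "1", ["weirdoes", "from", "another", "planet"])
        update_page(db, "2", ["humans", "from", "planet", "earth"])
        # Page 1 gets edited: two words disappear, one appears.
        update_page(db, "1", ["weirdoes", "from", "mars"])
        print(sorted(lookup(db, "planet")))
        print(sorted(lookup(db, "another")))
        print(sorted(lookup(db, "mars")))
```

The price is storing every page's token set a second time, which may or may not count as lean and mean.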

There’s a Perl glue library to the C implementation of Lucene. There’s a problem, however: “Currently only ISO 8859-1 (Latin-1) characters are supported. Obviously this included all ASCII characters.” ¹

Hm. I wonder what that means, since Oddmuse continues to be encoding agnostic (it’s all bytes as far as I am concerned). Perhaps it’s not a problem after all.

I also found Plucene, a “Perl port of the Lucene search engine”.

Plucene

I’ve also skimmed two introductory articles for Lucene from ONJava, linked to from one of the other pages:

Introduction to Text Indexing with Apache Jakarta Lucene

Advanced Text Indexing with Lucene

Lucene seems like overkill!

And how to index CJK (Chinese, Japanese, Korean) languages remains an unsolved riddle. I need to talk to a Chinese Oddmuse user!

#Oddmuse

Comments

(Please contact me if you want to remove your comment.)

⁂

My understanding is that the largest full-text databases in the world don’t use RDBMSs or SQL. So apparently, if you want to scale up, the best choice is to avoid these technologies.

– AaronHawley 2006-10-08 19:35 UTC

AaronHawley

---

I’m relieved to hear that. 😄

– Alex Schroeder 2006-10-08 19:38 UTC

Alex Schroeder