I’m wondering whether I should attempt to write a full-text index for Oddmuse. Something lean and mean. Sure, it’s reinventing the wheel, and there’s Lucene and other tools, and the current Search::FreeText is not too bad (although I’m overriding lots of stuff, so technically its dependencies are wasted CPU cycles and RAM). Maybe just gut Search::FreeText? Or roll my own? BM25 doesn’t look too difficult to reimplement, judging by the Search::FreeText source. And put everything in a Berkeley DB file. GNU dbm looks perfect if all I have is one or two indexes: it implements a filesystem-based hash table.
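To get a feeling for how small a BM25 reimplementation could be, here is a rough Python sketch of the textbook Okapi BM25 formula. It is not taken from Search::FreeText: the function name `bm25_score`, the parameter defaults `k1=1.2` and `b=0.75`, and the non-negative IDF variant are all assumptions made for illustration, and it expects a non-empty collection.

```python
import math

def bm25_score(query_terms, doc_tokens, all_docs, k1=1.2, b=0.75):
    """Score one tokenized document against a query (textbook BM25).

    all_docs is a non-empty list of token lists; k1 and b are the usual
    tuning parameters, set here to commonly used (assumed) defaults.
    """
    n_docs = len(all_docs)
    avg_len = sum(len(d) for d in all_docs) / n_docs
    score = 0.0
    for term in query_terms:
        # Document frequency: number of documents containing the term.
        df = sum(1 for d in all_docs if term in d)
        if df == 0:
            continue
        # IDF variant that never goes negative.
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        # Term frequency within this document.
        tf = doc_tokens.count(term)
        # Length normalization: longer documents are penalized.
        norm = tf + k1 * (1 - b + b * len(doc_tokens) / avg_len)
        score += idf * tf * (k1 + 1) / norm
    return score

docs = [
    ["weirdoes", "from", "another", "planet"],
    ["humans", "from", "planet", "earth"],
]
earth_scores = [bm25_score(["earth"], d, docs) for d in docs]
```

A document that does not contain any query term scores zero, so ranking only has to look at the documents the index returns.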
Assume two documents: one called “Calvin’s Favorite” with the content “Weirdoes from another planet”, and one called “Hobbes’ Favorite” with the content “Humans from planet Earth”.
First, we assign numbers to documents:
Calvin’s Favorite → 1, Hobbes’ Favorite → 2
Then we tokenize the downcased titles and pages (we want to index titles, too) without using stop words (multilingual stop words?) and create the necessary pointers:
calvin → 1, s → 1, favorite → 1 & 2, weirdoes → 1, from → 1 & 2, another → 1, planet → 1 & 2, hobbes → 2, humans → 2, earth → 2
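The numbering and tokenizing steps above can be sketched in a few lines of Python. This is only an illustration under some assumptions: the helper names `tokenize` and `build_index` are made up, tokens are runs of ASCII letters (which is why the possessive “s” becomes a token of its own), and the second title is spelled so that it tokenizes to “hobbes” as in the list above.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Downcase and split on anything that is not an ASCII letter."""
    return re.findall(r"[a-z]+", text.lower())

def build_index(documents):
    """Number the documents and map each token to a set of document numbers."""
    doc_ids = {}
    index = defaultdict(set)
    for number, (title, body) in enumerate(documents.items(), start=1):
        doc_ids[title] = number
        # Titles are indexed together with the page content.
        for token in tokenize(title + " " + body):
            index[token].add(number)
    return doc_ids, dict(index)

documents = {
    "Calvin’s Favorite": "Weirdoes from another planet",
    "Hobbes’ Favorite": "Humans from planet Earth",
}
doc_ids, index = build_index(documents)
```

No stop words are removed here, so common words such as “from” end up in the index like everything else.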
Then all I would need is a good implementation of set operations to implement unions (“or”) and intersections (“and”) on the lists returned:
Looking for calvin and earth: {1} ∩ {2} = ∅
Looking for calvin or earth: {1} ∪ {2} = {1, 2}
Looking for hobbes and earth: {2} ∩ {2} = {2}
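With the posting lists held as sets, the three lookups above map directly onto built-in set operations. A minimal Python sketch, assuming a hand-written `index` dictionary and the hypothetical helpers `search_and` and `search_or`:

```python
# A few posting lists from the example, written out by hand.
index = {
    "calvin": {1},
    "hobbes": {2},
    "earth": {2},
    "planet": {1, 2},
}

def search_and(index, *terms):
    """Documents containing every term: intersection of the posting lists."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()

def search_or(index, *terms):
    """Documents containing at least one term: union of the posting lists."""
    return set().union(*(index.get(term, set()) for term in terms))

calvin_and_earth = search_and(index, "calvin", "earth")
calvin_or_earth = search_or(index, "calvin", "earth")
hobbes_and_earth = search_and(index, "hobbes", "earth")
```

An unknown term simply contributes an empty set, so an “and” query that includes it returns nothing.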
Maybe I could optimize later and use bitvectors... I wonder where I would read up on this kind of thing...
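As a rough idea of what the bitvector variant might look like: in Python an arbitrary-precision integer can stand in for a bitvector, with bit n set when document n contains the term. The helper names `to_bitvector` and `from_bitvector` are made up for this sketch.

```python
def to_bitvector(doc_numbers):
    """Set bit n for every document number n in the posting list."""
    bits = 0
    for n in doc_numbers:
        bits |= 1 << n
    return bits

def from_bitvector(bits):
    """Return the document numbers whose bits are set, in ascending order."""
    return [n for n in range(bits.bit_length()) if (bits >> n) & 1]

calvin = to_bitvector({1})
earth = to_bitvector({2})
hobbes = to_bitvector({2})

and_result = from_bitvector(calvin & earth)   # intersection is bitwise AND
or_result = from_bitvector(calvin | earth)    # union is bitwise OR
both_two = from_bitvector(hobbes & earth)
```

Whether this beats plain sets depends on how dense the posting lists are; for rare words in a large wiki most bits would be zero.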
Anyway, it doesn’t look impossible. I just wonder whether this is going to be time well spent. I could spend time thinking about incremental updates of the database, and so on. Or just use a small database. Gah! 😄
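For the “small database” route, an incremental update boils down to reading one posting list, adding a document number, and writing it back. Here is a hedged Python sketch using the standard `dbm` module, which picks whatever dbm flavor is installed (not necessarily GNU dbm). The comma-separated storage format and the helpers `add_posting` and `postings` are assumptions for illustration only.

```python
import dbm
import os
import tempfile

def add_posting(db, token, doc_number):
    """Add doc_number to the posting list stored under token."""
    key = token.encode("utf-8")
    existing = db[key].decode("ascii").split(",") if key in db else []
    numbers = sorted(set(map(int, existing)) | {doc_number})
    # Posting lists are stored as comma-separated ASCII digits.
    db[key] = ",".join(map(str, numbers)).encode("ascii")

def postings(db, token):
    """Return the set of document numbers stored under token."""
    key = token.encode("utf-8")
    if key not in db:
        return set()
    return set(map(int, db[key].decode("ascii").split(",")))

# Use a throwaway directory so the sketch does not touch real data.
path = os.path.join(tempfile.mkdtemp(), "index")
with dbm.open(path, "c") as db:
    add_posting(db, "planet", 1)
    add_posting(db, "planet", 2)   # incremental update of an existing key
    add_posting(db, "earth", 2)
    planet_docs = postings(db, "planet")
    missing_docs = postings(db, "calvin")
```

Removing a page would need the reverse operation for every token on the old revision, which is where incremental updates start to take some thought.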
There’s a Perl glue library to the C implementation of Lucene. There’s a problem, however: “Currently only ISO 8859-1 (Latin-1) characters are supported. Obviously this included all ASCII characters.” ¹
Hm. I wonder what that means, since Oddmuse continues to be encoding agnostic (it’s all bytes as far as I am concerned). Perhaps it’s not a problem after all.
I also found Plucene, a “Perl port of the Lucene search engine”.
I’ve also skimmed two introductory articles on Lucene from ONJava, linked from one of the other pages:
Introduction to Text Indexing with Apache Jakarta Lucene
Advanced Text Indexing with Lucene
Lucene seems like overkill!
And how to index CJK (Chinese, Japanese, Korean) languages remains an unsolved riddle. I need to talk to a Chinese Oddmuse user!
#Oddmuse
(Please contact me if you want to remove your comment.)
⁂
My understanding is that the largest full-text databases in the world don’t use RDBMSs or SQL. So apparently, if you want to scale up, the best choice is to avoid these technologies.
– AaronHawley 2006-10-08 19:35 UTC
---
I’m relieved to hear that. 😄
– Alex Schroeder 2006-10-08 19:38 UTC