💾 Archived View for gemini.omarpolo.com › post › fts-experiments.gmi captured on 2023-12-28 at 15:30:12. Gemini links have been rewritten to link to archived content
⬅️ Previous capture (2023-01-29)
-=-=-=-=-=-=-
full text search ain't black magic
Written while listening to “Weird Fishes / Arpeggi” by Radiohead.
Published: 2022-04-14
Tagged with:
I'm interested in how full text search (FTS from here onwards) works. FTS is the action of searching a collection of documents for some arbitrary text. I've been using elasticsearch and I remember playing a bit with solr, but nothing beats trying to write something from scratch to understand how things works.
Here I'm doing a walkthrough of what I wrote, which is incredibly naive of course, and explain what choices I made and what I think could be improved, and which parts can be expanded.
To keep it simple I decided to index only a small amount of "documents" — the OpenBSD package collection and the abstract of all the English Wikipedia pages — into a custom database. Later, this database will be used by another utility to do the queries. For simplicity, the database is read-only and needs to be recomputed when the source documents change.
The code is available on codeberg or in the github mirror
https://codeberg.org/op/ftsearch
https://github.com/omar-polo/ftsearch
It was inspired by a previous post I read:
Let's build a Full-Text Search engine
The main thing that I've learned is that FTS is no black magic, of course, but there are so many things you can customize when building a system that it may feel so.
The general idea is to index the text somehow and then using one or more indexes to find the queried text.
What most engines do — or at least I think they do — is to first split the text into tokens (words), then do some transformations on these and finally inserting them into an index where they map in which documents each word appear. Later, most of this procedure is done to the user query too and the index is consulted to find the documents. Or at least this is the abstract idea. In practice bigger engines probably have more than one index.
Some of the common transformations done after the tokenizer are:
Not all of these steps needs to be done in a real engine, and more can be added. I'm limiting to lowercasing only.
The stop words removal is just a workaround for huge indexes and limits the kind of queries a user can do (imagine doing a search for "to be or not to be" or "The Who").
The stemming is an interesting step thought. It's used to map variations of the same word to the same sequence of character, not necessarly a valid word. For example, the Porter algorithm maps argue, argued, argues and arguing to "argu". This way, documents containing only "argued" will match for "arguing" too.
Decent engines will also rank the documents. Ranking means guessing how much a matching documents is relevant to the user query and sorting the results accordingly. Spoiler: mine isn't a decent engine yet so ranking won't be covered here, but hopefully I will manage to in a follow-up post.
After a document has been processed is inserted in some kind of database and then queried somehow.
#define WDELIMS "..." /* everything but a-zA-Z */ char ** tokenize(const char *s) { char *d, *dup, *t, **tok = NULL; void *newtok; size_t cap = 0, len = 0, newcap; if ((dup = strdup(s)) == NULL) return NULL; d = dup; for (t = d; *t; ++t) *t = tolower(*t); while ((t = strsep(&d, WDELIMS)) != NULL) { if (*t == '\0') continue; /* keep the space for a NULL terminator */ if (len+1 >= cap) { newcap = cap * 1.5; if (newcap == 0) newcap = 8; newtok = recallocarray(tok, cap, newcap, sizeof(char *)); if (newtok == NULL) goto err; tok = newtok; cap = newcap; } if ((tok[len++] = strdup(t)) == NULL) goto err; } free(dup); if (tok == NULL) return calloc(1, sizeof(char *)); return tok; err: freetoks(tok); free(dup); return NULL; }
My tokenizer is dead easy: it lowercases the text and splits it into words delimited by anything but a-zA-Z. Thus, the sentence "X11 Window Manager" becomes "x", "window", "manager". As a bonus, assuming the text is UTF-8, other European languages works out of the box.
While the tokenizing step may seems simple, it really isn't generally speaking. Programmatically tokenizing languages like Chinese or Japanese is a real challenge!
My indexer builds an in-memory dictionary of words and documents they appear, which is then used to generate the database.
The design of it is really simple:
struct dict_entry { char *word; int *ids; size_t len; size_t cap; }; struct dictionary { size_t len; size_t cap; struct dict_entry *entries; };
A dictionary is just a sorted array of "dictionary entries": a word and the list of document ids that contains that word.
Inserting a word into the dictionary is straightforward: use binary search to locate the word in the dictionary; if it's already present then just push the document id into its list, otherwise the word needs to be inserted in the dictionary.
Once every document has been processed and its word added to the dictionary, the processing is done. I'm saving all the informations gathered during the previous steps in a custom database to avoid having to re-index all the documents for every search.
I'm not sure how a real FTS engine like Lucene manages the storage on disk, but I went with a really simple approach: a single binary file. At the start there is the index: a sequence of 32 bytes for the word and then the offset at which the list of documents is stored. This allows to just memory-map the database and do a binary search on the index.
Then there are the lists of documents id, one for each word. Each list has a variable length, that's why it's not stored in the main index.
At the end of the file there is another table of document names and descriptions. It's included so that the querying program can just look there to generate nice-looking results listing.
,______.________. | word | offset | | word | offset | | ... | ... | `------'--------' [ids for document #1] [ids for document #2] [ids for document #3] [ids for document #4] ... [document #1 name and description] [document #2 name and description] ...
Finding the documents that match "file manager" then is just a matter of looking up the list of documents that contains "file" and intersecting it with the list of documents for "manager".
This is not the only kind of query that is possible, phrase queries and more general boolean queries are interesting, but for the moment I'm limiting to these simple AND queries.
For the recond, here's the whole `fts' routine:
struct doclist { uint32_t *ids; size_t len; }; int fts(struct db *db, const char *query, db_hit_cb cb, void *data) { struct doclist *xs = NULL; size_t i, len; char **toks, **t; int ret = 0; if ((toks = tokenize(query)) == NULL) return -1; len = 0; for (t = toks; *t != NULL; ++t) len++; if (len == 0) goto done; if ((xs = calloc(len, sizeof(*xs))) == NULL) { freetoks(toks); return -1; } for (i = 0; i < len; ++i) { xs[i].ids = db_word_docs(db, toks[i], &xs[i].len); if (xs[i].ids == NULL || xs[i].len == 0) goto done; } for (;;) { struct db_entry e; uint32_t mdoc; mdoc = xs[0].ids[0]; for (i = 1; i < len; ++i) { if (xs[i].ids[0] > mdoc) goto next; while (xs[i].ids[0] < mdoc) { if (--xs[i].len == 0) goto done; xs[i].ids++; } if (xs[i].ids[0] != mdoc) goto next; } if (db_doc_by_id(db, mdoc, &e) == -1) { ret = -1; goto done; } if (cb(db, &e, data) == -1) { ret = -1; goto done; } next: if (--xs[0].len == 0) goto done; xs[0].ids++; } done: free(xs); freetoks(toks); return ret; }
The (only?) interesting part maybe it's the inner loop. It may smells like black magic, but it's a neat solution to intersecting a variable number of sorted numeric arrays that doesn't use any additional memory. The idea is to pick the first element from the first array and then go thru the other arrays:
This works because the list of documents id is guaranteed to be already sorted.
A possible improvements would be to sort the words for frequency so that the less common one is first: it may reduce the number of times the loop runs.
The repository includes two command line utilities:
Even if the whole implementation is really naive, the results aren't that bad:
% ls -lah db.wiki -rw-r--r-- 1 op op 92.9M Apr 14 11:19 db.wiki % time ./obj/ftsearch -d db.wiki 'small wild cat' https://en.wikipedia.org/wiki/Wildcat Wildcat https://en.wikipedia.org/wiki/Catopuma Catopuma 0m00.02s real 0m00.00s user 0m00.06s system
It took 0.02 seconds to query the 90M database; for comparisons grep takes 0.21 seconds, so it's a nice ~10x speedup!
% time grep cat db.wiki > /dev/null 0m00.21s real 0m00.21s user 0m00.15s system
The selected documents also aren't that bad, let's find for example all the turn-based strategy games in the ports tree:
% time ./obj/ftsearch 'turn-based strategy game' 1oom game engine recreation of Master of Orion 1 freeciv-server Civilization clone for X11; multiplayer; game server freecol free Colonization clone lgeneral turn-based strategy engine ufoai squad-based tactical strategy game wesnoth fantasy turn-based strategy game 0m00.00s real 0m00.00s user 0m00.01s system
Well, these are probably not the only ones but without fancy things like stemming we can only so much and match exact words in the ports comment and description.
However, the database is not really fast to generate: mkftsidx takes more than three minutes to index the 947M enwiki-latest-abstract1.xml Wikipedia dump.
Profiling shows that the 67% of the time is wasted in memmove(3): keeping the dictionary in a sorted array wasn't a great idea after all. Some kind of hash map or tree may lead better results. Also, with huge amounts of data we may not be able to memory map the whole index, so for big collection of documents something like a B+ tree may be the answer.
Further improvements includes:
In a future post I'd like to extend this small framework to cover some simple ranking solutions, and maybe even some simple stemming.
-- text: CC0 1.0; code: public domain (unless specified otherwise). No copyright here.