2023-09-17 How to index pages for Oddµ

These last few days I've been working a lot on Oddµ. It's weird. I
stopped programming at the office. I write documentation for
programmers, now. But at home, it's code code code…

@maxc@merveilles.town was asking about wiki features and @neauoire@merveilles.town mentioned backlinks. I was never a big fan of backlinks, however.

I think in a traditional hypertext where all nodes are equal, my expectation is that if there is a topic with related pages, the links are already there. For blog-like hypertext, tags take on this role.

But even when you're using tags, the situation is not great. The past and potentially superseded pages get backlinks from the current pages pointing back at them: I write a new page and link to the past pages I'm building upon. The page I'm currently writing, however, cannot have backlinks, because nothing links to it yet. Backlinking can't fix that. I have to go back to those old pages and post an update: I need to edit them and link to the new page that extends, updates or supersedes them. Those links from old pages to related new pages have to be added by hand. Backlinking doesn't do that (and tags are very limited).

Anyway, back when I had backlinks on my wikis, I never used them, and now that I don't have them, I don't miss them. But I use search all the time. I often need to find just the right page to link to from a post. How do I find it? I search my own blog for the page I know I have written at some point in the past. This needs to be fast.

I’m at over 6000 pages and just wrote a new wiki engine, Oddµ. This time around I prioritized search.

Search is super important to me. I went through a lot of iterations: Oddmuse (a different wiki of mine) first opened all the pages in Perl and searched them; then for a while I tried to use an index that did some scoring but was really unhappy with it; then I just used `grep` to filter the pages because it was so fast; then I added hashtags in a two-way Berkeley DB to be even faster… But to be honest, what I used most of all was simple string matching on page titles (and in my case, page titles were file names). That was super fast and the page names were always in memory. Oddmu (the new wiki in Go) does regular Markdown parsing, and now the filenames are no longer page titles.
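
That kind of title matching is trivial once the names are in memory. A minimal sketch in Go, with made-up names rather than Oddmu's actual code:

```go
package main

import "strings"

// titles maps filenames to page titles, kept in memory at all times.
var titles = map[string]string{}

// matchTitles returns the filenames of all pages whose title
// contains the query, ignoring case. No disk access required.
func matchTitles(q string) []string {
	q = strings.ToLower(q)
	var names []string
	for name, title := range titles {
		if strings.Contains(strings.ToLower(title), q) {
			names = append(names, name)
		}
	}
	return names
}
```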

Just now I implemented page title parsing at startup and worked on sorting again. It started with me looking at how the trigram index I was using worked. If indexing (basically reading all the files and creating the index) is quicker than using the index and opening the resulting files, something is wrong:

Indexed 6528 pages in 11.55961129s
Search for crucible found 28 pages in 824.569287ms
Search for freya found 49 pages in 1.554338733s
Search for #software found 327 pages in 2.118936177s
Search for #life found 279 pages in 1.905384854s
Search for #rpg found 1507 pages in 14.353306827s
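
For the record, the timings come from straightforward instrumentation. A sketch of how one might measure this in Go, with `index` and `search` as stand-ins for the real functions:

```go
package main

import (
	"log"
	"time"
)

// index and search stand in for the real thing: index reads all the
// files and builds the index, search uses the index.
func index() int               { return 0 }
func search(q string) []string { return nil }

func main() {
	start := time.Now()
	n := index()
	log.Printf("Indexed %d pages in %s", n, time.Since(start))

	start = time.Now()
	pages := search("crucible")
	log.Printf("Search for crucible found %d pages in %s",
		len(pages), time.Since(start))
}
```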

I added some print statements and it seems the problem is not the index but the loading and summarizing of the pages: computing a score, highlighting snippets matching the query string, that kind of stuff.
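
The highlighting itself isn't the hard part, reading every file is. A purely illustrative sketch of the kind of snippet code I mean (it slices bytes, so it can cut into multi-byte UTF-8 characters, and it assumes lowercasing doesn't shift byte offsets, which holds for ASCII):

```go
package main

import "strings"

// snippet returns a short excerpt of body around the first match of
// q, with the match wrapped in <b> tags. Cheap per page, but it
// needs the whole page body, which means loading the file.
func snippet(body, q string) string {
	i := strings.Index(strings.ToLower(body), strings.ToLower(q))
	if i < 0 {
		return ""
	}
	from := i - 40
	if from < 0 {
		from = 0
	}
	to := i + len(q) + 40
	if to > len(body) {
		to = len(body)
	}
	return body[from:i] + "<b>" + body[i:i+len(q)] + "</b>" + body[i+len(q):to]
}
```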

Looks like maybe I should have pagination after all! 😅

Search for #rpg found 1507 pages in 358.133µs
Loading and summarizing 1507 pages took 16.468620929s
Sorting 1507 pages took 3.209779ms

I added pagination, but since I still wanted to sort by score, I had to load all the 1500+ pages and score them before I could determine which pages to display. I only computed the summary and the language for the pages I actually needed to display, but that just cut the load time in half. And waiting 7–8 seconds is not cool.
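
In other words, sorting by score forces a full load: every candidate gets loaded and scored, and only then can a page-sized slice be cut out and summarized. A sketch of that shape, with `load` and `score` as stand-ins:

```go
package main

import "sort"

type result struct {
	name  string
	score int
}

// load and score stand in for reading a page from disk and scoring
// it against the query; loading is the expensive part.
func load(name string) string  { return "" }
func score(body, q string) int { return 0 }

// paginate scores every candidate (which means loading every file),
// sorts by score, and only then cuts out the slice for the requested
// page. Only that slice still needs summaries and language detection.
func paginate(names []string, q string, page, size int) []result {
	results := make([]result, 0, len(names))
	for _, name := range names {
		results = append(results, result{name, score(load(name), q)})
	}
	sort.Slice(results, func(i, j int) bool {
		return results[i].score > results[j].score
	})
	from := page * size
	if from > len(results) {
		from = len(results)
	}
	to := from + size
	if to > len(results) {
		to = len(results)
	}
	return results[from:to]
}
```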

So now I'm back to determining which pages to show without scoring all of them. That works quite well in my case, where I'm mostly looking for a "recent" blog post. As blog posts start with an ISO date, the solution is to sort the pages with an ISO date first, in descending order, and the remaining pages after them, in ascending lexical order.

But now I'm being bitten once again by the recent decision to decouple filenames from page names: the lexical sorting is by filename, since I don't want to load the pages from disk, and once the two are decoupled, the sorting is weird. All my options seem uncool.

I decided to keep the real titles in memory at all times. That's easy to do since I'm already indexing all the pages at startup. But I'm still running into the problem of decoupled page titles and filenames: if the page title no longer starts with a date (even if the filename does), then the page isn't considered part of the blog!
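
Keeping the titles is cheap if the indexer reads every file at startup anyway. A sketch, assuming the title is the first `#` heading in the file (made-up names, not Oddmu's code):

```go
package main

import (
	"bufio"
	"os"
	"strings"
)

// titles maps filenames to page titles, filled at startup.
var titles = map[string]string{}

// indexTitle reads the file until it finds the first heading and
// remembers it as the page title.
func indexTitle(name string) error {
	f, err := os.Open(name)
	if err != nil {
		return err
	}
	defer f.Close()
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.HasPrefix(line, "# ") {
			titles[name] = strings.TrimPrefix(line, "# ")
			return nil
		}
	}
	titles[name] = name // no heading: fall back to the filename
	return scanner.Err()
}
```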

I guess the best solution would be for the sorting by date to use the filename, but for the lexical sorting to use the page titles. That basically depends on the fact that very few blog posts are written on the same day. Works for me! Now I just need to find the time to implement that.
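
A sketch of what that comparison might look like (made-up names; Oddmu doesn't have this yet): the dates come from the filenames and sort descending, everything else sorts ascending by title.

```go
package main

import (
	"regexp"
	"sort"
)

// titles maps filenames to page titles (see the earlier sketch).
var titles = map[string]string{}

var datePrefix = regexp.MustCompile(`^\d\d\d\d-\d\d-\d\d`)

// sortNames puts pages whose filename starts with an ISO date first,
// newest first; all other pages follow, ascending by page title.
// Posts sharing a date keep their relative order, which is fine as
// long as very few posts are written on the same day.
func sortNames(names []string) {
	sort.SliceStable(names, func(i, j int) bool {
		di := datePrefix.FindString(names[i])
		dj := datePrefix.FindString(names[j])
		switch {
		case di != "" && dj != "":
			return di > dj // both dated: descending by date
		case di != "":
			return true // dated pages come first
		case dj != "":
			return false
		default:
			return titles[names[i]] < titles[names[j]]
		}
	})
}
```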

#Wikis #Oddµ

I do use hashtags, and these link to searches for the hashtag, with the first result being the *hashtag page*: a page with the same name as the hashtag that contains a *forward index*, a list of links to all the *blog pages* (pages with a name starting with a date) that carry the hashtag. That part feels a bit like gopher menus.

Also, all blog pages are linked to from the front page for the current year (which then gets moved to archival pages over the years); that part feels a bit like a blog.

In addition to that, any page edit deemed significant enough is tracked on the changes page (a bit like Recent Changes, except a static page that can be edited); that part feels a bit like a wiki.

This intentional linking models how I move around, looking for connections. I admit that the idea of all links going both ways is fascinating in theory, but in practice you end up with a list of half a dozen page names at the bottom or in the sidebar, and you don't have enough context. Is the topic expanded upon? Contradicted? Revised? Or is this a harmless "see also" link? In all of these cases, it would be preferable to actually go back to the old pages and explicitly link forward in time to newer, related pages, with some context, so readers can decide whether they want to follow the links.

And with that, we're back to forward indexes.