2024-08-18 Indie search for Oddmu

This is a follow-up to 2024-08-16 JSON feed for indexing, where I linked to IndieSearch, byJP. I want to see what's required for this to work.

2024-08-16 JSON feed for indexing

IndieSearch, byJP

First, I need to install Pagefind. Lucky me, I already have a Rust build environment installed.

cargo install pagefind

Next, I need a static HTML copy of my site:

env ODDMU_LANGUAGES=de,en oddmu static -jobs 3 /tmp/alex

Create the index:

pagefind --site /tmp/alex

Upload:

mv /tmp/alex/pagefind .
make upload

It's now available at `https://alexschroeder.ch/wiki/pagefind`.

Adding the info to the page header:

<link rel="search" type="application/pagefind" href="/wiki/pagefind" title="Alex Schroeder’s Diary">
<link href="/wiki/pagefind/pagefind-ui.css" rel="stylesheet">
<script src="/wiki/pagefind/pagefind-ui.js"></script>
<script>
window.addEventListener('DOMContentLoaded', (event) => {
  new PagefindUI({ element: "#pagefind", showSubResults: true });
});
</script>

And a div to the page body:

<div id="pagefind"></div>

And I had to add `'unsafe-eval'` to the `script-src` Content-Security-Policy header.
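
For reference, a minimal sketch of the change, assuming the header is set with Apache's mod_headers; the policy shown here is made up and the real one on my server has more to it:

# hypothetical example: allow Pagefind's WebAssembly to run by permitting 'unsafe-eval'
Header always set Content-Security-Policy "default-src 'self'; script-src 'self' 'unsafe-eval'"

Newer browsers also understand the narrower 'wasm-unsafe-eval' keyword, which might be enough for Pagefind's WebAssembly.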

Too bad the search result links all end in `.html` … that requires an extra rewrite rule to get rid of. On the other hand, it also allowed me to get rid of `baseUrl: "/view/"`.
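
Something along these lines, assuming Apache with mod_rewrite and that the pages are reachable without the extension; an untested sketch, not the rule I actually use:

RewriteEngine on
# hypothetical rule: redirect result links ending in .html to the extension-less page
RewriteRule ^/?(.+)\.html$ /$1 [R=302,L]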

Since there are still rough edges, this search is only available via ~~Pagefind~~.

#Search #Oddµ

Correct — except it’s many files, organised so as to allow minimal transfer for a single search query — but all can be retrieved to have a full local index.

Correct — they can be retrieved/cached to query locally, or queried efficiently remotely.

Yes. Many can be kept locally or remotely. The sites you “register”/add become your search index.

Yes (with minor hackery). Each Pagefind instance *is* the index, a small bootstrapping JS/WASM blob, and ~5 lines of init JS. Those 5 lines can include locations of multiple indexes for the (exclusively client-side) JS to query when searching; by default the only one configured is the “local” index that was created with the Pagefind JS/WASM blob, but you could strip away an empty index and have what you describe — in fact, that’s what my IndieSearch demo does!

IndieSearch
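
If I read the Pagefind documentation correctly, those few init lines could point at additional indexes via a mergeIndex option; a sketch based on my snippet above, with a made-up second bundle path:

<script>
window.addEventListener('DOMContentLoaded', (event) => {
  new PagefindUI({
    element: "#pagefind",
    showSubResults: true,
    // hypothetical second index hosted on somebody else's site
    mergeIndex: [{ bundlePath: "https://example.org/pagefind/" }]
  });
});
</script>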

I haven’t built that (yet), but yeah — I’d want to build an mf2/IndieWeb compatible “new user experience” that’d guide folks to finding a good-for-them set of defaults. My demo automatically (provisionally) adds any site you visit that supports IndieSearch, so you’d get better coverage fast. I’d also consider default-importing from blogrolls and the like. (Management & performance gets tricky with 1000s of indexes. Probs an “if we get there” problem. 😅)

Not yet, but I asked exactly this question of the Pagefind devs and they offered that it’s currently too hard, but that there is a potential route they'd consider, see #564.

#564

I'm still thinking about the situation where I'm part of a community and we want to all share search, like using Lieu for a webring – except I'd like a solution where I don't have to do the crawling.

Lieu

Sadly, the communities I'm part of are planets such as RPG Planet or Planet Emacslife, each with hundreds of blogs. I suppose most of them don't offer a Pagefind index, being hosted on Blogspot and Wordpress, but what I'm considering is ingesting their feed and indexing it. This could be a service I could perform for people. Luckily, it's often possible to get all the blog pages via the feed. This is how I've made local backups of other people's blogs. I guess that each site would be a separate index, however?

RPG Planet

Planet Emacslife

local backups of other people's blogs
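
A sketch of what that ingestion could look like, using Pagefind's Node indexing API as I understand it; the feed URL is made up and the fields are the ones from the JSON Feed spec:

import * as pagefind from "pagefind";

// hypothetical: build one index per site from that site's JSON feed
const response = await fetch("https://blog.example.org/feed.json");
const feed = await response.json();

const { index } = await pagefind.createIndex();
for (const item of feed.items) {
  await index.addHTMLFile({
    url: item.url,              // index each item under its original URL
    content: item.content_html, // the full HTML carried by the feed
  });
}
// one output directory per site
await index.writeFiles({ outputPath: "indexes/example" });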

I'd like to find a way that doesn't require me to always download all the pages. I'd like to find a way to update the index as it's being used.

I'd need a way to figure out how to configure it such that the results link back to the original pages, of course.

Another thing I'm considering is that my own site is rendered live, from Markdown files… it's not a static site. So ideally I'd be able to ingest Markdown files directly. Or I can go the route of exporting it all into a big feed and ingesting that, once I've solved the problem above, I guess. But the problem above might also be easier to solve by extracting HTML pages from the feed. It's what I've done in the past. Create something that works, first, then improve it later?
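
If I ever go the Markdown route, Pagefind's Node API apparently also accepts custom records, so the HTML step could be skipped entirely; a sketch with a made-up file and title:

import * as pagefind from "pagefind";
import { readFile } from "node:fs/promises";

// hypothetical: index one Markdown page directly and point the result at the live page
const markdown = await readFile("Welcome.md", "utf8");
const { index } = await pagefind.createIndex();
await index.addCustomRecord({
  url: "/view/Welcome",   // the dynamically rendered page, not a static .html copy
  content: markdown,      // treated as plain text; stripping Markdown syntax would be nicer
  language: "en",
  meta: { title: "Welcome" },
});
await index.writeFiles({ outputPath: "pagefind" });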

Anyway, ideas are swirling around.

The sort order of the results is less than ideal, for example. I'd like to emphasize more recent blog posts. Pagefind, however, returns them in some sort of scoring order, so I've seen quite a few results pointing to pages that are around 20 years old.
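
Pagefind does seem to have a sort mechanism: pages can carry a data-pagefind-sort attribute at index time and the lower-level JS API can then order by it instead of by score. A sketch, assuming every page exposed its date like that; the UI wiring would need more work:

<script type="module">
  // hypothetical: newest-first results instead of Pagefind's relevance order;
  // requires something like <span data-pagefind-sort="date">2024-08-18</span> on every page
  const pagefind = await import("/wiki/pagefind/pagefind.js");
  const search = await pagefind.search("oddmu", { sort: { date: "desc" } });
  console.log(search.results.length, "results, newest first");
</script>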

Image previews seem to rarely work. I suspect the problem is pages in subfolders linking to images in that same subfolder. Such relative links don't need a path – but they do if Pagefind is not in the same directory.

Here's another thing to consider: The index takes up more space than the full HTML of the entire site, compressed!

alex@sibirocobombus ~> du -sh alexschroeder.ch/wiki/pagefind alexschroeder.ch/wiki/feed.json.gz 
 43M	alexschroeder.ch/wiki/pagefind
9.7M	alexschroeder.ch/wiki/feed.json.gz

So the Pagefind index takes about 4× more space than the full HTML, for this site. This matches my experience with better indexing for my own site, where I experimented with full-text indexes and trigram indexes. Back then:

the 15 MiB of markdown files seem to have generated an index of 70 MiB – 2023-09-11 Oddµ memory consumption

2023-09-11 Oddµ memory consumption

Of course, in terms of copyright incentives, handing off the entire site like that is tricky. Doing it with a feed feels OK. Doing it for a search engine seems like handing the keys to Google. This provides an incentive to use a pre-computed index.

It also reminds me that the idea I had of building a search engine out of feed slurping without consent is probably a bad idea – like all ideas based on non-consensual acts.

Whether self-indexing is a good thing in terms of avoiding an English-first focus, I don't know. I suspect that most people will be using free software, and therefore there's no reason to assume that a search engine lacks the means to process the languages. Then again, that's a kind of lock-in: in order to get a new language supported, you depend on the software your favourite search engine supports. So people indexing their own pages might have long-term benefits.

I'm still wondering about the comparison of Pagefind and Lunr, to be honest. How many such static search solutions are there? Is there a benefit of one implementation over another?

I guess now I should look into Pagefind some more? Indexing non-HTML pages, handling image previews for pages not in the root directory and relative image source URLs, the sort order of results, the use of the `.html` extension in results… there are still rough edges as far as I am concerned – and per the discussion above, the onus is on me to fix my indexing. 😭

To see how hard the bots hit my dynamically rendered wiki, here's first a script to find an IP number that is making many requests: leech-detector.

leech-detector

root@sibirocobombus ~# tail -n 10000 /var/log/apache2/access.log \
  | bin/admin/leech-detector \
  | head
Total hits: 10000
IP                             |       Hits | Bandw. | Rel. | Interv. | Status
------------------------------:|-----------:|-------:|-----:|--------:|-------
                185.103.225.81 |       1057 |     0K |  10% |    1.7s | 302 (50%), 401 (49%), 200 (0%)
…

root@sibirocobombus ~# whois 185.103.225.81
…
org-name:       Muth Citynetz Halle GmbH
…

I have never heard of them. Sounds like a small hosting company, responsible for 10% of all hits on my server, right now. Let's print the status code they received from my web server and the path they requested.

root@sibirocobombus ~# grep 185.103.225.81 /var/log/apache2/access.log \
  | awk -e '{print $10, $8}' \
  | sort \
  | uniq -c
    524 /edit/logo.jpg
    529 /view/logo.jpg
      2 /view/RPG.rss
      2 /wiki/feed/full/RPG

This is simply broken shit that cannot be fixed by caching.

OK, let’s pick another IP number. Let’s look for an IP number that is getting a lot of 200 results but isn’t accessing my fedi instance:

root@sibirocobombus ~# grep -v '^social' /var/log/apache2/access.log \
  | grep ' 200 ' \
  | bin/admin/leech-detector \
  | head
…
                51.210.214.160 |        411 |     4K |   0% |  155.8s | 200 (100%)
…

root@sibirocobombus ~# whois 51.210.214.160
…
org-name:       OVH SAS
…

This is totally a hosting provider. What are they hosting?

root@sibirocobombus ~# grep 51.210.214.160 /var/log/apache2/access.log \
  | awk -e '{print $10, $8}' \
  | sort \
  | uniq -c
…
      1 200 /wiki?action=rss;rcidonly=InterWiki
      1 200 /wiki?action=rss;rcidonly=InvestmentWebOfTrust
      1 200 /wiki?action=rss;rcidonly=IrcTranscription
      1 200 /wiki?action=rss;rcidonly=JohnCappiello
      1 200 /wiki?action=rss;rcidonly=JulianKrause
      1 200 /wiki?action=rss;rcidonly=JustinKao
      1 200 /wiki?action=rss;rcidonly=KarlDubost
      1 200 /wiki?action=rss;rcidonly=KarlTree
      1 200 /wiki?action=rss;rcidonly=KurzAnleitung
      1 200 /wiki?action=rss;rcidonly=LangageClair
…

For every single wiki page, it learnt about the page’s feed from the metadata, and it is indexing all of them. I don’t know what to do. Producing these feeds is expensive. It can be an interesting service for humans who want to monitor a page or two. It’s a waste of CO₂ for machines that want to pre-emptively ingest these pages. And since there are thousands of these pages, every search bot ends up storing all the URLs, calling the wiki engine every time to ask for the latest change, making it load the page, check the metadata and send the reply, with TLS handshakes and cryptography on top of all that. It hurts my heart to think of all this waste.

I think this is terrible design that cannot be fixed by adding caching. My wiki does client-side caching (pages are said to remain fresh for 10s), HTTP caching (with an ETag header) and HTML caching (attempting to reduce the time spent parsing the wiki sources and generating the HTML). In a way, these bots are forcing me to abandon dynamically generated content on the open net – or to waste more time and energy on Squid caches or Varnish, more RAM, more resources, more of my time. And that’s what’s making me so angry. Dynamically generated HTML is excellent for low-traffic sites. These days, there seems to be no more low traffic because robot traffic is so high. They keep the caches warm, all the time, on a global scale.

I remain unhappy.