2024-08-16 JSON feed for indexing

Recently, @dredmorbius@toot.cat wrote about Google and search and posed the question:

What if websites indexed their own content, and published a permuted index in a standard format, in a cache-and-forward model similar to how DNS works?

A while ago I wondered about self-published indexes. We have software to generate feeds. Why not software to generate indexes? Back then I proposed a JSON format. Today I finally took a look at JSON Feed. I think it has everything we need.

Take a look at the example for this site: .well-known/search-feed.json.gz. The file is about 9.7 MiB. The source material is 6740 Markdown pages, a total of about 21.5 MiB, or 8.4 MiB compressed.
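
For illustration, here is a minimal sketch of how such a file could be generated from a directory of Markdown pages. The directory layout, URLs, and field choices are assumptions for the example, not the actual script behind this site.

```python
# Sketch: build a gzipped JSON Feed "search feed" from a directory of Markdown pages.
import gzip
import json
from pathlib import Path

site = "https://example.org"              # hypothetical site
pages_dir = Path("wiki/pages")            # hypothetical source directory

items = []
for path in sorted(pages_dir.glob("*.md")):
    text = path.read_text(encoding="utf-8")
    items.append({
        "id": f"{site}/{path.stem}",
        "url": f"{site}/{path.stem}",
        "title": path.stem,
        "content_text": text,             # raw page text; see the notes on content_text below
    })

feed = {
    "version": "https://jsonfeed.org/version/1.1",
    "title": "Site index",
    "home_page_url": site,
    "items": items,
}

out = Path(".well-known/search-feed.json.gz")
out.parent.mkdir(parents=True, exist_ok=True)
with gzip.open(out, "wt", encoding="utf-8") as f:
    json.dump(feed, f, ensure_ascii=False)
```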

Using the `next_url` attribute, it would be possible to split this file up into chunks of 100 pages each, or a chunk per year, if the platform promises that older pages never change. This wouldn't work for my wiki, but perhaps it would for certain platforms.
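
Here is a rough sketch of that kind of chunking, with each chunk pointing at the following one via `next_url`. The file names and URLs are made up for the example.

```python
# Sketch: split the index items into chunks of 100 and chain them with next_url.
import json

def write_chunks(items, site="https://example.org", size=100):
    chunks = [items[i:i + size] for i in range(0, len(items), size)]
    for n, chunk in enumerate(chunks, start=1):
        feed = {
            "version": "https://jsonfeed.org/version/1.1",
            "title": f"Site index, part {n}",
            "feed_url": f"{site}/.well-known/search-feed-{n}.json",
            "items": chunk,
        }
        if n < len(chunks):
            # a consumer follows next_url until it is absent
            feed["next_url"] = f"{site}/.well-known/search-feed-{n + 1}.json"
        with open(f"search-feed-{n}.json", "w", encoding="utf-8") as f:
            json.dump(feed, f, ensure_ascii=False)
```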

Somebody will have to write up a best practice on how to use HTTP headers to avoid downloading the whole file when nothing has changed. Sadly, section 13 of RFC 2616 is pretty convoluted. Basically it comes down to the If-Modified-Since and ETag headers.
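
Until that write-up exists, the gist is probably something like the following sketch: remember the validators from the last response and send them back, treating a 304 response as "nothing changed". It uses the requests library and assumes the server actually sends ETag or Last-Modified headers.

```python
# Sketch: conditional GET using the ETag and Last-Modified validators.
import requests

def fetch_if_changed(url, etag=None, last_modified=None):
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    r = requests.get(url, headers=headers, timeout=30)
    if r.status_code == 304:
        return None, etag, last_modified          # unchanged, nothing downloaded
    r.raise_for_status()
    return r.content, r.headers.get("ETag"), r.headers.get("Last-Modified")
```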

We also need to agree on how to use some of the JSON Feed attributes.

`content_text`: If used, all markup should be ignored by the consuming search engine (no guessing whether the text is Markdown or not). It would be fine if this contained just the text nodes of the HTML, separated by spaces. This can be useful to see whether words occur in the text, how frequent they are, etc.

`content_html`: This is the preferred way to include pages in the index. It is up to the search engine provider to extract useful information from the HTML, including summaries, extracts, scoring, etc. It is up to index providers to provide the kind of HTML they think serves search engines best. This includes using semantic HTML tags and dropping styles, scripts, footers and other elements that search engines might otherwise use to reduce the relevance of the item.
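
To make the two attributes concrete, here is what a single item might look like, written as a Python dictionary; the URL and text are invented for the example.

```python
# Sketch: one index item offering both attributes. A search engine would prefer
# content_html and could fall back to content_text for plain word statistics.
import json

item = {
    "id": "https://example.org/2024-08-16-json-feed-for-indexing",
    "url": "https://example.org/2024-08-16-json-feed-for-indexing",
    "title": "JSON feed for indexing",
    "content_text": "What if websites indexed their own content and published it",
    "content_html": "<article><h1>JSON feed for indexing</h1>"
                    "<p>What if websites indexed their own content "
                    "and published it?</p></article>",
}
print(json.dumps(item, indent=2, ensure_ascii=False))
```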

I also find RFC 5005 to be very instructive in how to think about feeds for archiving.

@jonny@neuromatch.social commented, saying that it was important to think about how search indexes were going to be used:

so given that no single machine would or should store a whole index of the internet or even all your local internet, you can go a few ways with that, take the global quorum sensing path and you get a bigass global dht-like thing like ipfs. if instead you think there should be some structure, then you need proximity. is that social proximity where we swap indexes between people we know? or webring like proximity dependent on pages linking to each other and mutually indexing their neighborhood?

It's an interesting question but I think I want incremental improvements to the current situation. So if a person has a website right now, on a server, what's the simplest thing they can do so that they aren't drowned in crawlers and can still be found via search? That would be publishing an index, analogous to publishing a feed. Having more search engines (even if they use a legacy centralized architecture) would be better than what we have now. Not depending on crawlers would be better than what we have now.

In terms of decentralisation, I think I like community search engines like lieu. The idea is great: a community lists a bunch of sites, and lieu generates a web ring and crawls them to build an index of all the member sites. Instead of crawling, it could fetch the indexes. This would be much better than what it does right now: lieu uses colly for crawling, and colly ignores robots.txt. This means that lieu instantly bans itself when it visits my site because it isn't rate limited. It's just an implementation detail, but sadly I am biased. I've been on a Butlerian Jihad since 2009, when I discovered that over 30% of all requests I serve from my sites are for machines, not humans.
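
The change could be as small as the following sketch: before crawling a member site, try to fetch a published index from the well-known location used above, and only fall back to crawling if there is none. lieu itself is written in Go; this is just Python to illustrate the idea, and `crawl_site` stands in for whatever crawler a community search engine already has.

```python
# Sketch: prefer a published index over crawling a member site.
import gzip
import json
import urllib.request

def fetch_index(site):
    url = site.rstrip("/") + "/.well-known/search-feed.json.gz"
    try:
        with urllib.request.urlopen(url, timeout=30) as response:
            return json.loads(gzip.decompress(response.read()))
    except Exception:
        return None                        # no published index available

def gather(site, crawl_site):
    feed = fetch_index(site)
    if feed is not None:
        return feed["items"]               # one download instead of many requests
    return crawl_site(site)                # fall back to crawling
```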

It makes me want to raise my keyboard and scream "CO₂ for the CO₂ god!!"

Somebody should draw a Hacker Elric doing that, standing on a mountain of electro-trash with the burnt and dead landscape of the post-apocalypse in the background.

But back to the problem of indexing. Right now, search engine operators and their parasites, the search engine optimisation enterprises, crawl every single page on my wikis, including page histories, page diffs, and more. If every wanna-be search engine downloaded my index once a day instead, I would be saving resources. Whether that's a step in the right direction, I don't know.

@jonny@neuromatch.social also said:

i just think that the ability to fundamentally depart from the commercial structure of the web and all its brokenness doesn't happen gradually and esp. not with the server/client stack we have now

Indeed, there must be another way. I just don't see it, right now. It's always hard to imagine a new world while you're still living in the old one. I'm sure the solution will seem obvious to the next generation, looking back.

#Search #Feeds #Butlerian Jihad

Would having such an index generate too much traffic? I think it would work if all other crawling stopped. That’s the necessary trade-off, of course. No stupid crawlers lost in my wiki admin links (history pages, diffs) wasting resources – that is my goal. Also, since I ask for a crawl delay, crawlers have to reconnect all the time, negotiate TLS all the time, start up CGI scripts all the time… having basically a static snapshot for download would obviate all that (in a world where crawlers are smart enough to understand that they don’t need to crawl).

Lunr.js is a small, full-text search library for use in the browser. It indexes JSON documents and provides a simple search interface for retrieving documents that best match text queries. – Lunr.js

After indexing, Pagefind adds a static search bundle to your built files, which exposes a JavaScript search API that can be used anywhere on your site. Pagefind also provides a prebuilt UI that can be used with no configuration. – Pagefind

Any time they visit the IndieSearch homepage (a page served from their browser extension) they can now search all the sites supporting IndieSearch they've visited and/or included. – IndieSearch, byJP
