2023-03-07 A vision for search

It’s early in the morning, my wife left for her commute and the Duke Ellington band is playing. I’m between first coffee and breakfast.

If I were to reinvent Search, where would I start? Page rank invites spam, we already know that. Endless crawling leads to the majority of web traffic being bots, we also know that. Crawling is also invasive, pulling things into the public square that were not intended for publication. So what are we to do?

Here’s a different plan. Perhaps we need more different plans.

Let there be a site map that contains all the pages you would like strangers to find, with the keywords associated with them.

This compressed JSON file (better than XML, right?) would be a well-known path for every site, like “/.well-known/index.json.xz”. It lists all the pages on your domain, and their keywords.

Perhaps we can use RFC 5005 ideas and have chains of these index files with updates to previous files in order to save bandwidth, or the entries come in order so you can parse them as a stream and stop when you get to the time stamps you already have in your index. These “other” files could also be the per-user files on a shared hosting system.

{
  "version": "1",
  "contact": "https://alexschroeder.ch/wiki/Contact",
  "base": "https://alexschroeder.ch",
  "others": [],
  "index": [
    {
      "ts": "2023-03-07T07:17:00Z",
      "title": "Diary",
      "path": "/wiki/Diary",
      "keywords": ["Diary", "Alex", "Schroeder"]
    }
  ]
}

“contact” could be any sort of URL including a mail URL.

“base” can contain a path and the “path” is appended to it, so that it works for the shared hosting case.

“others” are links to other files. Type “include” means it’s part of our site. Type “trust” means we trust this other site. We could model it as a trust “number” but I think it’d be more interesting to assign semantics. To “include” a thing says you take responsibility for the content whereas “trust” means you just think they’re cool but there’s no contact.

  "others": [
    {
      "type": "include",
      "url": "https://alexschroeder.ch/software/index.json.xz"
    },
    {
      "type": "trust",
      "url": "https://en.wikipedia.org/index.json.xz"
    }
  ],

The hash per page should have some extra fields: MIME-type, size, description (with plain text):

    {
      "ts": "2023-03-07T07:17:00Z",
      "title": "Wall of Text",
      "path": "/wiki/wall",
      "content-type": "text/html; charset=utf-8",
      "content-length": 57824,
      "description": "Welcome to my Wall of Text! 🙂"
      "keywords": ["Diary", "Alex", "Schroeder"]
    }

I’m not afraid of content stuffing nor spam because we’ll defederate from bad actors and soon enough there will be lists of known bad actors, and so on. Lessons learned from the fediverse.

The components we have are therefore:

Local indexers that create these files.
Spiders fetch these files like feeds.
Search is a local program that makes sense of these collections for users.
Some sort of collective trust assigning that is better than just whitelisting and publishing these whitelists.

Search being local would allow us to try different algorithms.

Indexing being local would allow us to limit stop the endless crawler churn.

Compared to regular search, the Silicon Bro approach of “break stuff first, apologize later” is unacceptable to me. The Silicon Bro approach is why Google succeeded: It didn’t need nor ask for permission to crawl. But in our human lives, you can’t just enter an unlocked home and take whatever you want. Online, this is the bro mindset. “It was public!” they, unable to comprehend that there are nuances and that the presence or absence of a lock is just the most primitive lense through which to view reality.

Instead of taking the Silicon Bro approach, this proposal depends on large-scale cooperation. People put up their indexes with their titles and their keywords.

Then again, with shared hosting being so big, perhaps it just needs a few big sites to join forces. We could kick it off by creating a Wikipedia index. I find most of my information on Wikipedia, after all. Then add our own sites to it. Like in the old days, when people submitted their sites to Google: We’d need that submission link.

And yes of course a central authority that accepts all submissions and trust them all is going to be nice and free at first but then it’ll be all about spam fighting and the vetting will cost money that has to be earned somehow. How about we don’t create this central authority, then? Let’s grow organically.

Sure, we lost out to Google and Bing, but we can crawl back out from the salty waters of the Internet ocean and try again. Their ad-driven business models won’t last forever.

Later, we have to find a way to link into the system:

As publishers we need to find a public index that trusts us and links to our index.
As readers we need to find a search engine that trusts and pulls the right indexes so our search results are good.

It might work.

*2024-08-16**. Actually, we could just use JSON Feed to export the parts of the site that we want to export.

JSON Feed

#Programming #Search #Federated Opt-In Search

Comments

(Please contact me if you want to remove your comment.)

⁂

This post got me thinking about “a personal search engine” again. Thanks for the inspiration, Prototyping a Personal Search Engine.

Prototyping a Personal Search Engine

– R. S. Doiel 2023-03-08 07:08 UTC

R. S. Doiel

---

Here is my own solution

– Sandra 2023-03-08 07:59 UTC

---

*2023-03-11**. Also note this interesting set of tools (via Gopher):

Quarry contains a number of components:

Crawler/indexer (quarry.pl)Gopher search, front end to search index (search.dcgi)Wrapper for quarry.pl to process pending host index requests (indexPending.pl)Sitemap generator (generateSitemap.pl)Host and selector maintenance (checkHosts.pl)

…

The reasons for the sitemap are twofold:

Efficiency, downloading a single index file rather than crawling.The format supports additional metadata: Description, Categories, Keywords

These extra metadata fields can be used to greatly enhance search results. – Quarry

Quarry

*2024-08-16**. Reimplementation using JSON Feed for indexes.

JSON Feed for indexes