💾 Archived View for thebackupbox.net › ~epoch › blog › search captured on 2024-12-17 at 10:12:04. Gemini links have been rewritten to link to archived content
-=-=-=-=-=-=-
I want to close my tabs, and it seems like a waste to just close them.
So here we go.
This started with someone on IRC bringing up an old topic we talk about
every so often. They want some decentralized way to search the internet,
and my original reaction was something like: get site authors to
implement search on their own sites using some common API, so users
can search all of those sites through it. Usually my brain goes in
the OpenSearch direction, where sites <link> to XML documents that
describe how to search the site, with various output formats
that clients can then parse reliably, like RSS or Atom.
https://github.com/dewitt/opensearch
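Roughly what the client side of that could look like, as a minimal sketch. The description document below is made up for an imaginary example.org (the site, its URL templates, and the `search_url` helper are all assumptions for illustration); only the namespace and the `{searchTerms}` placeholder come from OpenSearch 1.1.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

# Namespace used by OpenSearch 1.1 description documents.
OSD_NS = "{http://a9.com/-/spec/opensearch/1.1/}"

# Hypothetical description document for an imaginary site.
SAMPLE_OSD = """<OpenSearchDescription xmlns="http://a9.com/-/spec/opensearch/1.1/">
  <ShortName>Example</ShortName>
  <Description>Search example.org</Description>
  <Url type="application/atom+xml"
       template="https://example.org/search?q={searchTerms}&amp;format=atom"/>
  <Url type="text/html"
       template="https://example.org/search?q={searchTerms}"/>
</OpenSearchDescription>
"""


def search_url(osd_xml, terms, wanted_type="application/atom+xml"):
    """Return the search URL for `terms`, or None if no template matches."""
    root = ET.fromstring(osd_xml)
    for url in root.iter(OSD_NS + "Url"):
        if url.get("type") == wanted_type:
            # Percent-encode the terms before dropping them into the template.
            return url.get("template").replace(
                "{searchTerms}", quote(terms, safe="")
            )
    return None


print(search_url(SAMPLE_OSD, "decentralized search"))
```

A client that knows a pile of these description documents could run the same query against every site and merge the Atom or RSS results itself.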
Then I was thinking, maybe the site admins could provide the database
they'd be using for their site's search directly, and let the user
search it however many times they want without needing to bother the
site more than once.
Of course, then users end up needing to download a huge
full-text search database even if they only want to search once, but
I remembered seeing someone make a website that would load an
sqlite database a bit at a time, as needed.
I had mistakenly remembered it having the sqlite database stored in
a torrent. It seems to actually keep the database on ipfs. Though the
orange site comments that I found mentioned some other projects
this was based on that /did/ store the database in a torrent.
https://news.ycombinator.com/item?id=29920043
https://github.com/lmatteis/torrent-net
https://github.com/bittorrent/sqltorrent
http://bittorrent.org/beps/bep_0046.html
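A rough sketch of why the "bits at a time" trick can work at all, not a claim about how any of the projects above do it. An sqlite file is a sequence of fixed-size pages, and the page size sits in the 100-byte file header, so a client only needs to fetch the byte ranges of the pages a query touches. Here `read_range` is a hypothetical stand-in that reads a local file where a real client would send an HTTP Range request (or ask for the matching torrent pieces).

```python
import os
import sqlite3
import struct
import tempfile


def page_size_from_header(header):
    """Parse the page size out of the first bytes of an SQLite 3 file."""
    # Bytes 0..15 hold the magic string, bytes 16..17 the page size as a
    # big-endian 16-bit integer, where the value 1 stands for 65536.
    if header[:16] != b"SQLite format 3\x00":
        raise ValueError("not an SQLite 3 database")
    (size,) = struct.unpack(">H", header[16:18])
    return 65536 if size == 1 else size


def byte_range_for_page(page_number, page_size):
    """Inclusive (start, end) byte offsets of a page; pages count from 1."""
    start = (page_number - 1) * page_size
    return start, start + page_size - 1


def read_range(path, start, end):
    """Hypothetical stand-in for a 'Range: bytes=start-end' request."""
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(end - start + 1)


# Demo against a throwaway local database.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "index.db")
    con = sqlite3.connect(path)
    con.execute("CREATE TABLE pages (url TEXT, body TEXT)")
    con.commit()
    con.close()

    header = read_range(path, 0, 99)  # the file header is 100 bytes
    size = page_size_from_header(header)
    start, end = byte_range_for_page(2, size)
    print(f"page size {size}; page 2 would be 'Range: bytes={start}-{end}'")
```

The part this leaves out is the hard part: hooking those range reads underneath sqlite itself, so that a normal SQL query only pulls the pages it walks.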
Knowing it was possible to query remote sqlite databases like that, I
decided the next step would be to figure out a good format that sites
should provide their self-index in. Preferably something that was already
commonly used. Like, something a crawler would output. I didn't find
anything that fit that exactly, but I found a tool that will convert warc
files into an sqlite db and let you query it.
https://github.com/Florents-Tselai/WarcDB
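To make "a self-index you can query" a bit more concrete, here is a minimal sketch. The `pages` table, its columns, and the sample rows are invented for illustration and are NOT WarcDB's schema; it also uses a plain LIKE scan, where a real index would more likely use sqlite's full-text search.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Hypothetical schema: one row per crawled page of the site.
con.execute(
    "CREATE TABLE pages (url TEXT PRIMARY KEY, content_type TEXT, body TEXT)"
)
con.executemany(
    "INSERT INTO pages VALUES (?, ?, ?)",
    [
        ("gemini://example.org/", "text/gemini",
         "# Home\nWelcome to my capsule."),
        ("gemini://example.org/blog/search", "text/gemini",
         "Thoughts on decentralized search and sqlite."),
        ("gemini://example.org/blog/tabs", "text/gemini",
         "Closing tabs, one link at a time."),
    ],
)


def search(con, term):
    """Return URLs of pages whose body contains `term` (ASCII case-insensitive)."""
    # Escape LIKE wildcards so the term is matched literally.
    escaped = (
        term.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    )
    rows = con.execute(
        "SELECT url FROM pages WHERE body LIKE ? ESCAPE '\\' ORDER BY url",
        ("%" + escaped + "%",),
    )
    return [row[0] for row in rows]


print(search(con, "sqlite"))
```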
Places that use warc files are, like, archive.org and commoncrawl.
It would have been really nice if archive.org and commoncrawl had
already been using an sqlite-based format. It would have been a
lot easier to plug stuff together and make magic.
example archive.org link that has warc files
orange site comments on warcdb
I just looked up what yacy uses for their database. It seems to be
some custom thing that might end up being better than warcdb, but
I'm not about to write code to do it.
https://wiki.yacy.net/index.php/En:FAQ#What_kind_of_database_do_you_use.3F_Is_it_fast_enough.3F
After thinking about applying this to gemini, which might be a bit
easier, I remembered that FTP sites have been doing this kind of thing
for quite a while, by placing ls-lR files in their root directory.
Though that's not quite the same, because ls-lR is just metadata, and
wouldn't be usable for full-text search. Having some pointer in my
robots.txt to URLs that contain my own site's self-crawl database would
be a lot easier on my server, especially if other sites are going to
be the interface for end-users to search my site. The format I'm
thinking of atm for providing crawl data of a gemini site is kind of
like... a tgz of message/gemini files, each named after the request used
to retrieve that gemini response. ofc like, uri-escaped, since /s in
filenames would cause trouble.
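A minimal sketch of that tgz idea, under a couple of assumptions: each file holds the raw response (the header line plus the body), which is my reading of "message/gemini", and the names are the request URLs percent-encoded with nothing left unescaped. The helper names and sample capsule are made up.

```python
import io
import tarfile
from urllib.parse import quote, unquote


def member_name(request_url):
    """Percent-encode the whole request URL so it contains no '/'."""
    return quote(request_url, safe="")


def build_archive(responses):
    """Pack {request URL: raw response bytes} into an in-memory .tgz."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for url, raw in sorted(responses.items()):
            info = tarfile.TarInfo(name=member_name(url))
            info.size = len(raw)
            tar.addfile(info, io.BytesIO(raw))
    return buf.getvalue()


def read_archive(data):
    """Unpack a .tgz made by build_archive back into a dict."""
    out = {}
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:gz") as tar:
        for info in tar:
            out[unquote(info.name)] = tar.extractfile(info).read()
    return out


# Hypothetical crawl of a two-page capsule.
crawl = {
    "gemini://example.org/": b"20 text/gemini\r\n# Home\n",
    "gemini://example.org/blog/search": b"20 text/gemini\r\nsearch notes\n",
}
archive = build_archive(crawl)
print(member_name("gemini://example.org/blog/search"))
print(read_archive(archive) == crawl)
```

Escaping everything (not just the /s) keeps the mapping reversible, so a search site can recover the exact request from the filename.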
Anyway. All my tabs are closed now.