💾 Archived View for thebackupbox.net › ~epoch › blog › search captured on 2024-12-17 at 10:12:04. Gemini links have been rewritten to link to archived content

-=-=-=-=-=-=-

web, gemini, ftp, open, distributed

I want to close my tabs, and it seems like a waste to just close them.

So here we go.

This started with someone on IRC bringing up an old topic we talk about
every so often. They want some decentralized way to search the internet,
and my original reaction was something like: get site authors to
implement search on their own sites using some common API, so users
can search all of those sites through it. Usually my brain went in
the OpenSearch direction, where pages <link> to XML documents that
describe how to search the site, with various output formats that
clients can then parse reliably, like RSS or Atom.

https://github.com/dewitt/opensearch
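
A rough sketch of what the client side of that could look like, in
Python. The description document below is made up for a hypothetical
example.org; only the namespace, the Url element, and the
{searchTerms} placeholder come from OpenSearch 1.1 itself.

```python
import re
import xml.etree.ElementTree as ET
from urllib.parse import quote

OS_NS = "{http://a9.com/-/spec/opensearch/1.1/}"

# Hypothetical description document, the kind of thing a <link> would point at.
DESCRIPTION = """\
<OpenSearchDescription xmlns="http://a9.com/-/spec/opensearch/1.1/">
  <ShortName>Example</ShortName>
  <Description>Search example.org</Description>
  <Url type="application/atom+xml"
       template="https://example.org/search?q={searchTerms}&amp;page={startPage?}"/>
</OpenSearchDescription>
"""


def search_url(description_xml, terms, wanted_type="application/atom+xml"):
    """Return the search URL for `terms`, or None if no Url has that type."""
    root = ET.fromstring(description_xml)
    for url in root.iter(OS_NS + "Url"):
        if url.get("type") == wanted_type:
            template = url.get("template")
            filled = template.replace("{searchTerms}", quote(terms, safe=""))
            # Optional parameters look like {name?}; blank out the ones left over.
            return re.sub(r"\{[^}]*\?\}", "", filled)
    return None


print(search_url(DESCRIPTION, "gemini search"))
```

A client that searches many sites would do this once per site and then
parse whatever Atom or RSS comes back.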

Then I was thinking, maybe the site admins could provide the database
they'd be using for their site's search directly, and let the user
search it however many times they want without needing to bother the
site more than once.
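
To make that a bit more concrete, here is a minimal sketch of a search
database a site could publish, using Python's sqlite3. The table names
and the word-splitting are my own assumptions, not any existing format.
It uses a plain inverted index so it runs anywhere; SQLite's FTS5
extension would be the more usual choice where it's available.

```python
import re
import sqlite3


def build_index(db_path, pages):
    """Build a tiny search database. `pages` maps URL -> page text."""
    db = sqlite3.connect(db_path)
    db.execute(
        "CREATE TABLE page (id INTEGER PRIMARY KEY, url TEXT UNIQUE, body TEXT)"
    )
    db.execute(
        "CREATE TABLE term (word TEXT, page_id INTEGER, PRIMARY KEY (word, page_id))"
    )
    for url, body in pages.items():
        cur = db.execute("INSERT INTO page (url, body) VALUES (?, ?)", (url, body))
        words = set(re.findall(r"\w+", body.lower()))
        db.executemany(
            "INSERT INTO term (word, page_id) VALUES (?, ?)",
            [(word, cur.lastrowid) for word in words],
        )
    db.commit()
    return db


def search(db, query):
    """Return URLs of pages that contain every word in `query`."""
    words = sorted(set(re.findall(r"\w+", query.lower())))
    if not words:
        return []
    marks = ",".join("?" * len(words))
    sql = (
        "SELECT page.url FROM term JOIN page ON page.id = term.page_id "
        "WHERE term.word IN (" + marks + ") "
        "GROUP BY page.id HAVING COUNT(DISTINCT term.word) = ? "
        "ORDER BY page.url"
    )
    return [row[0] for row in db.execute(sql, [*words, len(words)])]
```

The user downloads the file once and runs search() locally as often as
they like.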

Of course, then users end up needing to download a huge full-text
search database even if they only want to search once, but I
remembered seeing someone make a website that would load an sqlite
database a bit at a time, as needed.

this thing

its live demo
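
I haven't described how that project does it, so take this as the
general idea only: an sqlite file is a sequence of fixed-size pages,
so a client can ask for just the pages it needs with HTTP Range
requests. The helper names here are hypothetical, and fetch_page
assumes a server that honours Range headers.

```python
import struct
import urllib.request

SQLITE_MAGIC = b"SQLite format 3\x00"


def page_size_from_header(header):
    """Read the page size from the first bytes of an sqlite file."""
    if header[:16] != SQLITE_MAGIC:
        raise ValueError("not an sqlite database")
    # Bytes 16-17 hold the page size, big-endian; the value 1 means 65536.
    (raw,) = struct.unpack(">H", header[16:18])
    return 65536 if raw == 1 else raw


def range_header(page_number, page_size):
    """Range header value for one page. Pages are numbered from 1."""
    start = (page_number - 1) * page_size
    return f"bytes={start}-{start + page_size - 1}"


def fetch_page(url, page_number, page_size):
    """Fetch a single page of a remote database (needs network access)."""
    request = urllib.request.Request(
        url, headers={"Range": range_header(page_number, page_size)}
    )
    with urllib.request.urlopen(request) as response:
        return response.read()
```

A real client would plug something like fetch_page in underneath
sqlite's file access, so queries only pull the pages they touch.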

I had mistakenly remembered it having the sqlite database stored in
a torrent. It seems to actually keep the database on IPFS. Though the
orange site comments that I found mention other projects this was
based on that /did/ store the database in a torrent.

https://news.ycombinator.com/item?id=29920043

https://github.com/lmatteis/torrent-net

https://github.com/bittorrent/sqltorrent

http://bittorrent.org/beps/bep_0046.html

Knowing it was possible to query sqlite databases that way, I decided
the next step would be to figure out a good format for sites to provide
their self-index in. Preferably something that was already commonly
used. Like, something a crawler would output. I didn't find anything
that fit exactly, but I found a tool that will convert warc files
into an sqlite db and let you query it.

https://github.com/Florents-Tselai/WarcDB

Places that use warc files include archive.org and commoncrawl.

It would have been really nice for archive.org and commoncrawl to
have been using an sqlite-based format already. It would have been a
lot easier to plug stuff together and make magic.

example archive.org link that has warc files

commoncrawl

orange site comments on warcdb

I just looked up what yacy uses for its database, and it seems to be
some custom thing that might end up being better than warcdb, but
I'm not about to write code to do it.

https://wiki.yacy.net/index.php/En:FAQ#What_kind_of_database_do_you_use.3F_Is_it_fast_enough.3F

After thinking about applying this to gemini, which might be a bit
easier, I remembered that FTP sites have been doing this kind of thing
for quite a while, by placing ls-lR files in their root directory.
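
For anyone who hasn't poked at one, an ls-lR file is just the output
of `ls -lR`, and searching it for filenames takes very little code.
This is a sketch that assumes the common GNU-style layout (nine
whitespace-separated columns, directory headers ending in a colon);
the sample listing is made up.

```python
import posixpath

SAMPLE = """\
.:
total 8
drwxr-xr-x 2 ftp ftp 4096 Jan  1  2020 pub
-rw-r--r-- 1 ftp ftp  120 Jan  1  2020 README

./pub:
total 4
-rw-r--r-- 1 ftp ftp 2048 Mar  3 12:00 notes.txt
"""


def parse_ls_lr(text):
    """Return a list of (path, size, is_directory) tuples."""
    entries = []
    directory = "."
    for line in text.splitlines():
        if not line.strip() or line.startswith("total "):
            continue
        fields = line.split(None, 8)
        if line.endswith(":") and len(fields) < 9:
            directory = line[:-1]  # a header like "./pub:"
            continue
        if len(fields) < 9 or not fields[4].isdigit():
            continue  # not a regular entry line (e.g. a device file)
        name = fields[8].split(" -> ")[0]  # drop symlink targets
        path = posixpath.normpath(posixpath.join(directory, name))
        entries.append((path, int(fields[4]), fields[0].startswith("d")))
    return entries


def find(entries, needle):
    """Case-insensitive substring match on the path."""
    return [path for path, _, _ in entries if needle.lower() in path.lower()]


print(find(parse_ls_lr(SAMPLE), "notes"))
```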

Though it's not quite the same, because ls-lR is just metadata, and
wouldn't be usable for full-text search. Having some pointer in my
robots.txt to URLs that contain my own site's self-crawl database
would be a lot easier on my server, especially if other sites are
going to be the interface for end-users to search my site. The format
I'm thinking of atm for providing crawl data of a gemini site is kind
of like... a tgz of message/gemini files named after the request used
to retrieve that gemini response. URI-escaped ofc, since /s in
filenames would cause trouble.
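
A minimal sketch of that tgz idea in Python, under a couple of
assumptions: each "message/gemini" file is taken to be the raw
response (header line plus body), and the escaping is plain
percent-encoding of every reserved character. The function names are
made up for illustration.

```python
import io
import tarfile
from urllib.parse import quote, unquote


def member_name(request_url):
    """Percent-encode the whole request so it has no '/' in it."""
    return quote(request_url, safe="")


def write_crawl(fileobj, responses):
    """Write a tgz. `responses` maps request URL -> raw response bytes."""
    with tarfile.open(fileobj=fileobj, mode="w:gz") as tar:
        for url, raw in sorted(responses.items()):
            info = tarfile.TarInfo(name=member_name(url))
            info.size = len(raw)
            tar.addfile(info, io.BytesIO(raw))


def read_crawl(fileobj):
    """Read the tgz back into a dict of request URL -> raw response bytes."""
    responses = {}
    with tarfile.open(fileobj=fileobj, mode="r:gz") as tar:
        for info in tar:
            responses[unquote(info.name)] = tar.extractfile(info).read()
    return responses
```

For example, member_name("gemini://example.org/a/b?x=1") gives
"gemini%3A%2F%2Fexample.org%2Fa%2Fb%3Fx%3D1", which is safe to use as
a flat filename and decodes back to the original request.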

Anyway. All my tabs are closed now.