Natalie Pendragon natpen at natpen.net
Sat Feb 27 13:48:44 GMT 2021
- - - - - - - - - - - - - - - - - - -
I can't believe GUS has been around for over a year at this point. There was one fairly substantial refactor about 4-5 months in, but the architecture is still *mostly* the same as it was on day one. I think that's kind of cool, because that means GUS has mostly survived a bit over an order of magnitude of Geminispace growth - the first GUS crawl indexed 26 domains! I'm not sure if you were looking for the story of GUS, or an explanation of "why not Elasticsearch," but I'll tell the story anyway in case it's interesting to anyone :)
GUS has two parts (similar to how Stephane broke down the problem in another response): there's the crawler and there's the searcher. In the beginning, the crawler wrote directly to a TF-IDF index as it crawled. Then, the searcher, when run later, was simply a reader and searcher of that same index.
On Sat, Feb 27, 2021 at 10:21:18AM +0100, Côme Chilliet wrote:
I was kind of expecting to see a solution based on an existing
search engine to emerge, such as elastic search, by implementing
only the gemini specific parts, but I looked into quite a few
projects and all were terribly complicated…
It's interesting to note that, although I chose to build slightly closer to the metal than, say, Elasticsearch, it's still basically the same tech. Elasticsearch wraps niceties around Lucene, which is a library for doing TF-IDF development. I chose to use a different TF-IDF library directly in GUS - I thought for fun, I would try one that was native to Python, to keep the project as self-contained as possible. I found Whoosh [1], and while it's had a few sharp edges, it's mostly been great to work with.
[1] https://whoosh.readthedocs.io/
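For readers unfamiliar with TF-IDF, here is a toy sketch of the idea using only the Python standard library. It is not GUS's code and does not use Whoosh's API; the function names, the example URLs, and the exact weighting (raw term count times log(N/df) + 1, one common variant) are all assumptions made for illustration.

```python
import math
from collections import Counter

def build_index(docs):
    """Map each term to {url: term count} for a dict of url -> text."""
    index = {}
    for url, text in docs.items():
        for term, count in Counter(text.lower().split()).items():
            index.setdefault(term, {})[url] = count
    return index

def search(index, n_docs, query):
    """Rank URLs by the sum of tf * idf over the query terms."""
    scores = Counter()
    for term in query.lower().split():
        postings = index.get(term, {})
        if not postings:
            continue  # term appears in no document
        # Rare terms get a higher weight than terms found everywhere.
        idf = math.log(n_docs / len(postings)) + 1.0
        for url, tf in postings.items():
            scores[url] += tf * idf
    return [url for url, _ in scores.most_common()]

# Hypothetical capsules standing in for crawled pages.
docs = {
    "gemini://a.example/": "gemini search engine notes",
    "gemini://b.example/": "cooking notes and recipes",
    "gemini://c.example/": "gemini gemini protocol notes",
}
index = build_index(docs)
results = search(index, len(docs), "gemini protocol")
```

The "crawler" half corresponds to build_index and the "searcher" half to search; a real library adds tokenizing, on-disk storage, and better scoring on top of this.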
On to the big refactor I mentioned, which happened 4-5 months in. I would best describe TF-IDF indexes as "fragile." You have to be careful about interrupting any writes, and you don't get easy transactions or rollbacks like you do with most relational databases. This became increasingly frustrating to deal with as part of the crawl, which (again, as Stephane noted) is a difficult beast to manage in the real world. Capsules get creative :) So I added one extra piece to GUS' architecture - a SQLite database. The crawl writes to the SQLite database, then the TF-IDF index is built afterwards, from what's in the database.
I think GUS is about due for another rearchitecture of the crawler. A lot of the recent hacking on GUS is being done by a few folks like Rene with geminispace.info, so I would also be curious what their ideas are, but I think there could be a lot of promise taking a similar approach to what mozz did for his Gemini archiver [2].
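The two-stage design described above can be sketched roughly as follows. This is a minimal illustration with the standard library's sqlite3 module, not GUS's actual schema or code: the table name, column names, helper functions, and example URLs are all hypothetical. It shows an interrupted crawl batch being rolled back, and the index then being built only from what was committed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a real crawl would use a file on disk
conn.execute("CREATE TABLE page (url TEXT PRIMARY KEY, body TEXT NOT NULL)")

def store_pages(conn, pages):
    """Write a batch of crawled pages as one transaction."""
    with conn:  # commits on success, rolls back if an exception escapes
        for url, body in pages:
            conn.execute(
                "INSERT OR REPLACE INTO page (url, body) VALUES (?, ?)",
                (url, body),
            )

def flaky_fetch():
    """Hypothetical crawl batch that fails after its first page."""
    yield ("gemini://b.example/", "partial write")
    raise ConnectionError("capsule hung up mid-crawl")

store_pages(conn, [("gemini://a.example/", "hello gemini")])
try:
    store_pages(conn, flaky_fetch())
except ConnectionError:
    pass  # the half-finished batch was rolled back

def build_index(conn):
    """Second stage: build a term -> set of URLs index from the database."""
    index = {}
    for url, body in conn.execute("SELECT url, body FROM page"):
        for term in set(body.lower().split()):
            index.setdefault(term, set()).add(url)
    return index

stored_urls = [row[0] for row in conn.execute("SELECT url FROM page")]
index = build_index(conn)
```

Because the index is derived entirely from the database, it can be thrown away and rebuilt at any time without re-crawling.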
[2] https://github.com/michael-lazar/mozz-archiver/
I also think the WARC output format is cool. It makes me think of a world in which maybe we could even get rid of the need for crawling - just provide tooling that folks can use to package their own capsule into a WARC, then they submit it. It would be a more efficient use of network bandwidth than crawling is, and it would also naturally be opt-in, which after lots of reflection, I think would be a nice property for a place like Gemini.
Well that's the story of GUS. Let me know if you have any questions or want me to expound on any of this. I will close by saying I am very excited and thankful to see new folks like Rene and Remco get so involved with this project, and I am really looking forward to the future of search in Gemini.
Warm regards,
Natalie
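To give a feel for what "package your own capsule into a WARC" might involve, here is a simplified sketch that serializes one page as a WARC 1.0 "resource" record using only the standard library. It is an assumption-laden illustration, not mozz-archiver's output format or any existing submission tool: the function name is hypothetical, the choice of the "resource" record type is a guess, and real tooling would more likely rely on an established WARC library.

```python
import uuid
from datetime import datetime, timezone

def warc_resource_record(uri, body, content_type="text/gemini"):
    """Serialize one page as a single WARC 1.0 'resource' record (bytes)."""
    payload = body.encode("utf-8")
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.0",
        "WARC-Type: resource",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {now}",
        f"WARC-Target-URI: {uri}",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(payload)}",  # length of the payload in bytes
    ]
    # Header lines end in CRLF, a blank line separates them from the
    # payload, and two CRLFs terminate the record.
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    return head + payload + b"\r\n\r\n"

# Hypothetical capsule page.
record = warc_resource_record("gemini://a.example/", "# Hello\n")
```

A capsule owner could concatenate one such record per page into a single file and submit that, instead of waiting to be crawled.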