Archived view for gemi.dev › gemini-mailing-list › 000645.gmi, captured on 2023-12-28 at 15:50:31. Gemini links have been rewritten to link to archived content.
-=-=-=-=-=-=-
Hi,

after a few days of public testing I'm happy to announce the launch of geminispace.info, a new search provider for the Gemini protocol:
gemini://geminispace.info

This is not about reinventing the wheel: geminispace.info is an independent instance of the great GUS by Natalie Pendragon and contributors:
https://natpen.net/code/gus

This is NOT a fork of GUS; I intend to send patches to upstream if appropriate. First patches have already been submitted for discussion.

Updates of the search index are currently scheduled to happen twice a week. We'll see how this goes as geminispace keeps growing.

Feedback welcome.

regards
René
On 2021-01-29 10:56, René Wagner wrote:

> Hi,
>
> after a few days of public testing I'm happy to announce the launch of
> geminispace.info, a new search provider for the Gemini protocol:
> gemini://geminispace.info

Nice! It's always good to see more search engines.

> This is NOT a fork of GUS; I intend to send patches to upstream
> if appropriate. First patches have already been submitted for discussion.

That's a good thing too.

Your search engine seems to work really well. Thanks for hosting this!

--
Laërte
Hey!

> Nice! It's always good to see more search engines.

Yep, Gemini has grown reasonably. I think it's about time to distribute often-used services like search a bit, and not just burden Natalie with keeping GUS running.

> Your search engine seems to work really well. Thanks for hosting this!

You're welcome. There may have been some issues when using IPv6 connections; these should be resolved now.

Regards
René
On Fri, Jan 29, 2021 at 10:56:29AM +0100, René Wagner <rwagner at rw-net.de> wrote a message of 20 lines which said:

> after a few days of public testing I'm happy to announce the launch of
> geminispace.info, a new search provider for the Gemini protocol:
> gemini://geminispace.info

It no longer works for me, returning "42 An unexpected error occurred" for every search. Same thing everywhere?

Shameless advertisement: automatic monitoring of servers is great <gemini://gemini.bortzmeyer.org/software/manisha/>.
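For readers unfamiliar with the error quoted above: a Gemini response starts with a header line of the form `<STATUS><SPACE><META>`, and the 4x range signals a temporary failure. The sketch below shows how a monitoring script might split such a header. The function names and the "healthy" rule are assumptions made for illustration; this is not how Manisha or GUS does it.

```python
def parse_gemini_header(line: str) -> tuple[int, str]:
    """Split a Gemini response header '<STATUS><SPACE><META>' into its parts."""
    status_text, _, meta = line.rstrip("\r\n").partition(" ")
    if len(status_text) != 2 or not (status_text.isascii() and status_text.isdigit()):
        raise ValueError(f"malformed Gemini header: {line!r}")
    return int(status_text), meta


def is_healthy(status: int) -> bool:
    """Hypothetical monitoring rule: 1x (input), 2x (success), 3x (redirect) count as up."""
    return 10 <= status < 40


# The header reported in the message above.
status, meta = parse_gemini_header("42 An unexpected error occurred\r\n")
print(status, meta, is_healthy(status))
```

A monitor built on this would alert on any 4x or 5x status, which would have caught the failing search endpoint while the backlinks endpoint still answered normally.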
On 2021-02-26 14:32, Stephane Bortzmeyer wrote:

> It no longer works for me, returning "42 An unexpected error
> occurred" for every search. Same thing everywhere?

Indeed. "Search" does not work, but "Query backlinks" works.

--
Laërte
Hi,

I didn't realize that Stephane wrote to the mailing list, and answered him directly at his personal mail. So here we go again. ;)

It will work again once the currently running indexing is done. Unfortunately this is a shortcoming of the current GUS implementation. Crawling and updating the search index are separate steps, and the latter one locks the search index database.

Unfortunately, as geminispace grows, this becomes more of a pain than before, because indexing takes more time. It seems that a mirror of a webpage with a huge archive popped up a few days ago. I had to stop the crawl as it was still fetching this archive; I probably need to exclude this mirror until we are able to improve the performance of crawling/indexing.

Unfortunately I'm not that familiar with Python, so it may take some time. Especially the data gathering parts (crawling/indexing) are currently bottlenecks. They are strictly sequential and single-threaded, "one page at a time", which will prolong these processes increasingly. But there are some more issues which arise as geminispace keeps growing: GUS was not designed to index large capsules such as mirrors of webpages.

If you are interested in helping out with Python coding feel free to join, any help is welcome:
https://src.clttr.info/rwa/geminispace.info/issues

I'm not sure if it's feasible to improve GUS in a sustainable way or if we need to start over and come up with a new design that honors the growth of geminispace (and is likely much more complex than GUS currently is).

regards
René
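One common way to keep search available while an index is being rebuilt is to build the new index beside the live one and swap it in atomically. The sketch below illustrates that idea with a toy JSON inverted index; it is not GUS code and does not use Whoosh. The file layout and the names `rebuild_index` and `search` are assumptions for illustration, and a directory-based index would need a directory or symlink swap instead of a single-file rename.

```python
import json
import os
import tempfile


def rebuild_index(pages: dict[str, str], index_path: str) -> None:
    """Build a fresh index next to the live one, then swap it in atomically."""
    index: dict[str, list[str]] = {}
    for url, text in pages.items():
        for word in set(text.lower().split()):
            index.setdefault(word, []).append(url)
    # Write to a temporary file in the same directory so the rename below
    # stays on one filesystem.
    directory = os.path.dirname(os.path.abspath(index_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(index, handle)
    # Readers see either the old file or the new one, never a half-written index.
    os.replace(tmp_path, index_path)


def search(index_path: str, word: str) -> list[str]:
    """Look up a single word in whichever index file is currently live."""
    with open(index_path, encoding="utf-8") as handle:
        return sorted(json.load(handle).get(word.lower(), []))


with tempfile.TemporaryDirectory() as workdir:
    live = os.path.join(workdir, "index.json")
    rebuild_index({"gemini://a.example/": "hello gemini"}, live)
    print(search(live, "hello"))
    rebuild_index({"gemini://b.example/": "hello again"}, live)
    print(search(live, "hello"))
```

The trade-off is disk space: two copies of the index exist for the duration of the rebuild.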
I was kind of expecting to see a solution based on an existing search engine to emerge, such as elastic search, by implementing only the gemini specific parts, but I looked into quite a few projects and all were terribly complicated?
On Sat, Feb 27, 2021 at 10:21:18AM +0100, Côme Chilliet <come at chilliet.eu> wrote a message of 4 lines which said:

> I was kind of expecting to see a solution based on an existing
> search engine to emerge, such as elastic search, by implementing
> only the gemini specific parts, but I looked into quite a few
> projects and all were terribly complicated?

A search engine service has three parts: the crawler, the indexer and the querier (the one the user interacts with). ElasticSearch could be a good idea for the last two (at least the second and maybe part of the third). You still have to write the crawler and, speaking from experience, this is not a one-weekend project. At the beginning, it is: you have a prototype running quite rapidly. But then, in the real world, a lot of problems happen. My "favorite" is capsules accepting TCP, completing the TLS handshake, but then not replying to queries. There are also endless redirections and other "funny" stuff. A crawler has to be paranoid!

Managing such a beast takes time, and the growth of the geminispace (47 capsules added yesterday, a new record, including one in Catalan, apparently the first one) requires that you plan in advance: what works today won't in a few months.
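The two failure modes named above (a capsule that completes the TLS handshake and then goes silent, and endless redirections) can be guarded against with a socket timeout and a redirect budget. The following is a minimal sketch under stated assumptions, not the code of Lupa or GUS: the limits, the `CrawlError` class and both function names are made up for illustration, certificate verification is skipped because capsules commonly use self-signed certificates, and redirect targets are assumed to be absolute URLs.

```python
import socket
import ssl
from urllib.parse import urlsplit

MAX_REDIRECTS = 5        # assumed budget for this sketch
TIMEOUT_SECONDS = 10     # assumed; applies to each connect, handshake and read call
MAX_HEADER_BYTES = 1029  # 2 status digits + space + up to 1024 bytes of meta + CRLF


class CrawlError(Exception):
    """Raised when a capsule misbehaves in a way the crawler refuses to follow."""


def fetch_header(url):
    """Fetch only the response header of a Gemini URL, giving up on silence."""
    parts = urlsplit(url)
    context = ssl.create_default_context()
    context.check_hostname = False       # sketch only: no certificate pinning here
    context.verify_mode = ssl.CERT_NONE
    address = (parts.hostname, parts.port or 1965)
    with socket.create_connection(address, timeout=TIMEOUT_SECONDS) as raw:
        with context.wrap_socket(raw, server_hostname=parts.hostname) as tls:
            tls.sendall(url.encode("utf-8") + b"\r\n")
            header = b""
            while not header.endswith(b"\r\n"):
                if len(header) >= MAX_HEADER_BYTES:
                    raise CrawlError("header too long")
                chunk = tls.recv(1)      # raises a timeout error if the server stalls
                if not chunk:
                    raise CrawlError("connection closed before the header ended")
                header += chunk
    status, _, meta = header.decode("utf-8", "replace").rstrip("\r\n").partition(" ")
    if not status.isdigit():
        raise CrawlError(f"malformed header: {header!r}")
    return int(status), meta


def resolve(url, fetch=fetch_header, max_redirects=MAX_REDIRECTS):
    """Follow 3x redirects, refusing loops and chains longer than max_redirects."""
    seen = {url}
    for _ in range(max_redirects + 1):
        status, meta = fetch(url)
        if not 30 <= status < 40:
            return url, status, meta
        if meta in seen:
            raise CrawlError(f"redirect loop at {meta}")
        seen.add(meta)
        url = meta
    raise CrawlError(f"more than {max_redirects} redirects")
```

Passing the fetch function as a parameter keeps `resolve` testable without a network: a dictionary lookup can stand in for a capsule. Note that the timeout here is per socket operation, so a server that drips one byte at a time could still hold a connection open for a long while; a real crawler would probably add an overall deadline as well.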
I can't believe GUS has been around for over a year at this point. There was one fairly substantial refactor about 4-5 months in, but the architecture is still *mostly* the same as it was on day one. I think that's kind of cool, because that means GUS has mostly survived a bit over an order of magnitude of Geminispace growth - the first GUS crawl indexed 26 domains!

I'm not sure if you were looking for the story of GUS, or an explanation of "why not Elasticsearch," but I'll tell the story anyway in case it's interesting to anyone :)

GUS has two parts (similar to how Stephane broke down the problem in another response): there's the crawler and there's the searcher. In the beginning, the crawler wrote directly to a TF-IDF index as it crawled. Then, the searcher, when run later, was simply a reader and searcher of that same index.

On Sat, Feb 27, 2021 at 10:21:18AM +0100, Côme Chilliet wrote:

> I was kind of expecting to see a solution based on an existing
> search engine to emerge, such as elastic search, by implementing
> only the gemini specific parts, but I looked into quite a few
> projects and all were terribly complicated?

It's interesting to note that, although I chose to build slightly closer to the metal than, say, Elasticsearch, it's still basically the same tech. Elasticsearch wraps niceties around Lucene, which is a library for doing TF-IDF development. I chose to use a different TF-IDF library directly in GUS - I thought for fun, I would try one that was native to Python, to keep the project as self-contained as possible. I found Whoosh [1], and while it's had a few sharp edges, it's mostly been great to work with.

[1] https://whoosh.readthedocs.io/

On to the big refactor I mentioned, which happened 4-5 months in. TF-IDF indexes I would best describe as "fragile." You have to be careful about interrupting any writes; you don't get easy transactions or rollbacks like you do with most relational databases.
This became increasingly frustrating to have to deal with as part of the crawl, which (again, as Stephane said) is a difficult beast to manage in the real world. Capsules get creative :)

So I added one extra piece to GUS' architecture - a SQLite database. The crawl writes to the SQLite database, then the TF-IDF index is built afterwards, from what's in the database.

I think GUS is about due for another rearchitecture of the crawler. A lot of the recent hacking on GUS is being done by a few folks like Rene with geminispace.info, so I would also be curious what their ideas are, but I think there could be a lot of promise in taking a similar approach to what mozz did for his Gemini archiver [2].

[2] https://github.com/michael-lazar/mozz-archiver/

I also think the WARC output format is cool. It makes me think of a world in which maybe we could even get rid of the need for crawling - just provide tooling that folks can use to package their own capsule into a WARC, then they submit it. It would be a more efficient use of network bandwidth than crawling is, and it would also naturally be opt-in, which after lots of reflection, I think would be a nice property for a place like Gemini.

Well, that's the story of GUS. Let me know if you have any questions or want me to expound on any of this. I will close by saying I am very excited and thankful to see new folks like Rene and Remco get so involved with this project, and I am really looking forward to the future of search in Gemini.

Warm regards,
Natalie
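The two-step design described in this message (the crawl writes to SQLite, then the TF-IDF index is built from the database) can be sketched with the Python standard library alone. This is an illustration of the idea, not GUS code: the table layout, the function names and the hand-rolled TF-IDF weighting are all assumptions, and GUS itself uses Whoosh for the index.

```python
import math
import sqlite3
from collections import Counter


def store_page(db, url, body):
    """Crawl step: record a fetched page. The transaction rolls back on error."""
    with db:
        db.execute(
            "INSERT OR REPLACE INTO page (url, body) VALUES (?, ?)", (url, body)
        )


def build_index(db):
    """Index step: compute TF-IDF weights from whatever the database holds."""
    docs = {
        url: Counter(body.lower().split())
        for url, body in db.execute("SELECT url, body FROM page")
    }
    # Number of documents each term appears in.
    doc_freq = Counter(term for counts in docs.values() for term in counts)
    index = {}
    for url, counts in docs.items():
        total = sum(counts.values())
        for term, count in counts.items():
            idf = math.log(len(docs) / doc_freq[term])
            index.setdefault(term, {})[url] = (count / total) * idf
    return index


def search(index, term):
    """Return the URLs containing the term, highest weight first."""
    postings = index.get(term.lower(), {})
    return sorted(postings, key=postings.get, reverse=True)


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE page (url TEXT PRIMARY KEY, body TEXT)")
store_page(db, "gemini://a.example/", "gemini search engine")
store_page(db, "gemini://b.example/", "gemini gemini capsule about cats")
store_page(db, "gemini://c.example/", "a gopher hole")
index = build_index(db)
print(search(index, "gemini"))
```

Because the index is derived entirely from the database, an interrupted crawl leaves the database consistent, and a damaged index can simply be rebuilt from it.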
On Sat, Feb 27, 2021 at 11:16:46AM +0100, Stephane Bortzmeyer wrote:

> the third). You still have to write the crawler and, speaking from
> experience, this is not a one-weekend project. At the beginning, it
> is: you have a prototype running quite rapidly. But then, in the real
> world, a lot of problems happen. My "favorite" is capsules accepting
> TCP, completing the TLS handshake, but then not replying to queries.
> There are also endless redirections and other "funny" stuff.

Indeed. I remember many years ago writing a crawler for the web and finding similar challenges, despite initially getting excited with a barebones prototype of just a few lines of code. It's not easy to handle all the exceptions to the regular behaviour.

> growth of the geminispace (47 capsules added yesterday, a new record,
> including one in Catalan, apparently the first one) requires that you
> plan in advance: what works today won't in a few months.

Thank you so much Stéphane for Lupa and the statistics you provide, it's an incredibly interesting way to see the evolution of Gemini:

gemini://gemini.bortzmeyer.org/software/lupa/stats.gmi

--
Vasco Costa

AKA gluon. Enthusiastic about computers, motorsports, science, technology, travelling and TV series. Yes I'm a bit of a geek.

Gemini: gemini://gluonspace.com/
Gopher: gopher://gopher.geeksphere.tk/
---