💾 Archived View for gemi.dev › gemini-mailing-list › 000645.gmi captured on 2024-08-19 at 01:31:41. Gemini links have been rewritten to link to archived content

-=-=-=-=-=-=-

[users] [announce] geminispace.info - alternative search provider

📧 Messages: 10
🗣️ Authors: 6
📅 First Message: 2021-01-29 09:56
📅 Last Message: 2021-02-27 14:39

1. René Wagner (rwagner (a) rw-net.de)

📅 Sent: 2021-01-29 09:56
📧 Message 1 of 10

Hi,

after a few days of public testing I'm happy to announce the launch of
geminispace.info, a new search provider for the Gemini protocol:
gemini://geminispace.info

This is not about reinventing the wheel: geminispace.info is an
independent instance of the great GUS by Natalie Pendragon
and contributors: https://natpen.net/code/gus

This is NOT a fork of GUS; I intend to send patches upstream
where appropriate. The first patches have already been submitted for discussion.

Updates of the search index are currently scheduled to happen twice a week.
We'll see how this goes when geminispace keeps growing. 

Feedback welcome.

regards
René

2. contact (a) medusae.space (contact (a) medusae.space)

📅 Sent: 2021-01-30 20:27
📧 Message 2 of 10

On 2021-01-29 10:56, René Wagner wrote:
> Hi,
> 
> after a few days of public testing I'm happy to announce the launch of
> geminispace.info, a new search provider for the Gemini protocol:
> gemini://geminispace.info

Nice! It's always good to see more search engines.

> This is NOT a fork of GUS; I intend to send patches upstream
> where appropriate. The first patches have already been submitted for discussion.

That's a good thing too.

Your search engine seems to work really well. Thanks for hosting this!

-- 
Laërte

3. René Wagner (rwagner (a) rw-net.de)

📅 Sent: 2021-01-31 11:51
📧 Message 3 of 10

Hey!

> Nice! It's always good to see more search engines.
Yep, Gemini has grown reasonably; I think it's about time to distribute
frequently used services like search a bit, and not just burden Natalie
with keeping GUS running.
 
> Your search engine seems to work really well. Thanks for hosting this!
You're welcome.

There may have been some issues when using IPv6 connections;
these should be resolved now.

Regards
René

4. Stephane Bortzmeyer (stephane (a) sources.org)

📅 Sent: 2021-02-26 13:32
📧 Message 4 of 10

On Fri, Jan 29, 2021 at 10:56:29AM +0100,
 René Wagner <rwagner at rw-net.de> wrote 
 a message of 20 lines which said:

> after a few days of public testing I'm happy to announce the launch of
> geminispace.info, a new search provider for the Gemini protocol:
> gemini://geminispace.info

It no longer works, for me, returning "42 An unexpected error
occurred" for every search. Same thing everywhere?

Shameless advertisement: automatic monitoring of servers is
great <gemini://gemini.bortzmeyer.org/software/manisha/>.

5. contact (a) medusae.space (contact (a) medusae.space)

📅 Sent: 2021-02-26 16:16
📧 Message 5 of 10

On 2021-02-26 14:32, Stephane Bortzmeyer wrote:
> It no longer works, for me, returning "42 An unexpected error
> occurred" for every search. Same thing everywhere?

Indeed. "Search" does not work, but "Query backlinks" works.

-- 
Laërte

6. René Wagner (rwagner (a) rw-net.de)

📅 Sent: 2021-02-26 17:17
📧 Message 6 of 10

Hi,

I didn't realize that Stephane had written to the mailing list, and I
answered him directly at his personal address. So here we go again. ;)


It will work again - once the currently running indexing is done.
Unfortunately this is a shortcoming of the current GUS implementation.

Crawling and updating the search index are separate steps, and the latter
locks the search index database.
Unfortunately, as geminispace grows, this becomes more of a pain than before
because indexing takes more time.

It seems that a mirror of a webpage with a huge archive popped up a few
days ago. I had to stop the crawl as it was still fetching this archive; I
will probably need to exclude this mirror until we are able to improve the
performance of crawling/indexing.
Unfortunately I'm not that familiar with Python, so it may take some time.

The data gathering parts (crawling/indexing) in particular are currently bottlenecks.
They are strictly sequential and single-threaded, "one page at a time",
so these processes will take longer and longer.
But more issues arise as geminispace keeps growing:
GUS was not designed to index large capsules such as mirrors of webpages.
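
A rough sketch of the difference between the "one page at a time" loop described above and fetching several pages concurrently. This is not GUS code: `fetch_page` is a hypothetical stand-in that only simulates network latency, the URLs are placeholders, and the worker count is an arbitrary assumption.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_page(url):
    """Hypothetical stand-in for a real Gemini request; only simulates latency."""
    time.sleep(0.05)
    return url, f"content of {url}"


def crawl_sequential(urls):
    # One page at a time: total time grows linearly with the number of pages.
    return dict(fetch_page(url) for url in urls)


def crawl_concurrent(urls, workers=8):
    # Several requests in flight at once. Crawling is I/O-bound,
    # so plain threads are enough to overlap the waiting.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(fetch_page, urls))


# Placeholder URLs, for illustration only.
urls = [f"gemini://example.org/page{i}.gmi" for i in range(16)]
pages = crawl_concurrent(urls)
```

A real crawler would additionally need to limit the request rate per host, so that concurrency does not turn into hammering a single capsule.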

If you are interested in helping out with Python coding, feel free to join;
any help is welcome:
https://src.clttr.info/rwa/geminispace.info/issues

I'm not sure if it's feasible to improve GUS in a sustainable way or if we need
to start over and come up with a new design that honors the growth of geminispace
(and is likely much more complex than GUS currently is).

regards
René

7. Côme Chilliet (come (a) chilliet.eu)

📅 Sent: 2021-02-27 09:21
📧 Message 7 of 10

I was kind of expecting to see a solution based on an existing search
engine to emerge, such as Elasticsearch, by implementing only the Gemini-
specific parts, but I looked into quite a few projects and all were terribly complicated.

8. Stephane Bortzmeyer (stephane (a) sources.org)

📅 Sent: 2021-02-27 10:16
📧 Message 8 of 10

On Sat, Feb 27, 2021 at 10:21:18AM +0100,
 Côme Chilliet <come at chilliet.eu> wrote 
 a message of 4 lines which said:

> I was kind of expecting to see a solution based on an existing
> search engine to emerge, such as Elasticsearch, by implementing
> only the Gemini-specific parts, but I looked into quite a few
> projects and all were terribly complicated.

A search engine service has three parts: the crawler, the indexer and
the querier (the one the user interacts with). Elasticsearch could be
a good idea for the last two (at least the second and maybe part of
the third). You still have to write the crawler and, speaking from
experience, this is not a one-weekend project. At the beginning, it
is: you have a prototype running quite rapidly, but then, in the real
world, a lot of problems happen. My "favorite" is capsules accepting
TCP, completing the TLS handshake, but then not replying to queries,
but there are also endless redirections and other "funny" stuff. A
crawler has to be paranoid! Managing such a beast takes time, and the
growth of the geminispace (47 capsules added yesterday, a new record,
including one in Catalan, apparently the first one) requires that you
plan in advance: what works today won't in a few months.
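
To make "paranoid" concrete, here is a minimal sketch of a fetcher that guards against the two failure modes named above: a capsule that goes silent after the TLS handshake (socket timeout) and endless redirections (loop detection plus a hard cap). It is an illustration only, not code from any existing crawler; the names `request_once`, `fetch` and `CrawlError` and all the limits are assumptions.

```python
import socket
import ssl
import urllib.parse
from urllib.parse import urljoin, urlsplit

MAX_REDIRECTS = 5      # assumed cap on redirect chains
TIMEOUT = 10           # seconds; assumed limit for connect and for each read
MAX_BYTES = 1_000_000  # assumed cap on response size

# urllib does not know the gemini scheme; register it so that
# urljoin can resolve relative redirect targets.
for registry in (urllib.parse.uses_relative, urllib.parse.uses_netloc):
    if "gemini" not in registry:
        registry.append("gemini")


class CrawlError(Exception):
    """Raised when a capsule misbehaves in any of the guarded ways."""


def request_once(url):
    """Send a single Gemini request and return (status, meta, body)."""
    parts = urlsplit(url)
    context = ssl.create_default_context()
    # Many capsules use self-signed certificates, so this sketch
    # skips CA verification entirely.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    address = (parts.hostname, parts.port or 1965)
    with socket.create_connection(address, timeout=TIMEOUT) as sock:
        with context.wrap_socket(sock, server_hostname=parts.hostname) as tls:
            tls.sendall(url.encode("utf-8") + b"\r\n")
            data = b""
            while len(data) < MAX_BYTES:
                chunk = tls.recv(4096)  # times out if the server goes silent
                if not chunk:
                    break
                data += chunk
    header, _, body = data.partition(b"\r\n")
    status, _, meta = header.decode("utf-8", "replace").partition(" ")
    return status, meta, body


def fetch(url, request=request_once):
    """Follow redirects, refusing loops and overly long chains."""
    seen = set()
    for _ in range(MAX_REDIRECTS + 1):
        if url in seen:
            raise CrawlError(f"redirect loop at {url}")
        seen.add(url)
        try:
            status, meta, body = request(url)
        except OSError as exc:  # refused, TLS failure, or read timeout
            raise CrawlError(f"{url}: {exc}") from exc
        if status.startswith("3"):  # 3x status codes are redirects
            url = urljoin(url, meta)
            continue
        return status, meta, body
    raise CrawlError(f"more than {MAX_REDIRECTS} redirects")
```

The `request` parameter exists so the redirect logic can be exercised with a fake transport, without touching the network.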

9. Natalie Pendragon (natpen (a) natpen.net)

📅 Sent: 2021-02-27 13:48
📧 Message 9 of 10

I can't believe GUS has been around for over a year at this point.
There was one fairly substantial refactor about 4-5 months in, but the
architecture is still *mostly* the same as it was on day one. I think
that's kind of cool, because that means GUS has mostly survived a bit
over an order of magnitude of Geminispace growth - the first GUS crawl
indexed 26 domains! I'm not sure if you were looking for the story of
GUS, or an explanation of "why not Elasticsearch," but I'll tell the
story anyway in case it's interesting to anyone :)

GUS has two parts (similar to how Stephane broke down the problem in
another response): there's the crawler and there's the searcher. In
the beginning, the crawler wrote directly to a TF-IDF index as it
crawled. Then, the searcher, when run later, was simply a reader and
searcher of that same index.

On Sat, Feb 27, 2021 at 10:21:18AM +0100, Côme Chilliet wrote:
> I was kind of expecting to see a solution based on an existing
> search engine to emerge, such as Elasticsearch, by implementing
> only the Gemini-specific parts, but I looked into quite a few
> projects and all were terribly complicated.

It's interesting to note that, although I chose to build slightly
closer to the metal than, say, Elasticsearch, it's still basically the
same tech. Elasticsearch wraps niceties around Lucene, which is a
library for doing TF-IDF development. I chose to use a different
TF-IDF library directly in GUS - I thought for fun, I would try one
that was native to Python, to keep the project as self-contained as
possible. I found Whoosh [1], and while it's had a few sharp edges,
it's mostly been great to work with.

[1] https://whoosh.readthedocs.io/
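
For readers unfamiliar with the term, TF-IDF ranking can be sketched in a few lines of plain Python. This is a textbook variant for illustration, not the scoring formula Whoosh or Lucene actually apply, and not GUS code; the function names and the example documents are made up.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def build_index(docs):
    """docs maps url -> text. Returns per-document term counts and
    the number of documents each term appears in."""
    term_counts = {url: Counter(tokenize(text)) for url, text in docs.items()}
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())
    return term_counts, doc_freq


def search(query, term_counts, doc_freq):
    """Rank documents by the sum of tf * idf over the query terms."""
    n_docs = len(term_counts)
    scores = {}
    for url, counts in term_counts.items():
        total = sum(counts.values())
        score = 0.0
        for term in tokenize(query):
            if term in counts:
                tf = counts[term] / total                # frequent in this document
                idf = math.log(n_docs / doc_freq[term])  # rare across documents
                score += tf * idf
        if score > 0:
            scores[url] = score
    return sorted(scores, key=scores.get, reverse=True)


# Made-up documents, for illustration only.
docs = {
    "gemini://a.example/": "gemini search engine for gemini capsules",
    "gemini://b.example/": "a gopher hole about motorsports",
    "gemini://c.example/": "notes on the gemini protocol",
}
term_counts, doc_freq = build_index(docs)
results = search("gemini search", term_counts, doc_freq)
```

Here `results` lists the first document ahead of the third, since it matches both query terms, and omits the second, which matches neither.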

On to the big refactor I mentioned, which happened 4-5 months in. TF-IDF
indexes I would best describe as "fragile." You have to be careful
about interrupting any writes, and you don't get easy transactions or
rollbacks like you do with most relational databases. This became
increasingly frustrating to have to deal with as part of the crawl,
which (again, as Stephane said) is a difficult beast to manage in the real
world. Capsules get creative :) So I added one extra piece to GUS'
architecture - a SQLite database. The crawl writes to the SQLite
database, then the TF-IDF index is built afterwards, from what's in
the database.
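
The two-phase shape described above can be sketched roughly as follows. This is an illustration, not GUS code: the `page` table, its columns and the function names are invented, and a plain in-memory inverted index stands in for the real TF-IDF index.

```python
import sqlite3
from collections import defaultdict


def open_store(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS page (url TEXT PRIMARY KEY, body TEXT NOT NULL)"
    )
    return db


def store_pages(db, pages):
    # Phase 1 (crawl): write inside a transaction. If anything fails
    # midway, "with db" rolls back and the database stays consistent.
    with db:
        db.executemany(
            "INSERT OR REPLACE INTO page (url, body) VALUES (?, ?)", pages
        )


def rebuild_index(db):
    # Phase 2 (index): built afterwards, purely from what the database
    # holds, so an interrupted crawl never leaves a half-written index.
    index = defaultdict(set)
    for url, body in db.execute("SELECT url, body FROM page"):
        for term in body.lower().split():
            index[term].add(url)
    return index


db = open_store()
store_pages(db, [("gemini://a.example/", "hello gemini")])
try:
    # The second row violates NOT NULL, so the whole batch is rolled back.
    store_pages(db, [("gemini://b.example/", "more text"),
                     ("gemini://c.example/", None)])
except sqlite3.IntegrityError:
    pass
index = rebuild_index(db)
```

After the failed batch, the index built from the database still contains only the first page, which is the kind of easy rollback the bare index did not offer.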

I think GUS is about due for another rearchitecture of the crawler. A
lot of the recent hacking on GUS is being done by a few folks like
Rene with geminispace.info, so I would also be curious what their
ideas are, but I think there could be a lot of promise taking a
similar approach to what mozz did for his Gemini archiver [2].

[2] https://github.com/michael-lazar/mozz-archiver/

I also think the WARC output format is cool. It makes me think of a
world in which maybe we could even get rid of the need for crawling -
just provide tooling that folks can use to package their own capsule
into a WARC, then they submit it. It would be a more efficient use of
network bandwidth than crawling is, and it would also naturally be
opt-in, which after lots of reflection, I think would be a nice
property for a place like Gemini.

Well that's the story of GUS. Let me know if you have any questions or
want me to expound on any of this. I will close by saying I am very
excited and thankful to see new folks like Rene and Remco get so
involved with this project, and I am really looking forward to the
future of search in Gemini.

Warm regards,
Natalie

10. Vasco Costa (vasco.costa (a) gmx.com)

📅 Sent: 2021-02-27 14:39
📧 Message 10 of 10

On Sat, Feb 27, 2021 at 11:16:46AM +0100, Stephane Bortzmeyer wrote:
> the third). You still have to write the crawler and, speaking from
> experience, this is not a one-weekend project. At the beginning, it
> is, you have a prototype running quite rapidly but then, in the real
> world, a lot of problems happen. My "favorite" is capsules accepting
> TCP, completing the TLS handshake, but then not replying to queries
> but there are also endless redirections and other "funny" stuff.

Indeed. I remember many years ago writing a crawler for the web and
finding similar challenges, despite initially getting excited with a
barebones prototype with just a few lines of code. It's not easy to
handle all the exceptions to the regular behaviour.

> growth of the geminispace (47 capsules added yesterday, a new record,
> including one in Catalan, apparently the first one) requires that you
> plan in advance: what works today won't in a few months.

Thank you so much, Stéphane, for Lupa and the statistics you provide; it's
an incredibly interesting way to see the evolution of Gemini:

gemini://gemini.bortzmeyer.org/software/lupa/stats.gmi

--
Vasco Costa

AKA gluon. Enthusiastic about computers, motorsports, science,
technology, travelling and TV series. Yes I'm a bit of a geek.

Gemini: gemini://gluonspace.com/
Gopher: gopher://gopher.geeksphere.tk/
