👾 krixano

Currently running my crawler. It's quite slow right now, but it works. If anyone has problems with it hammering your server (the IP of the crawler is 67.60.37.132), do tell me so I can figure something out.

3 years ago · 👍 skyjake, aka_dude


4 Replies

👾 krixano

@skyjake You can try what I have for my Search Engine so far here: gemini://pon.ix.tc/searchengine/

So far 1393 pages have been indexed. The ranking/scoring system hasn't been implemented yet, and neither have backlinks, but those are coming soon. · 3 years ago

gemini://pon.ix.tc/searchengine/

👾 skyjake

I see! I find crawling and indexing an interesting problem, but don't have the time or resources to work on it myself... Good luck and keep us posted. 👍 · 3 years ago

👾 krixano

@skyjake Right now I'm just trying to make it fast enough that it doesn't take 111 days to crawl all of geminispace, lol. It will handle big files - right now I just download a file until I hit a size limit, then I close the connection before the rest of the file can download - but I'll probably change this in different ways in the future. It is intended to run continuously - it's for a new Gemini search engine I am making. Currently, capsules are randomly spread out due to the random ordering of hash tables, so hitting one capsule a bunch of times at once is less frequent (this was kind of an accidental optimization, lol). · 3 years ago
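For readers curious what the size-capped download looks like in practice, here is a minimal sketch in Go. The thread doesn't say what language or limit krixano's crawler uses; the 1 MiB cap, the function names, and the TOFU-style certificate handling below are all assumptions.

```go
package main

import (
	"bufio"
	"crypto/tls"
	"fmt"
	"io"
	"log"
	"net/url"
)

// maxBodySize is a hypothetical cap; the post doesn't say what limit is used.
const maxBodySize = 1 << 20 // 1 MiB

// fetchCapped fetches a Gemini URL but reads at most maxBodySize bytes of the
// body, then closes the connection so the rest of a large file never transfers.
func fetchCapped(rawURL string) ([]byte, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return nil, err
	}
	host := u.Host
	if u.Port() == "" {
		host += ":1965" // default Gemini port
	}
	// Geminispace commonly uses self-signed certs (TOFU), hence no verification here.
	conn, err := tls.Dial("tcp", host, &tls.Config{InsecureSkipVerify: true})
	if err != nil {
		return nil, err
	}
	defer conn.Close() // closing early aborts the rest of the transfer

	fmt.Fprintf(conn, "%s\r\n", rawURL) // a Gemini request is just the URL plus CRLF
	r := bufio.NewReader(conn)
	if _, err := r.ReadString('\n'); err != nil { // status header, e.g. "20 text/gemini"
		return nil, err
	}
	// LimitReader stops at the cap; bytes beyond it are simply never read.
	return io.ReadAll(io.LimitReader(r, maxBodySize))
}

func main() {
	body, err := fetchCapped("gemini://pon.ix.tc/searchengine/")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("fetched %d bytes\n", len(body))
}
```

The "accidental optimization" would also be consistent with Go's randomized map iteration order, assuming the URL queue is keyed by capsule in a map - iterating such a map naturally interleaves hosts instead of visiting one capsule's pages back to back.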

👾 skyjake

What are your plans for the crawler? Will it have some special intelligence for dealing with capsules that contain large/deep path trees, and/or multi-MB files? Is it intended for running continuously/autonomously, and does it have some sort of heuristics for figuring out how often content might be changing on capsules, so they can be reindexed in a timely (but not excessive) manner? 路 3 years ago