👽 marginalia

I'm interested in adapting my search engine to crawl geminispace as well, but I know a lot of people are hosting their stuff on low-power hardware like raspberry pis and whatnot, and robots.txt doesn't seem to be a thing here. What's a good, polite and non-disruptive page-fetch interval, do you guys reckon? I was thinking 1 sec per fetch, but even that may be a bit too frequent. 5s interval?

3 years ago · 👍 calgacus, lykso, kevinsan
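For a concrete sense of what enforcing such an interval involves, here is a minimal sketch in Python of a single Gemini fetch that waits out a configurable per-capsule delay before each request. The 5-second figure, the fetch_gemini name and the TOFU-style certificate handling are illustrative assumptions, not anyone's production crawler.

```python
import socket
import ssl
import time
from urllib.parse import urlparse

CRAWL_DELAY = 5.0   # seconds between requests to the same capsule (the figure discussed above)
last_fetch = {}     # host -> monotonic timestamp of the previous request


def fetch_gemini(url, timeout=10):
    """Fetch one Gemini URL, waiting out the per-capsule crawl delay first."""
    host = urlparse(url).hostname

    wait = CRAWL_DELAY - (time.monotonic() - last_fetch.get(host, float("-inf")))
    if wait > 0:
        time.sleep(wait)
    last_fetch[host] = time.monotonic()

    # Gemini protocol: TLS to port 1965, send "<url>\r\n", read the whole response.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE   # self-signed certs are the norm in geminispace (TOFU)
    with socket.create_connection((host, 1965), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            tls.sendall((url + "\r\n").encode("utf-8"))
            return tls.makefile("rb").read()
```

Switching between a 1-second and a 5-second policy is only a change to CRAWL_DELAY; the loop stays the same.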

6 Replies

👽 marginalia

@kevinsan I unfortunately don't think that's a feasible approach given how the search engine works; the bottom line is that it would entail too many database transactions. However, there's really no practical limit to how many sites I crawl at the same time, or to how much wall-clock time I spend crawling each site, so I can always just spin up more threads and run them slowly for a long time. · 3 years ago
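A rough illustration of that "many slow threads" idea (not marginalia's actual implementation): one worker per capsule, each sleeping between fetches, so per-host load stays tiny no matter how many capsules are crawled in parallel. The fetch_gemini helper and the 5-second delay are the same assumptions as in the sketch above.

```python
import threading
import time

CRAWL_DELAY = 5.0   # per-capsule politeness interval


def crawl_capsule(urls):
    """Crawl one capsule slowly: one request, then a long pause, repeat."""
    for url in urls:
        fetch_gemini(url)          # hypothetical single-fetch helper (see the first sketch)
        time.sleep(CRAWL_DELAY)    # wall-clock time per site is cheap; request rate is what matters


def crawl_all(plan):
    """plan: dict mapping hostname -> list of URLs on that capsule."""
    workers = [
        threading.Thread(target=crawl_capsule, args=(urls,), daemon=True)
        for urls in plan.values()
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```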

👽 kevinsan

My approach is to avoid crawling a single site at a time: aggregate the URLs to be fetched over multiple hosts, then randomise them and fetch sequentially. Regardless, I urge anyone with concerns to do the arithmetic. The overhead vs capacity of even modest hardware is truly negligible. · 3 years ago
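A sketch of that interleaving approach under the same assumptions as above: gather the frontier across many hosts, shuffle it, then fetch sequentially, so consecutive requests rarely land on the same capsule.

```python
import random
import time


def crawl_interleaved(frontier, pause=1.0):
    """frontier: list of Gemini URLs gathered across many different hosts.

    Shuffling spreads consecutive requests over different capsules, so even a
    fast sequential loop only touches any single host occasionally.
    """
    urls = list(frontier)
    random.shuffle(urls)
    for url in urls:
        fetch_gemini(url)   # hypothetical fetch helper (see the first sketch)
        time.sleep(pause)   # small global pause; per-host spacing comes from the shuffle
```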

👽 calgacus

@marginalia I mean you could just be so overly aggressive that you force people to use robots.txt! Not recommended though. 5 seconds is probably better than 1 second for the smol web imo. · 3 years ago

👽 marginalia

@calgacus That's great! I just haven't found any while checking manually, so I assumed people weren't using them (they are by necessity very common in the HTTP space, where you'll get absurd amounts of traffic from SEO crawlers like ahrefsbot, semrush, mj12, etc.) -- so my question about sensible crawl delays still remains though. · 3 years ago

👽 lykso

@calgacus TIL! · 3 years ago

👽 calgacus

Just so you're aware, robots.txt is definitely a thing for Gemini capsules! gemini://gemini.circumlunar.space/docs/companion/robots.gmi · 3 years ago

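For completeness, a simplified sketch of honouring a capsule's robots.txt before crawling, in the spirit of the companion spec linked above. It only handles the wildcard user-agent section and reuses the hypothetical fetch_gemini helper from the first sketch.

```python
def disallowed_paths(host):
    """Fetch gemini://host/robots.txt and return Disallow prefixes for user-agent '*'.

    Deliberately simplified: only the wildcard section is honoured.
    """
    try:
        raw = fetch_gemini(f"gemini://{host}/robots.txt").decode("utf-8", "replace")
    except OSError:
        return []                        # unreachable: assume no restrictions

    header, _, body = raw.partition("\r\n")
    if not header.startswith("20"):      # Gemini status other than success,
        return []                        # e.g. 51 not found: no robots.txt

    rules, applies = [], False
    for line in body.splitlines():
        line = line.split("#", 1)[0].strip()
        field, _, value = (s.strip() for s in line.partition(":"))
        if field.lower() == "user-agent":
            applies = value == "*"
        elif field.lower() == "disallow" and applies and value:
            rules.append(value)
    return rules


def allowed(path, rules):
    """True if the path is not under any disallowed prefix."""
    return not any(path.startswith(prefix) for prefix in rules)
```

A crawler would fetch the rules once per capsule and skip any URL whose path falls under a disallowed prefix.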