💾 Archived View for station.martinrue.com › marginalia › 046c1e55e591484ba83cf484b382768e captured on 2023-09-28 at 17:58:11. Gemini links have been rewritten to link to archived content
-=-=-=-=-=-=-
I'm interested in adapting my search engine to crawl geminispace as well, but I know a lot of people are hosting their stuff on low-power hardware like Raspberry Pis and whatnot, and I don't think robots.txt is a thing there. What do you guys reckon is a good, polite and non-disruptive page-fetch interval? I was thinking 1 sec per fetch, but that may even be a bit too frequent. 5s interval?
2 years ago · 👍 calgacus, lykso, kevinsan
@kevinsan I unfortunately don't think that's a feasible approach given how the search engine works; the bottom line is that it would entail too many database transactions. However, there is really no practical limit to how many sites I crawl at the same time, or how much wall clock time I spend crawling each site, so I can always just spin up more threads and run them slowly for a long time. · 2 years ago
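For illustration, here is a minimal Python sketch of the "more threads, each running slowly" idea: one worker thread per host, with a fixed pause between fetches to that host. This is not the search engine's actual code. The names `group_by_host`, `crawl_host` and `crawl_all` are made up for this example, `fetch` is a hypothetical callable you would supply (the Python standard library has no Gemini client), and the 5 second default is just the figure floated in the thread.

```python
import threading
import time
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_host(urls):
    """Bucket URLs by hostname so each host can get its own slow worker."""
    by_host = defaultdict(list)
    for url in urls:
        by_host[urlsplit(url).hostname].append(url)
    return dict(by_host)


def crawl_host(urls, fetch, delay_s):
    """Fetch one host's URLs in order, sleeping delay_s between fetches."""
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_s)  # politeness pause, only between fetches
        fetch(url)  # hypothetical fetch callable supplied by the caller


def crawl_all(urls, fetch, delay_s=5.0):
    """Run one thread per host; hosts proceed in parallel, each slowly."""
    threads = [
        threading.Thread(target=crawl_host, args=(host_urls, fetch, delay_s))
        for host_urls in group_by_host(urls).values()
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Total crawl time is then bounded by the largest host rather than by the number of hosts, which is the trade described above: wall clock time per site goes up, but no single capsule sees more than one request per interval.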
My approach is to avoid crawling a single site - aggregate the URLs to be fetched over multiple hosts, then randomise them and fetch sequentially. Regardless, I urge anyone with concerns to do the arithmetic. The overhead vs capacity of even modest hardware is truly negligible. · 2 years ago
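A rough Python sketch of that pooling approach, plus the request-rate arithmetic, might look like the following. The function names and the `seed` parameter are illustrative assumptions, not anyone's actual crawler; the only arithmetic used is that a day has 86,400 seconds.

```python
import random


def randomized_fetch_order(urls_by_host, seed=None):
    """Pool URLs from all hosts and shuffle them into one sequential queue.

    With many hosts in the pool, consecutive fetches rarely hit the same
    host, so no per-host timer is needed. seed is only here to make the
    order reproducible when testing.
    """
    pooled = [url for urls in urls_by_host.values() for url in urls]
    random.Random(seed).shuffle(pooled)
    return pooled


def requests_per_day(interval_s):
    """Upper bound on requests one host sees at a fixed fetch interval."""
    return 86_400 / interval_s


queue = randomized_fetch_order({
    "a.example": ["gemini://a.example/1", "gemini://a.example/2"],
    "b.example": ["gemini://b.example/1"],
    "c.example": ["gemini://c.example/1"],
})
print(len(queue))             # 4 URLs, in shuffled order
print(requests_per_day(5))    # 17280.0
print(requests_per_day(1))    # 86400.0
```

Note that shuffling only spreads load when the pool spans many hosts; with a handful of hosts the same one can still come up back to back.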
@marginalia I mean you could just be so overly aggressive that you force people to use robots.txt! Not recommended though. 5 seconds is probably better than 1 second for the smol web imo. · 2 years ago
@calgacus That's great! I just haven't found any while checking manually, so I assumed people weren't using them (they are by necessity very common in the HTTP-space, since you'll get absurd amounts of traffic from SEO crawlers like ahrefsbot, semrush, mj12, etc.). My question about sensible crawl delays still remains, though. · 2 years ago
@calgacus TIL! · 2 years ago
Just so you're aware, robots.txt is definitely a thing for Gemini capsules! gemini://gemini.circumlunar.space/docs/companion/robots.gmi · 2 years ago
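As a hedged sketch of how a crawler could honour such a file: Python's standard `urllib.robotparser` can parse the rules once the body has been fetched, and it only looks at the path of the URL, so `gemini://` URLs work. Fetching `robots.txt` over Gemini is left out here, and the sample rules are invented. The `indexer` user agent is, as far as I understand the companion document linked above, one of the "virtual" agents it suggests for search crawlers; treat that as an assumption and check the document itself.

```python
from urllib.robotparser import RobotFileParser

# Invented example rules; a real crawler would fetch the body from
# gemini://<host>/robots.txt with whatever Gemini client it uses.
ROBOTS_TXT = """\
User-agent: indexer
Disallow: /private/

User-agent: *
Disallow: /cgi-bin/
"""


def build_parser(robots_body):
    """Parse an already-fetched robots.txt body into a rule set."""
    parser = RobotFileParser()
    parser.parse(robots_body.splitlines())
    return parser


def allowed(parser, url, agent="indexer"):
    """True if the given agent may fetch url under the parsed rules."""
    return parser.can_fetch(agent, url)


parser = build_parser(ROBOTS_TXT)
print(allowed(parser, "gemini://example.org/index.gmi"))          # True
print(allowed(parser, "gemini://example.org/private/notes.gmi"))  # False
```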