
👽 marginalia

I'm interested in adapting my search engine to crawl geminispace as well, but I know a lot of people are hosting their stuff on low-power hardware like Raspberry Pis and whatnot, and robots.txt doesn't seem to be a thing. What do you guys reckon is a good, polite and non-disruptive page-fetch interval? I was thinking 1 sec per fetch, but even that may be a bit too aggressive. 5s interval?

2 years ago · 👍 calgacus, lykso, kevinsan
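
For scale, a back-of-envelope check: a 1 s interval works out to at most 3,600 requests per hour against a single host, and a 5 s interval caps it at 720. Either is a trivial load for even a Raspberry Pi serving static gemtext.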

6 Replies

👽 marginalia

@kevinsan I unfortunately don't think that's a feasible approach given how the search engine works; the bottom line is that it would entail too many database transactions. However, there's really no practical limit to how many sites I crawl at the same time, or to how much wall-clock time I spend crawling each site, so I can always just spin up more threads and run them slowly for a long time. · 2 years ago
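
A minimal sketch of that shape, assuming a hypothetical fetch function rather than marginalia's actual crawler: one slow worker per site, with parallelism across sites rather than within any one site.

```python
import threading
import time

def crawl_site(urls, fetch, delay=5.0):
    # One slow worker per site: a generous pause between requests keeps
    # the per-host load low, at the cost of long wall-clock time per site.
    for url in urls:
        fetch(url)  # hypothetical page-fetch function
        time.sleep(delay)

def crawl_many(sites, fetch, delay=5.0):
    # sites maps host -> list of URLs queued for that host.
    workers = [threading.Thread(target=crawl_site, args=(urls, fetch, delay))
               for urls in sites.values()]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```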

👽 kevinsan

My approach is to avoid crawling a single site in a burst: aggregate the URLs to be fetched across multiple hosts, then randomise them and fetch sequentially. Regardless, I urge anyone with concerns to do the arithmetic. The overhead versus the capacity of even modest hardware is truly negligible. · 2 years ago
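
A quick sketch of that interleaving idea, again with a hypothetical fetch function: with N hosts in the shuffled queue, any single host sees on average one request per N × delay seconds.

```python
import random
import time

def interleaved_urls(urls_by_host):
    # Flatten the per-host queues into one list and shuffle, so that
    # consecutive fetches only rarely land on the same host.
    urls = [url for queue in urls_by_host.values() for url in queue]
    random.shuffle(urls)
    return urls

def crawl(urls_by_host, fetch, delay=1.0):
    # Sequential fetching with a fixed pause; the shuffle spreads the
    # load across every host in the queue.
    for url in interleaved_urls(urls_by_host):
        fetch(url)  # hypothetical page-fetch function
        time.sleep(delay)
```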

👽 calgacus

@marginalia I mean you could just be so overly aggressive that you force people to use robots.txt! Not recommended though. 5 seconds is probably better than 1 second for the smol web imo. · 2 years ago

👽 marginalia

@calgacus That's great! I just haven't found any while checking manually, so I assumed people weren't using them (they are by necessity very common in HTTP-space, where you'll get absurd amounts of traffic from SEO crawlers like ahrefsbot, semrush, mj12, etc.) -- so my question about sensible crawl delays still remains, though. · 2 years ago

👽 lykso

@calgacus TIL! · 2 years ago

👽 calgacus

Just so you're aware, robots.txt is definitely a thing for Gemini capsules! gemini://gemini.circumlunar.space/docs/companion/robots.gmi · 2 years ago
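
For anyone wanting to act on that, here's a minimal sketch of fetching a capsule's robots.txt over Gemini and checking it with Python's standard library. The companion spec linked above defines virtual user-agents such as "indexer" for search-engine crawlers; the relaxed TLS settings below reflect the self-signed, trust-on-first-use certificates common in geminispace.

```python
import socket
import ssl
from urllib.robotparser import RobotFileParser

def gemini_fetch(host, url, timeout=10):
    # A Gemini request is just the URL plus CRLF, sent over TLS on port 1965.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False        # capsules typically use self-signed
    ctx.verify_mode = ssl.CERT_NONE   # certs (trust-on-first-use)
    with socket.create_connection((host, 1965), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            tls.sendall((url + "\r\n").encode("utf-8"))
            data = b""
            while chunk := tls.recv(4096):
                data += chunk
    header, _, body = data.partition(b"\r\n")
    # Status codes starting with 2 mean success; anything else, treat as absent.
    return body.decode("utf-8", errors="replace") if header.startswith(b"2") else ""

host = "gemini.circumlunar.space"
robots = gemini_fetch(host, f"gemini://{host}/robots.txt")
parser = RobotFileParser()
parser.parse(robots.splitlines())
print(parser.can_fetch("indexer", f"gemini://{host}/docs/"))
```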