👽 jsreed5

My intermittent capsule outages are being caused by what appears to be a very aggressive crawler. The capsule's robots.txt file tells bots not to index my CGI scripts, but this crawler is ignoring the file and sending multiple requests per second against my scripts, which overloads the server and causes it to crash. I've temporarily solved the problem by blocking the crawler entirely; I'll look for a more permanent solution.

6 months ago

6 Replies

👽 clseibold2

I do hope it's not my crawler, which should be following robots.txt. I am still able to go to your capsule, so I don't think it is, but do tell me if it is. · 6 months ago

👽 m0xee

@danrl I think it would be possible to adopt tools we already use for web servers, like fail2ban. The server I use, gmid, has a suitable log format, but TBH I've never even given it a thought: I don't host any CGI or anything popular at all, so for me Gemini traffic isn't a concern yet. · 6 months ago

👽 m0xee

@jsreed5 Exactly what I've been looking for! Comprehensive answer indeed, thank you! · 6 months ago

👽 danrl

i could think of rate limits by ip address (or /64 subnet in case of ipv6). you get a few hits for free, and once the bucket is empty (after let's say 10 hits) you are throttled to 1 request every 30 seconds. the bucket refills, obviously. that should match a human usage profile for a protocol that was made for reading and slow consumption. · 6 months ago
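
A minimal sketch of the bucket idea described above, in Python. The class and method names (`TokenBucketLimiter`, `client_key`, `allow`) are made up for illustration and are not taken from any Gemini server. It assumes a burst of 10 requests and a refill of one token every 30 seconds, keyed by the full address for IPv4 and by the /64 network for IPv6.

```python
import ipaddress
import time


class TokenBucketLimiter:
    """Per-client token bucket: a burst of `capacity` hits, then one hit per `refill_seconds`."""

    def __init__(self, capacity=10, refill_seconds=30.0, clock=time.monotonic):
        self.capacity = capacity
        self.refill_seconds = refill_seconds
        self.clock = clock      # injectable so the limiter can be tested without sleeping
        self.buckets = {}       # client key -> (tokens left, time of last update)

    @staticmethod
    def client_key(address):
        """Group IPv6 clients by their /64 network; keep IPv4 addresses as they are."""
        ip = ipaddress.ip_address(address)
        if ip.version == 6:
            return str(ipaddress.ip_network(f"{ip}/64", strict=False))
        return str(ip)

    def allow(self, address):
        """Return True if the request may proceed, False if the client is throttled."""
        key = self.client_key(address)
        now = self.clock()
        tokens, last = self.buckets.get(key, (float(self.capacity), now))
        # Refill in proportion to the time elapsed, never above the bucket size.
        tokens = min(float(self.capacity), tokens + (now - last) / self.refill_seconds)
        allowed = tokens >= 1.0
        if allowed:
            tokens -= 1.0
        self.buckets[key] = (tokens, now)
        return allowed
```

A server would call `allow(remote_address)` once per incoming request and answer with a temporary failure when it returns False (Gemini has status 44, SLOW DOWN, for this case). The `buckets` dictionary grows with every new client, so a real deployment would also need to evict idle entries.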

👽 jsreed5

@m0xee There are no substantive differences between Gemini robots.txt files and Web robots.txt files. The only difference for Gemini crawlers is they don't send a user agent string. favicon.txt is a proposed addition to Gemini that uses an emoji in a favicon.txt file as a site icon, the same way a Web page uses a favicon image. You can read more about robots.txt and favicon.txt files at: gemini://geminiprotocol.net/docs/companion/robots.gmi gemini://mozz.us/files/rfc_gemini_favicon.gmi · 6 months ago
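
Because the format is the same as on the Web, a well-behaved crawler can reuse an ordinary robots.txt parser. Here is a rough sketch using Python's standard `urllib.robotparser`. The `/cgi-bin/` path, the `example.org` host, and the helper name `may_fetch` are placeholders rather than the actual rules of any capsule, and fetching the file over Gemini is left out. Since no user agent string is sent, the crawler has to decide for itself which `User-agent` group applies; the sketch simply uses `*`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real capsule's paths will differ.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
"""


def may_fetch(robots_txt, url, user_agent="*"):
    """Return True if robots_txt permits user_agent to request url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # Only the path portion of the URL is matched against the rules,
    # so the gemini:// scheme does not get in the way.
    return parser.can_fetch(user_agent, url)
```

A crawler would call `may_fetch(text, url)` before each request and skip any URL for which it returns False.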

👽 m0xee

Are there any conventions for robots.txt for Gemini in particular?

I also wonder what favicon.txt is. I've seen some software, or maybe crawlers, attempting to access it on my capsule. I was never into Gopher: is it a Gopher thing that got adopted on Gemini too, or is it just some unofficial convention that some software uses? · 6 months ago