Oh, whose crawler is abusing TLGS?
I'm getting constant requests hitting the search API. It's not a big problem due to the low volume, and since I log nothing, I don't know who is doing it. But you are generating a lot of errors.
2 years ago
I'm afraid my first post on the station is going to be an apology. One of those might have been mine. I've already sent you an email explaining my situation and apologizing. Hopefully this has already been resolved. But if not, y'all now know where to find me. · 2 years ago
It's sad to see I'm not the only one to face this issue. The one I saw is quite weird, though. It's hitting `/search?Gemini%20size:>20KB=========` with a varying number of equals signs each time. It was actually the parsing errors that led me to discover it.
I'm not sure about logging just to blackhole someone, though. I try my best not to log anything, but it seems necessary in this case. · 2 years ago
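An aside on that `/search?Gemini%20size:>20KB=========` pattern: one way to keep requests like that from surfacing as parser errors is to validate filter tokens before they reach the query parser. A minimal sketch, assuming a hypothetical `size:` filter grammar (this is not TLGS's actual syntax):

```python
import re

# Assumed filter grammar: "size:" followed by a comparator and a number
# with an optional unit, e.g. "size:>20KB". Purely illustrative.
FILTER_RE = re.compile(r"^size:(<=|>=|<|>)\d+(B|KB|MB)?$")

def validate_query(query: str) -> bool:
    """Accept only queries whose tokens are plain words or well-formed filters."""
    for token in query.split():
        if ":" in token and not FILTER_RE.match(token):
            # Malformed filter like "size:>20KB=========" -> reject early,
            # e.g. with a Gemini 59 (bad request), instead of raising.
            return False
    return True
```

With something like this, the broken crawler would just get a bad-request response instead of polluting the error logs.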
yeah, it's kind of gross. I'm fine with crawlers as long as they 1) respect robots.txt and 2) are somewhat slow (~1 request/sec). Most seem to fail #1 except for ones like TLGS, Lupa, mine, etc. oh well · 2 years ago
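For anyone writing a crawler, those two rules are cheap to implement. A minimal sketch using Python's stdlib robots.txt parser plus simple pacing (the user-agent name and robots.txt contents here are made up for illustration):

```python
import time
import urllib.robotparser

class PoliteFetcher:
    """Crawl helper that honors robots.txt and paces itself to ~1 request/sec."""

    def __init__(self, robots_txt: str, user_agent: str = "example-crawler",
                 min_interval: float = 1.0):
        self.rp = urllib.robotparser.RobotFileParser()
        # In a real crawler you would fetch robots.txt from the capsule first.
        self.rp.parse(robots_txt.splitlines())
        self.user_agent = user_agent
        self.min_interval = min_interval
        self._last = 0.0

    def allowed(self, url: str) -> bool:
        """Rule 1: never fetch a URL robots.txt disallows for us."""
        return self.rp.can_fetch(self.user_agent, url)

    def wait_turn(self) -> None:
        """Rule 2: sleep just enough to stay at or below the request rate."""
        now = time.monotonic()
        delay = self.min_interval - (now - self._last)
        if delay > 0:
            time.sleep(delay)
        self._last = time.monotonic()
```

Calling `wait_turn()` before every fetch keeps the crawler at roughly one request per second per host, which is about what server operators here seem to expect.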
@acidus Ugh. You made me check mine again and I found another crawler even worse than the first. Over 60,000 requests from it in the logs with it first appearing March 13. A ton of the responses it's been getting back are rate limit or client certificate errors so you'd think it'd take a hint. Neither of the two I've blackholed so far are coming from that capsule though. · 2 years ago
Same. I tracked down a crawler coming from the same IP address as this capsule:
gemini://frrobert.net/
They were aggressively crawling links through NewsWaffle, hitting over 450,000 URLs in a week. It was literally crawling the entire Internet through my poor CGI 🤯
Problem was, NewsWaffle caches the HTML, and this caused my VPS's disk to fill up faster than the cron job could clean it. I messaged them several times and got no response, so I blackholed their IP...
🤷🏻 · 2 years ago
I noticed a poorly made crawler on the rocketcaster capsule a couple weeks ago. Ignoring robots.txt, spamming expensive endpoints, and ignoring rate limits were just some of the things it was doing. I wonder if yours is the same one.
If it's causing you trouble, one option is to do some logging to find the IP of the crawler and black hole it. That's what I ended up doing. · 2 years ago
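The blackhole itself can live in the host firewall, but for a capsule server that prefers minimal logging, it can also be done in-process: note the offending IP once, ban it, and never touch that request again. A small sketch under those assumptions (the handler and response text are hypothetical, not any real server's API):

```python
import ipaddress

class Blackhole:
    """Drop requests from banned IPs/ranges before doing any real work.

    Kept in memory for this sketch; a real deployment might persist the
    list or push it into the host firewall (nftables/iptables) instead.
    """

    def __init__(self):
        self.banned = set()

    def ban(self, cidr: str) -> None:
        self.banned.add(ipaddress.ip_network(cidr, strict=False))

    def is_banned(self, ip: str) -> bool:
        addr = ipaddress.ip_address(ip)
        return any(addr in net for net in self.banned)

def handle_request(bh: Blackhole, client_ip: str, path: str) -> str:
    """Hypothetical request handler: banned clients get nothing at all."""
    if bh.is_banned(client_ip):
        return ""  # drop silently: no response body, no log entry
    return f"20 text/gemini\r\n# served {path}"
```

The nice property is that once the IP is banned, no further logging is needed, which fits the log-as-little-as-possible stance several people in this thread share.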