💾 Archived View for station.martinrue.com › haze › 6f52994095a342d9b009fdbe28037a85 captured on 2023-03-20 at 18:58:46. Gemini links have been rewritten to link to archived content
-=-=-=-=-=-=-
Ohy, whose crawler is abusing TLGS?
I'm getting constant requests hitting the search API. It's not a big problem due to the low volume, and since I log nothing I don't know who is doing it. But you are generating a lot of errors.
1 day ago
It's sad to see I'm not the only one facing this issue. The one I saw is quite weird though. It's hitting `/search?Gemini%20size:>20KB=========` with a varying number of equals signs each time. It was actually the parsing errors that led me to discover it.
I don't know about logging to blackhole them, though. I'm trying my best not to log, but it seems necessary in this case. · 14 hours ago
yeah, it's kind of gross. I'm fine with crawlers as long as they 1) respect robots.txt and 2) are somewhat slow (~1 request/sec). Most seem to fail #1 except for ones like TLGS, Lupa, mine, etc. oh well · 18 hours ago
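Those two rules cost a polite crawler almost nothing to follow. A minimal sketch with Python's stdlib robots.txt parser — the capsule URL and the rules here are made up, and actually fetching robots.txt over the Gemini protocol is assumed to happen elsewhere:

```python
import time
import urllib.robotparser

# Hypothetical robots.txt text, assumed already fetched from the capsule.
robots_txt = """User-agent: *
Disallow: /search
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

urls = [
    "gemini://example.capsule/",
    "gemini://example.capsule/search?foo",
]

for url in urls:
    # Rule 1: skip anything robots.txt disallows.
    if not rp.can_fetch("mycrawler", url):
        print("skipping (disallowed):", url)
        continue
    print("fetching:", url)
    # Rule 2: throttle to roughly one request per second.
    time.sleep(1)
```

`can_fetch` only looks at the path part of the URL, so it works fine on `gemini://` URLs even though the parser was written with HTTP in mind.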
@acidus Ugh. You made me check mine again and I found another crawler even worse than the first. Over 60,000 requests from it in the logs with it first appearing March 13. A ton of the responses it's been getting back are rate limit or client certificate errors so you'd think it'd take a hint. Neither of the two I've blackholed so far are coming from that capsule though. · 18 hours ago
Same. I tracked down a crawler coming from the same IP address as this capsule:
gemini://frrobert.net/
They were aggressively crawling links through NewsWaffle, hitting over 450000 URLs in a week. It was literally crawling the entire Internet through my poor CGI 🤯
Problem was, NewsWaffle caches the HTML, and this caused my VPS's disk to fill up faster than the cron job could clean it. I messaged them several times and got no response, so I blackholed their IP...
🤷🏻 · 21 hours ago
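For anyone running a similar cache, the cleanup sweep can be a few lines. A sketch, assuming cached files can simply be deleted once they pass a maximum age (the directory layout and cutoff are made up, not NewsWaffle's actual setup):

```python
import os
import time

def prune_cache(cache_dir, max_age):
    """Remove cached files older than max_age seconds; return their names."""
    now = time.time()
    removed = []
    for name in os.listdir(cache_dir):
        path = os.path.join(cache_dir, name)
        if os.path.isfile(path) and now - os.path.getmtime(path) > max_age:
            os.remove(path)
            removed.append(name)
    return removed
```

Run it from cron every few minutes so an aggressive crawler can't fill the disk faster than the sweep.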
I noticed a poorly made crawler on the rocketcaster capsule a couple weeks ago. Ignoring robots.txt, spamming expensive endpoints, and ignoring rate limits were just some of the things it was doing. I wonder if yours is the same one.
If it's causing you trouble, one option is to do some logging to find the IP of the crawler and black hole it. That's what I ended up doing. · 23 hours ago
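The tallying part is tiny. A sketch assuming one "<ip> <request>" pair per log line — real server log formats will differ, and the addresses here are made up:

```python
from collections import Counter

# Hypothetical access-log lines: "<ip> <request>" per line.
log_lines = [
    "203.0.113.42 /search?Gemini%20size:>20KB====",
    "203.0.113.42 /search?Gemini%20size:>20KB=====",
    "198.51.100.7 /",
    "203.0.113.42 /search?Gemini%20size:>20KB======",
]

# Count requests per source IP and surface the worst offender.
hits = Counter(line.split()[0] for line in log_lines)
worst_ip, count = hits.most_common(1)[0]
print(worst_ip, count)  # the address to consider blackholing
```

Once you have the address, you can drop it at the firewall and go back to not logging.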