💾 Archived View for station.martinrue.com › marginalia › 9ceff158c79b4e92ac170e012e3c39a7 captured on 2024-03-21 at 17:35:38. Gemini links have been rewritten to link to archived content
-=-=-=-=-=-=-
Currently have a botnet spamming my search engine. I've blocked a couple of thousand IPs and things seem to be holding up, but if it goes down, you know what happened. Really don't want to have to hide behind Cloudflare or something like that. They seem pretty sketchy from a privacy standpoint.
2 years ago
I'm always wary of unknown unknowns, but this seems pretty damning. · 2 years ago
What was the sliver of doubt? I can't think of anything that would legitimately make those queries over thousands of IP addresses. · 2 years ago
I tried on a whim just connecting to a few of the IP addresses with my web browser. The ones I could connect to gave me administration pages for enterprise-grade routers. If there was any sliver of a doubt it was a botnet, it's now been removed. · 2 years ago
Perhaps they have a list of common search phrases, and now want to know what sites to target for injection, or to beat in the ranking. If so, you could return honeypot URLs to see what crawler turns up to have a look. · 2 years ago
@mntn That's an appealing idea, but tarpits are actually kind of a DoS vulnerability in themselves, as they require you to keep a ton of open connections. It's surprisingly expensive. What I am doing now seems to hold up. I've figured out a heuristic that very effectively identifies the bots, so unless they implement drastic changes I think this works. · 2 years ago
@kevinsan I have a hunch it's actually connected to the hacked WordPress installs I discovered earlier. Their URLs looked *a lot* like these search terms. They have that vibe of, you know, "asimov foundation free online pdf download ebook", keyword stuffing. They're overspecified to the degree where they almost all return nothing or just nonsense dictionary-type results. Maybe they're trying to figure out if their SEO spam works as intended? · 2 years ago
Reasoning: it will probably take them a while to notice that they are getting garbage results, so they can't automatically spin up a new IP when it happens. And it's just annoying, so they might give up eventually. · 2 years ago
Maybe instead of blocking the IPs, set up a tarpit you can send them to that serves up non-obvious garbage results... slowly. · 2 years ago
I wonder if someone's simply trying to mine your index? If there are some esoteric terms in their list, you could seed them with tracer URLs to see if they turn up in some up-coming startup's results. If this is what's happening, they're bound to reuse the list. · 2 years ago
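One way the tracer idea above might be sketched: derive a stable token from each suspicious query, so a URL that later shows up elsewhere can be tied back to the query it was seeded for. The key, the `tracer.example.com` host and both function names are made up for illustration; nothing here reflects how the search engine actually works.

```python
import hashlib
import hmac

# Hypothetical values, not from the thread.
SECRET = b"change-me"
BASE = "https://tracer.example.com/p/"

def tracer_url(query):
    """Build a tracer URL whose token is a keyed hash of the normalized query."""
    normalized = query.strip().lower().encode("utf-8")
    token = hmac.new(SECRET, normalized, hashlib.sha256).hexdigest()[:16]
    return BASE + token

def seeded_for(url, query):
    """Check whether a URL seen in the wild was the tracer seeded for this query."""
    return url == tracer_url(query)

url = tracer_url("asimov foundation free online pdf download ebook")
```

Because the token is keyed, someone who only sees the URLs can't regenerate them for other queries, and no lookup table of seeded URLs has to be stored.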
@krsh It's kind of hard to distribute a database of this size. Not that it can't be done, it's just that it requires at least an additional order of magnitude of hardware. The main reason it's currently so fast is that I'm able to keep most of it on the same machine, most of it in RAM even. · 2 years ago
@sk I've been able to fingerprint them pretty easily. First I gave them a 403 from my search engine and let that go on for a while, then I just grepped all the IPs that had gotten 403s and blocked them. A few are still getting through, but whatever. · 2 years ago
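The grep-and-block step described above could look roughly like this. It's only a sketch: it assumes an access log in combined log format (client IP first, status code right after the quoted request), and the function name and sample lines are invented for illustration, not taken from the real setup.

```python
import re
from collections import Counter

# Assumed log shape: IP, two ignored fields, [timestamp], "request", status.
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "[^"]*" (\d{3}) ')

def ips_with_403(lines, min_hits=1):
    """Return the sorted IPs that received at least min_hits 403 responses."""
    hits = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if m and m.group(2) == "403":
            hits[m.group(1)] += 1
    return sorted(ip for ip, n in hits.items() if n >= min_hits)

# Made-up sample lines using documentation-range addresses.
sample = [
    '203.0.113.7 - - [21/Mar/2022:10:00:00 +0000] "GET /search?query=a HTTP/1.1" 403 0',
    '203.0.113.7 - - [21/Mar/2022:10:00:01 +0000] "GET /search?query=b HTTP/1.1" 403 0',
    '198.51.100.2 - - [21/Mar/2022:10:00:02 +0000] "GET /search?query=c HTTP/1.1" 200 512',
]
blocklist = ips_with_403(sample)
```

The resulting list would then be fed to whatever does the actual blocking (firewall rules, a deny list in the reverse proxy, and so on); raising `min_hits` makes it less likely that a one-off false positive gets an address blocked.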
@kevinsan They are making search requests. Looks like they are working off some list. Don't really know what's up with the queries. In terms of load it's not that much worse than when all of Hacker News was searching, plus I can block these as I discover them, so it's pretty manageable. · 2 years ago
Are they actually making search requests, or is it something more generic? If proper requests, what kind of queries are they making? · 2 years ago
if you do find a solution for holding bots off let me know. my https gateway gets pretty hammered and i made a quick thing that permabans connections trying to find wordpress/etc exploits. doesn't defend on the gemini side of things but i haven't been hit with spam there yet · 2 years ago
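A "quick thing that permabans" along those lines might be sketched as below. The probe patterns and the class name are guesses for illustration only; a real list would be tuned to whatever the gateway actually sees, and the ban set here lives in memory, so it is lost on restart.

```python
import re

# Hypothetical probe paths; not taken from any real gateway config.
PROBE_PATTERNS = re.compile(
    r"/wp-login\.php|/wp-admin|/xmlrpc\.php|/\.env|/phpmyadmin",
    re.IGNORECASE,
)

class ProbeBanList:
    """Permanently ban any client IP that requests a known exploit-probe path."""

    def __init__(self):
        self.banned = set()

    def allow(self, ip, path):
        """Return False for banned IPs; ban an IP the first time it probes."""
        if ip in self.banned:
            return False
        if PROBE_PATTERNS.search(path):
            self.banned.add(ip)
            return False
        return True

bans = ProbeBanList()
first = bans.allow("192.0.2.1", "/index.html")    # ordinary request
probe = bans.allow("192.0.2.1", "/wp-login.php")  # probe, gets banned
after = bans.allow("192.0.2.1", "/index.html")    # still banned
```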