Who owns a crawler running on IP 140.82.24.154?

It doesn't respect robots.txt and makes requests too fast. I had to ban this IP with a firewall rule.

This thread is open for complaints about other badly behaved bots.
Tangentially, I recommend that anyone writing a crawler first aggregate the links, remove duplicates, then shuffle them into random order before processing them as a single batch. For example:

    cat found-links.txt | sort | uniq | sort -R >links-to-process.txt

This approach mitigates a lot of edge-case problems with fully automated crawlers, and makes it much less likely that any one server gets inundated with requests.

Kevin

Sent from phone

On Wed, 9 Jun 2021, 09:35 Anna “CyberTailor”, <cyber@sysrq.in> wrote:

> Who owns a crawler running on IP 140.82.24.154?
>
> It doesn't respect robots.txt and makes requests too fast. I had to
> ban this IP with a firewall rule.
>
> This thread is open for complaints on other bad behaving bots.
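The aggregate, deduplicate, shuffle steps described above can also be sketched in Python, for crawlers that keep their queue in memory rather than in a text file. This is only an illustration: the function names `build_batch` and `write_batch` and the optional `seed` parameter are assumptions made for this example, not part of any existing crawler.

```python
import random


def build_batch(links, seed=None):
    """Return the unique, non-empty links in random order.

    `seed` is a hypothetical parameter, only there to make runs repeatable.
    """
    # Strip whitespace, drop blank lines, and remove duplicates.
    unique = sorted({link.strip() for link in links if link.strip()})
    # Shuffle so that URLs from the same host are spread across the batch.
    random.Random(seed).shuffle(unique)
    return unique


def write_batch(src_path, dst_path, seed=None):
    """File-based variant mirroring the shell pipeline's input and output."""
    with open(src_path, encoding="utf-8") as src:
        batch = build_batch(src, seed=seed)
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(link + "\n" for link in batch)
    return len(batch)
```

Calling `write_batch("found-links.txt", "links-to-process.txt")` would reproduce the one-liner's behaviour. Note that shuffling only spreads requests out on average; a crawler that wants a hard guarantee still needs a per-host delay.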
On Wed, Jun 09, 2021 at 01:35:27PM +0500, Anna “CyberTailor” wrote:

> Who owns a crawler running on IP 140.82.24.154?
>
> It doesn't respect robots.txt and makes requests too fast. I had to
> ban this IP with a firewall rule.
>
> This thread is open for complaints on other bad behaving bots.

I'm not sure what purpose publishing the IP address serves. You gain nothing and potentially violate the privacy of a valid user.

That said, I don't recommend manually sifting through logs and hand-writing firewall rules. You might want to look into something like fail2ban, which will automatically ban misbehaving visitors for you. There are also other posts about putting your Gemini services behind a reverse proxy, which might help.
On Wed, 9 Jun 2021 13:35:27 +0500 Anna “CyberTailor” <cyber@sysrq.in> wrote:

> Who owns a crawler running on IP 140.82.24.154?
>
> It doesn't respect robots.txt and makes requests too fast. I had to
> ban this IP with a firewall rule.
>
> This thread is open for complaints on other bad behaving bots.

How about just putting rate limiting into your server?

-- 
 _______________________________________
/ Against his wishes, a math teacher's  \
| classroom was remodeled. Ever since,  |
| he's been talking about the good old  |
| dais. His students planted a small    |
| orchard in his honor; the trees all   |
\ have square roots.                    /
 ---------------------------------------
     \
      \
       /\   /\
      //\\_//\\     ____
      \_     _/    /   /
       / * * \    /^^^]
       \_\O/_/    [   ]
        /   \_    [   /
        \     \_  /  /
         [ [ /  \/ _/
        _[ [ \  /_/
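One way to build the rate limiting suggested above into a server is a per-IP token bucket. The following is a minimal sketch under stated assumptions: the class name `PerIPRateLimiter` and its `rate`, `burst`, and `clock` parameters are invented for this example and do not come from any particular Gemini server. A server that rejects a request this way could answer with Gemini status 44 (SLOW DOWN), which exists for this purpose.

```python
import time


class PerIPRateLimiter:
    """Token bucket per client IP (illustrative; names are hypothetical)."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = rate      # tokens refilled per second
        self.burst = burst    # maximum tokens a client can accumulate
        self.clock = clock    # injectable clock, handy for testing
        # ip -> (tokens, time of last update). Never pruned in this sketch,
        # so a real server would need to expire idle entries.
        self.buckets = {}

    def allow(self, ip):
        """Return True if a request from `ip` may proceed right now."""
        now = self.clock()
        tokens, last = self.buckets.get(ip, (self.burst, now))
        # Refill according to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self.buckets[ip] = (tokens - 1, now)
            return True
        self.buckets[ip] = (tokens, now)
        return False
```

With `PerIPRateLimiter(rate=1.0, burst=5)`, each address may send a short burst of five requests and then roughly one request per second; the numbers are placeholders to be tuned per server.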