💾 Archived View for gemi.dev › gemini-mailing-list › 000933.gmi captured on 2023-12-28 at 15:55:17. Gemini links have been rewritten to link to archived content

-=-=-=-=-=-=-

Bad bots

1. Anna “CyberTailor” (cyber (a) sysrq.in)

Who owns a crawler running on IP 140.82.24.154?

It doesn't respect robots.txt and makes requests too fast. I had to
ban this IP with a firewall rule.

This thread is open for complaints about other badly behaving bots.

2. Kevin Sangeelee (kevin (a) susa.net)

Tangentially, I recommend anyone writing crawlers first aggregate the
links, remove duplicates, then sort them randomly before processing as a
single batch.

For example:

sort -u found-links.txt | sort -R > links-to-process.txt

This approach mitigates a lot of edge-case problems with fully automated
crawlers, and makes it much less likely that any one server gets
inundated with a burst of consecutive requests.
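
The same idea can be sketched in Python for crawlers that keep their queue in memory. This is only an illustration of the aggregate, deduplicate, shuffle steps described above: the function name `prepare_batch` and the `seed` parameter are hypothetical, and the example URLs are placeholders.

```python
import random


def prepare_batch(links, seed=None):
    """Deduplicate links, then return them in a random processing order.

    `links` is any iterable of URL strings (for example, the lines of a
    found-links file). `seed` is optional and only makes the order
    reproducible for testing.
    """
    # Strip surrounding whitespace and drop blank lines; the set removes duplicates.
    unique = sorted({link.strip() for link in links if link.strip()})
    # Shuffle so that links to the same host are spread through the batch
    # instead of being requested back to back.
    rng = random.Random(seed)
    rng.shuffle(unique)
    return unique


if __name__ == "__main__":
    found = [
        "gemini://a.example/page1\n",
        "gemini://a.example/page1\n",  # duplicate, removed
        "gemini://a.example/page2\n",
        "gemini://b.example/\n",
    ]
    for url in prepare_batch(found):
        print(url)
```

Shuffling only spreads requests out statistically. A crawler that finds most of its links on a single host would still need an explicit per-host delay.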

Kevin

Sent from phone

On Wed, 9 Jun 2021, 09:35 Anna “CyberTailor”, <cyber@sysrq.in> wrote:

> Who owns a crawler running on IP 140.82.24.154?
>
> It doesn't respect robots.txt and makes requests too fast. I had to
> ban this IP with a firewall rule.
>
> This thread is open for complaints about other badly behaving bots.
>

3. (remyabel (a) tilde.team)

On Wed, Jun 09, 2021 at 01:35:27PM +0500, Anna “CyberTailor” wrote:
> Who owns a crawler running on IP 140.82.24.154?
> 
> It doesn't respect robots.txt and makes requests too fast. I had to
> ban this IP with a firewall rule.
> 
> This thread is open for complaints about other badly behaving bots.

I'm not sure what purpose publishing the IP address serves. You gain
nothing and potentially violate the privacy of a legitimate user. That
said, I don't recommend sifting through logs and writing firewall rules
by hand. You might want to look into something like fail2ban, which
will automatically ban misbehaving visitors for you. There are also
other posts about putting your Gemini services behind a reverse proxy,
which might help.

4. Thomas Groman (tgrom.automail (a) nuegia.net)

On Wed, 9 Jun 2021 13:35:27 +0500
Anna “CyberTailor” <cyber@sysrq.in> wrote:

> Who owns a crawler running on IP 140.82.24.154?
> 
> It doesn't respect robots.txt and makes requests too fast. I had to
> ban this IP with a firewall rule.
> 
> This thread is open for complaints about other badly behaving bots.

How about just putting rate limiting into your server?
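
One common way to do this is a per-IP token bucket. The sketch below is a minimal, server-agnostic illustration, not taken from any particular Gemini server: the class name `TokenBucketLimiter` and its parameters are hypothetical, and the rate and burst values are arbitrary assumptions.

```python
import time


class TokenBucketLimiter:
    """Per-client token bucket: each IP may burst, then is held to a steady rate."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = rate      # tokens refilled per second
        self.burst = burst    # maximum tokens a client can accumulate
        self.clock = clock    # injectable clock, which makes testing easy
        self.buckets = {}     # ip -> (tokens, time of last update)

    def allow(self, ip):
        """Return True if a request from `ip` may proceed right now."""
        now = self.clock()
        # A client seen for the first time starts with a full bucket.
        tokens, last = self.buckets.get(ip, (self.burst, now))
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self.buckets[ip] = (tokens - 1, now)
            return True
        self.buckets[ip] = (tokens, now)
        return False


if __name__ == "__main__":
    # Assumed policy: 1 request per second on average, bursts of up to 5.
    limiter = TokenBucketLimiter(rate=1.0, burst=5)
    results = [limiter.allow("192.0.2.10") for _ in range(7)]
    print(results)
```

When `allow` returns False, a Gemini server could answer with status 44 (SLOW DOWN) rather than dropping the connection. A long-running server would also want to evict idle entries from `buckets` so the table does not grow without bound.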

-- 
 _______________________________________ 
/  Against his wishes, a math teacher's \
| classroom was remodeled. Ever since,  |
| he's been talking about the good old  |
| dais. His students planted a small    |
| orchard in his honor; the trees all   |
\ have square roots.                    /
 --------------------------------------- 
\
 \
   /\   /\   
  //\\_//\\     ____
  \_     _/    /   /
   / * * \    /^^^]
   \_\O/_/    [   ]
    /   \_    [   /
    \     \_  /  /
     [ [ /  \/ _/
    _[ [ \  /_/
