
Bad bots

Kevin Sangeelee kevin at susa.net

Wed Jun 9 11:25:38 BST 2021

- - - - - - - - - - - - - - - - - - - 

Tangentially, I recommend anyone writing crawlers first aggregate the links, remove duplicates, then sort them randomly before processing as a single batch.

For example:

cat found-links.txt | sort | uniq | sort -R > links-to-process.txt
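(On systems with GNU coreutils the same pipeline can be written more compactly as sort -u found-links.txt | shuf > links-to-process.txt, since sort -u folds the sort and uniq steps together and shuf emits a random permutation of its input.)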

This approach mitigates a lot of edge-case problems with fully automated crawlers, and ensures no one server gets inundated with requests.
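To sketch the processing side: once the batch file exists, a loop like the one below works. Note that fetch-url here is a hypothetical placeholder for whatever fetch command your crawler actually uses, not a real utility.

while IFS= read -r url; do
    fetch-url "$url"   # hypothetical fetcher; substitute your crawler's fetch command
    sleep 1            # a short pause keeps the crawler polite even if the
                       # randomized list happens to be dominated by one host
done < links-to-process.txt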

Kevin

Sent from phone

On Wed, 9 Jun 2021, 09:35 Anna “CyberTailor”, <cyber at sysrq.in> wrote:

> Who owns a crawler running on IP 140.82.24.154?
> It doesn't respect robots.txt and makes requests too fast. I had to
> ban this IP with a firewall rule.
> This thread is open for complaints about other badly behaved bots.