Kevin Sangeelee kevin at susa.net
Wed Jun 9 11:25:38 BST 2021
- - - - - - - - - - - - - - - - - - -
Tangentially, I recommend that anyone writing a crawler first aggregate the links, remove duplicates, then sort them randomly before processing them as a single batch.
For example:
cat found-links.txt | sort | uniq | sort -R > links-to-process.txt
This approach mitigates a lot of edge-case problems with fully automated crawlers, and ensures that no one server gets inundated with requests.
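For anyone who prefers scripting it, here is a minimal sketch of the same idea in Python. It is only an illustration: the fetch() stub and the 2-second per-host delay are placeholders to swap for your own request code and politeness policy, not anything prescribed here.

    import random
    import time
    from urllib.parse import urlparse

    def load_links(path):
        # Read the aggregated links, dropping blanks and duplicates.
        with open(path) as f:
            return list({line.strip() for line in f if line.strip()})

    def fetch(url):
        # Placeholder: replace with the real request/processing code.
        print("fetching", url)

    def crawl(links, per_host_delay=2.0):
        # Shuffle so consecutive requests rarely hit the same server.
        random.shuffle(links)
        last_hit = {}  # hostname -> time of last request to that host
        for url in links:
            host = urlparse(url).hostname
            last = last_hit.get(host)
            if last is not None:
                # Wait out the remainder of the per-host delay, if any.
                wait = per_host_delay - (time.monotonic() - last)
                if wait > 0:
                    time.sleep(wait)
            last_hit[host] = time.monotonic()
            fetch(url)

    crawl(load_links("links-to-process.txt"))

The shuffle spreads requests across hosts, and the per-host delay catches the cases where one host still dominates the list.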
Kevin
Sent from phone
On Wed, 9 Jun 2021, 09:35 Anna “CyberTailor”, <cyber at sysrq.in> wrote:
Who owns a crawler running on IP 140.82.24.154?
It doesn't respect robots.txt and makes requests too fast. I had to
ban this IP with a firewall rule.
This thread is open for complaints about other badly behaved bots.