Re: "Any experience with data scraping / crawler here on GEMINI?..."
I use the IP addresses of the clients and check who owns the ASN to detect crawlers from Google, OpenAI, Facebook...
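As a rough illustration of that kind of IP-to-ASN check, here is a small Python sketch using Team Cymru's public IP-to-ASN DNS service and the third-party dnspython package. It is only one possible approach, not necessarily the setup described above, and the IP address in the example is a placeholder.

```python
import dns.resolver  # third-party: pip install dnspython

def asn_for_ip(ip):
    """Look up the ASN and owner of an IPv4 address via Team Cymru's DNS service."""
    reversed_octets = ".".join(reversed(ip.split(".")))
    # TXT answer looks like: "15169 | 8.8.8.0/24 | US | arin | 1992-12-01"
    origin = next(iter(dns.resolver.resolve(f"{reversed_octets}.origin.asn.cymru.com", "TXT")))
    asn = origin.to_text().strip('"').split("|")[0].split()[0]
    # TXT answer looks like: "15169 | US | arin | 2000-03-30 | GOOGLE, US"
    desc = next(iter(dns.resolver.resolve(f"AS{asn}.asn.cymru.com", "TXT")))
    owner = desc.to_text().strip('"').split("|")[-1].strip()
    return asn, owner

if __name__ == "__main__":
    # Placeholder IP; expect something like ('15169', 'GOOGLE, US')
    print(asn_for_ip("8.8.8.8"))
```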
2023-12-29 · 5 weeks ago
Many crawlers in geminispace don't respect robots.txt. They don't even request it. I constantly have naive crawlers endlessly crawling all of Wikipedia via my Gemipedia CGI, or endlessly crawling news websites via my NewsWaffle CGI. It's rather annoying, and when I see it I block those IPs with iptables.
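For illustration, a minimal sketch of that kind of manual block, shelling out to iptables once an abusive address has shown up in the logs. It assumes root privileges, and the address below is just a documentation placeholder.

```python
import subprocess

def block_ip(ip):
    """Insert a rule at the top of the INPUT chain dropping all traffic from ip."""
    subprocess.run(["iptables", "-I", "INPUT", "-s", ip, "-j", "DROP"], check=True)

if __name__ == "__main__":
    block_ip("203.0.113.42")  # example address from the RFC 5737 documentation range
```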
One thing I do, and wish other crawlers did, is request robots.txt with a query string identifying the crawler. For example, Kennedy always sends a request for /robots.txt?kennedy-crawler before requesting any other URLs.
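A minimal Python sketch of a crawler behaving that way: it announces itself via a query string on robots.txt, parses the reply, and checks the rules before fetching anything else. This is not Kennedy's actual code; the crawler name and host are placeholders.

```python
import socket
import ssl
import urllib.robotparser

def gemini_fetch(host, path, port=1965, timeout=10):
    """Fetch a Gemini URL and return (status, meta, body_text)."""
    # Gemini capsules commonly use self-signed certs (TOFU), so skip CA verification here.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    with socket.create_connection((host, port), timeout=timeout) as raw:
        with ctx.wrap_socket(raw, server_hostname=host) as tls:
            tls.sendall(f"gemini://{host}{path}\r\n".encode("utf-8"))
            data = b""
            while chunk := tls.recv(4096):  # server closes the connection after the response
                data += chunk
    header, _, body = data.partition(b"\r\n")
    status, _, meta = header.decode("utf-8", "replace").partition(" ")
    return status, meta, body.decode("utf-8", "replace")

def fetch_robots(host, crawler_name="example-crawler"):
    """Announce the crawler via the query string, then parse the capsule's robots.txt."""
    status, _, body = gemini_fetch(host, f"/robots.txt?{crawler_name}")
    rp = urllib.robotparser.RobotFileParser()
    # A missing robots.txt (non-2x status) is treated as "everything allowed".
    rp.parse(body.splitlines() if status.startswith("2") else [])
    return rp

if __name__ == "__main__":
    robots = fetch_robots("example.org")
    print(robots.can_fetch("example-crawler", "/"))
```

Since a Gemini request is just a URL with no headers, there is no User-Agent to inspect, so a query-string hint like this is about the only way a capsule operator can tell crawlers apart without falling back to IP or ASN checks.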
Any experience with data scraping / crawler here on GEMINI? Can it be verified? I became aware of DARK VISITORS and wonder if it is worth implementing something from there. [https link] (a list of known AI agents on the internet)