Any experience with data scraping / crawlers here on Gemini? Can their activity be verified? I became aware of Dark Visitors and wonder if it is worth implementing something from there. https://darkvisitors.com/ (a list of known AI agents on the internet)
2023-12-29 · 3 months ago
👤 AnoikisNomads · Dec 29 at 15:54:
disclaimer: I haven't done my due diligence and checked.
personally I'm not very confident that a robots.txt entry will prevent any crawlers / scrapers from doing anything. I mentally treat the file as a gentle suggestion to complete strangers and businesses to be nice and read the signs.
data is, for some businesses, well, their business. I suspect they'll crawl whatever is accessible, both under known crawler IDs and under generic Chrome browser IDs.
it won't hurt to try adding the agents to robots.txt, but I wouldn't get my hopes up.
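For illustration, entries blocking a few commonly listed AI agents might look like the following. The agent names are ones their operators document (GPTBot is OpenAI's crawler, CCBot is Common Crawl's, Google-Extended is Google's AI-training opt-out token); check darkvisitors.com for a current list.

```
# Illustrative robots.txt entries for known AI crawlers
User-agent: GPTBot
User-agent: CCBot
User-agent: Google-Extended
Disallow: /
```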
I use the IP addresses of the clients and check who owns the ASN to detect the crawlers from Google, OpenAI, Facebook...
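For anyone who wants to try this, here is a minimal Python sketch of the idea, assuming dnspython is installed and using Team Cymru's public IP-to-ASN DNS service. The blocklist of organizations is just an example, not anyone's actual configuration.

```python
# Sketch: map a client IPv4 address to its origin ASN and owner via
# Team Cymru's DNS interface, then match against an example blocklist.
# Requires dnspython (pip install dnspython).
import dns.exception
import dns.resolver

BLOCKED_ORGS = ("GOOGLE", "OPENAI", "FACEBOOK", "META")  # example list

def asn_for_ip(ip: str):
    """Return the origin ASN for an IPv4 address, or None on failure."""
    reversed_octets = ".".join(reversed(ip.split(".")))
    try:
        answer = dns.resolver.resolve(
            f"{reversed_octets}.origin.asn.cymru.com", "TXT")
    except dns.exception.DNSException:
        return None
    # TXT record looks like: "15169 | 8.8.8.0/24 | US | arin | 1992-12-01"
    return str(answer[0]).strip('"').split("|")[0].split()[0]

def asn_owner(asn: str) -> str:
    """Return the AS description, e.g. 'GOOGLE, US'."""
    answer = dns.resolver.resolve(f"AS{asn}.asn.cymru.com", "TXT")
    # TXT record looks like: "15169 | US | arin | 2000-03-30 | GOOGLE, US"
    return str(answer[0]).strip('"').split("|")[-1].strip()

def is_big_crawler(ip: str) -> bool:
    """True if the IP's ASN belongs to one of the listed operators."""
    asn = asn_for_ip(ip)
    if asn is None:
        return False
    return any(org in asn_owner(asn).upper() for org in BLOCKED_ORGS)
```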
Many crawlers in geminispace don't respect robots.txt. They don't even request it. I constantly have naive crawlers endlessly crawling all of Wikipedia via my Gemipedia CGI, or endlessly crawling news websites via my NewsWaffle CGI. It's rather annoying and when I see it I block those IPs with iptables.
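As a sketch, blocking an abusive IP from a log-watching script could be as simple as shelling out to iptables (Linux with root privileges assumed; the address below is a reserved documentation IP, not a real offender):

```python
# Sketch: drop all further traffic from an abusive crawler's source IP.
import subprocess

def block_ip(ip: str) -> None:
    """Append a DROP rule for this source address to the INPUT chain."""
    subprocess.run(
        ["iptables", "-A", "INPUT", "-s", ip, "-j", "DROP"],
        check=True,
    )

block_ip("203.0.113.42")  # TEST-NET-3 documentation address
```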
One thing I do, that I wish other crawlers did, is request robots.txt with a query string telling what the crawler is. For example, Kennedy always sends a request for /robots.txt?kennedy-crawler before requesting any other URLs.
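A crawler following this convention just fetches robots.txt with an identifying query string before anything else. Here is a rough Python sketch of that first request, with a hypothetical "?my-crawler" tag and host; Gemini speaks TLS on TCP port 1965, and most capsules use self-signed certificates, so certificate checks are skipped here:

```python
# Sketch: announce a crawler via a robots.txt query string, per the
# convention described above, before crawling anything else.
import socket
import ssl

def fetch_gemini(url: str, host: str, port: int = 1965) -> bytes:
    """Fetch a Gemini URL and return the raw response (header + body)."""
    ctx = ssl.create_default_context()
    # Geminispace commonly uses self-signed certs (TOFU), so skip CA checks.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    with socket.create_connection((host, port)) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            tls.sendall(f"{url}\r\n".encode("utf-8"))  # Gemini request line
            chunks = []
            while data := tls.recv(4096):
                chunks.append(data)
    return b"".join(chunks)

host = "example.org"  # hypothetical capsule
# Identify ourselves via the query string before requesting other URLs.
robots = fetch_gemini(f"gemini://{host}/robots.txt?my-crawler", host)
print(robots.decode("utf-8", errors="replace"))
```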