
Any experience with data scraping / crawlers here on Gemini? Can it be verified? I became aware of Dark Visitors (a list of known AI agents on the internet) and wonder whether it is worth implementing something from there.

https://darkvisitors.com/

Posted in: s/Gemini

🍵 mimas

2023-12-29 · 3 months ago

3 Comments ↓

👤 AnoikisNomads · Dec 29 at 15:54:

disclaimer: I haven't done my due diligence and checked.

personally I'm not very confident that a robots.txt entry will prevent any crawlers / scrapers from doing anything. I mentally treat the file as a gentle suggestion to complete strangers and businesses to be nice and read the signs.

data is, for some businesses, well, their business. I suspect they'll crawl whatever is accessible, both under known crawler IDs and generic Chrome browser IDs.

won't hurt to try adding the agents to robots.txt, but I wouldn't get my hopes up.
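if you do try it, a minimal sketch might look like the block below. The agent names are taken from the Dark Visitors list (I haven't verified them), and whether a given crawler actually honors them is exactly the open question.

```
# Block some known AI data scrapers (names per darkvisitors.com)
User-agent: GPTBot
User-agent: CCBot
User-agent: Google-Extended
User-agent: anthropic-ai
Disallow: /
```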

🚀 Remy · Dec 29 at 16:49:

I use the IP addresses of the clients and check who owns the ASN to detect the crawlers from Google, OpenAI, Facebook...
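Roughly like the sketch below (not my exact setup; it assumes the third-party ipwhois package, and the org-name substrings are just examples):

```
# Sketch: map a client IP to its ASN owner and flag known crawler networks.
# Requires the third-party "ipwhois" package (pip install ipwhois).
from ipwhois import IPWhois

# Example ASN-owner substrings to treat as crawler networks.
CRAWLER_ORGS = ("GOOGLE", "OPENAI", "FACEBOOK", "META")

def is_crawler_ip(ip: str) -> bool:
    """Return True if the IP's ASN description matches a known crawler org."""
    result = IPWhois(ip).lookup_rdap(depth=1)
    asn_desc = (result.get("asn_description") or "").upper()
    return any(org in asn_desc for org in CRAWLER_ORGS)

if __name__ == "__main__":
    print(is_crawler_ip("8.8.8.8"))  # Google's public DNS sits in a Google ASN
```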

🧇 Acidus · Dec 30 at 22:16:

Many crawlers in Geminispace don't respect robots.txt. They don't even request it. I constantly have naive crawlers endlessly crawling all of Wikipedia via my Gemipedia CGI, or endlessly crawling news websites via my NewsWaffle CGI. It's rather annoying, and when I see it I block those IPs with iptables.
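(Blocking amounts to something like this sketch: it needs root, wraps the standard iptables CLI, and the address is a placeholder from the RFC 5737 documentation range.)

```
# Sketch: drop all traffic from an abusive crawler IP via iptables.
import subprocess

def block_ip(ip: str) -> None:
    """Append a DROP rule for the given source IP to the INPUT chain."""
    subprocess.run(
        ["iptables", "-A", "INPUT", "-s", ip, "-j", "DROP"],
        check=True,
    )

block_ip("203.0.113.7")  # placeholder documentation address
```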

One thing I do, that I wish other crawlers did, is request robots.txt with a query string telling what the crawler is. For example, Kennedy always sends a request for /robots.txt?kennedy-crawler before requesting any other URLs.
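A minimal sketch of that pattern over the Gemini protocol is below (placeholder host and crawler name; certificate verification is disabled, as is typical with Gemini's trust-on-first-use model):

```
# Sketch: fetch robots.txt over Gemini with a query string naming the crawler.
import socket
import ssl

def fetch_robots(host: str, crawler_name: str, port: int = 1965) -> str:
    """Request gemini://host/robots.txt?crawler_name and return the raw response."""
    context = ssl.create_default_context()
    context.check_hostname = False          # Gemini servers commonly use
    context.verify_mode = ssl.CERT_NONE     # self-signed certs (TOFU)

    with socket.create_connection((host, port)) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            # A Gemini request is the absolute URL followed by CRLF.
            tls.sendall(f"gemini://{host}/robots.txt?{crawler_name}\r\n".encode())
            chunks = []
            while data := tls.recv(4096):
                chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")

print(fetch_robots("example.org", "my-crawler"))  # placeholder host and name
```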