Hi all,

I have a few topics concerning crawling geminispace that I would like to bring up.

First, I have recently updated my web proxy @ https://portal.mozz.us to respect robots.txt files placed at the root of a gemini capsule. The proxy will follow rules for either the "webproxy" or "webproxy-mozz" user agents. My feelings have gone back and forth on this point, because I'm not convinced that a web proxy falls into the category of a "crawler"; it's a network client that makes requests on behalf of real users. Nonetheless, geminispace has been growing, and I have started to see links to gemlog articles via the proxy show up on twitter and the typical tech discussion boards. This clashing of communities makes me nervous on a personal level, so I want to give folks a chance to opt out of it, at least for my proxy.

Second, I have *slowly* been building a crawler for gemini [0]. My goal is to create a historical archive of geminispace and submit it somewhere for long-term preservation. This was inspired by Charles Lehner's recent email to the list [1]. I've spent a fair amount of time in gopher, and I am deeply saddened that so many of the important gopher servers from the 90's have been lost forever.

Currently my crawler respects robots.txt files using the "mozz-archiver" user agent. I will probably standardize this to match my proxy (i.e. "archiver" and "archiver-mozz"). I am not 100% decided that I will even respect robots files for this project (archive.org doesn't [2]), but right now I'm leaning towards "yes".

If you are currently mirroring a large website like Wikipedia over gemini, I would greatly appreciate it if you set up a robots.txt to block one of the above user agents from those paths, to make my life a bit easier (a sample is included below the links).

best,
mozz

[0] https://github.com/michael-lazar/mozz-archiver
[1] gemini://gemi.dev/gemini-mailing-list/messages/002544.gmi
[2] https://blog.archive.org/2017/04/17/robots-txt-meant-for-search-engines-dont-work-well-for-web-archives/
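
P.S. In case it's helpful, a robots.txt at the root of your capsule along these lines should do it. The "/wikipedia/" path is only an example; substitute whatever paths your mirror actually lives under, and keep in mind that the archiver's user agent may change to "archiver"/"archiver-mozz" as mentioned above.

  User-agent: webproxy
  Disallow: /wikipedia/

  User-agent: mozz-archiver
  Disallow: /wikipedia/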