Hi all,

I have a few topics concerning crawling geminispace that I would like to
bring up.

First, I have recently updated my web proxy @ https://portal.mozz.us to
respect robots.txt files placed at the root of a gemini capsule. The proxy
will follow rules for either the "webproxy" or "webproxy-mozz" user agents.
My feelings have gone back and forth on this point because I'm not convinced
that a web proxy falls into the category of a "crawler"; it's a network
client that makes requests on behalf of real users. Nonetheless, geminispace
has been growing and I have started to see links to gemlog articles via the
proxy show up on twitter and the typical tech discussion boards. This
clashing of communities makes me nervous on a personal level, so I want to
give folks a chance to opt out of it, at least for my proxy.

Second, I have *slowly* been building a crawler for gemini [0]. My goal is
to create a historical archive of geminispace and submit it somewhere for
long-term preservation. This was inspired by Charles Lehner's recent email
to the list [1]. I've spent a fair amount of time in gopher, and I am deeply
saddened that so many of the important gopher servers from the 90's have
been lost forever. Currently my crawler respects robots.txt files using the
"mozz-archiver" user agent. I will probably standardize this to match my
proxy (i.e. "archiver" and "archiver-mozz"). I am not 100% decided that I
will even respect robots files for this project (archive.org doesn't [2]),
but right now I'm leaning towards "yes". If you are currently mirroring a
large website like wikipedia over gemini, I would greatly appreciate it if
you set up a robots.txt to block one of the above user-agents from those
paths to make my life a bit easier (see the example at the end of this
message).

best,
mozz

[0] https://github.com/michael-lazar/mozz-archiver
[1] gemini://gemi.dev/gemini-mailing-list/messages/002544.gmi
[2] https://blog.archive.org/2017/04/17/robots-txt-meant-for-search-engines-dont-work-well-for-web-archives/
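For illustration, here is a sketch of what such a robots.txt could look like
(the paths are made up; substitute your own). It blocks the archiver from a
mirrored site and keeps a gemlog away from the web proxy:

User-agent: archiver
User-agent: archiver-mozz
Disallow: /wikipedia-mirror/

User-agent: webproxy
User-agent: webproxy-mozz
Disallow: /gemlog/

Each record names one or more user agents followed by the path prefixes that
are off-limits to them; anything not listed stays allowed.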
It was thus said that the Great Michael Lazar once stated:
> Hi all,
>
> Second, I have *slowly* been building a crawler for gemini [0]. My goal is
> to create a historical archive of geminispace and submit it somewhere for
> long-term preservation. This was inspired by Charles Lehner's recent email
> to the list [1]. I've spent a fair amount of time in gopher, and I am
> deeply saddened that so many of the important gopher servers from the 90's
> have been lost forever. Currently my crawler respects robots.txt files
> using the "mozz-archiver" user agent. I will probably standardize this to
> match my proxy (i.e. "archiver" and "archiver-mozz"). I am not 100% decided
> that I will even respect robots files for this project (archive.org doesn't
> [2]), but right now I'm leaning towards "yes". If you are currently
> mirroring a large website like wikipedia over gemini, I would greatly
> appreciate it if you set up a robots.txt to block one of the above
> user-agents from those paths to make my life a bit easier.

So I've added the following:

User-agent: archiver
User-agent: archiver-mozz
User-agent: mozz-archiver
Disallow: /boston

to my /robots.txt file. I assume this will catch your archiver and prevent
my blog from being archived. Given that my blog is primarily web-based (and
mirrored to Gemini and gopher), it is already archived by existing web
archives.

Now a question: when people check the archive, how will the missing portions
of a Gemini site be presented? I'm blocking the above because it's a mirror
of an existing web site (and might fit your "large" category, what with 20
years of content there), but there's no indication in the robots.txt file of
that.

-spc
On Mon, Oct 12, 2020 at 12:06 AM Sean Conner <sean at conman.org> wrote:
> So I've added the following:
>
> User-agent: archiver
> User-agent: archiver-mozz
> User-agent: mozz-archiver
> Disallow: /boston
>
> to my /robots.txt file. I assume this will catch your archiver and prevent
> my blog from being archived. Given that my blog is primarily web-based (and
> mirrored to Gemini and gopher), it is already archived by existing web
> archives.
>
> Now a question: when people check the archive, how will the missing
> portions of a Gemini site be presented? I'm blocking the above because it's
> a mirror of an existing web site (and might fit your "large" category, what
> with 20 years of content there), but there's no indication in the
> robots.txt file of that.

I plan on always archiving the /robots.txt file if it exists, even if it's
excluded by the parsing rules. So you could always add your comments there
using the "#" syntax, which is part of the 1994 robots.txt standard [0]:

> User-agent: archiver
> Disallow: /boston # Content is a mirror of https://...

I also plan on recording failed requests as part of the archive, so you can
tell the difference between a bad DNS lookup and a connection timeout, etc.
I'll see if I can add a robots-exclusion error type to make it transparent
why the resource is missing.

(By the way, something like your blog archive does not fit my definition of
large :D)

[0] https://www.robotstxt.org/orig.html

- mozz
Hello,

Both of these projects and ideas sound great!

Maybe it's obvious, but just in case: make sure you respect the user agent
of "*" as well, which signifies any robot (see the example below).

makeworld
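For instance (a sketch, with made-up paths), a record for a specific crawler
can sit alongside a catch-all record:

User-agent: archiver-mozz
Disallow: /boston

User-agent: *
Disallow: /private

A client named "archiver-mozz" obeys the first record; any robot without a
record of its own should fall back to the "*" record.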