Sean Conner sean at conman.org
Mon Oct 12 05:06:01 BST 2020
- - - - - - - - - - - - - - - - - - -
It was thus said that the Great Michael Lazar once stated:
> Hi all,
>
> Second, I have *slowly* been building a crawler for gemini [0]. My goal is
> to create a historical archive of geminispace and submit it somewhere for
> long-term preservation. This was inspired by Charles Lehner's recent email
> to the list [1]. I've spent a fair amount of time in gopher, and I am
> deeply saddened that so many of the important gopher servers from the '90s
> have been lost forever. Currently my crawler respects robots.txt files
> using the "mozz-archiver" user agent. I will probably standardize this to
> match my proxy (i.e. "archiver" and "archiver-mozz"). I am not 100% decided
> that I will even respect robots files for this project (archive.org
> doesn't [2]), but right now I'm leaning towards "yes". If you are currently
> mirroring a large website like wikipedia over gemini, I would greatly
> appreciate it if you set up a robots.txt to block one of the above
> user-agents from those paths to make my life a bit easier.
So I've added the following:
User-agent: archiver
User-agent: archiver-mozz
User-agent: mozz-archiver
Disallow: /boston
to my /robots.txt file. I assume this will catch your archiver and prevent
my blog from being archived. Given that my blog is primarily web-based (and
mirrored to Gemini and gopher), it is already archived by existing web
archives.
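For anyone curious how those stacked User-agent lines behave, here's a
quick sketch using Python's stock robots.txt parser. The gemini:// URL is
just a made-up example, and this is not mozz-archiver's actual code; it
only shows that all three agents fall into the same rule group:

  from urllib.robotparser import RobotFileParser

  # The rules from above: three User-agent lines share one rule group,
  # so each of the three agents is blocked from /boston.
  rules = """\
  User-agent: archiver
  User-agent: archiver-mozz
  User-agent: mozz-archiver
  Disallow: /boston
  """

  parser = RobotFileParser()
  parser.parse(rules.splitlines())

  # Hypothetical URL for illustration; only the path matters here.
  url = "gemini://example.com/boston/2020/10/12.1"
  for agent in ("archiver", "archiver-mozz", "mozz-archiver"):
      print(agent, parser.can_fetch(agent, url))   # all three print False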
Now a question: when people check the archive, how will the missing
portions of a Gemini site be presented? I'm blocking the above because it's
a mirror of an existing web site (and it might fit your "large" category,
what with 20 years of content there), but there's no indication of that in
the robots.txt file.
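The best I can do, as far as I know, is a comment (robots.txt allows "#" to
end of line), which a human reading the file could see but an archiver
would ignore; the wording here is just an example:

  # /boston is a mirror of an existing web site, already archived elsewhere
  User-agent: mozz-archiver
  Disallow: /boston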
-spc