Sean Conner sean at conman.org
Mon Oct 12 05:06:01 BST 2020
- - - - - - - - - - - - - - - - - - -
It was thus said that the Great Michael Lazar once stated:
> Hi all,
>
> Second, I have *slowly* been building a crawler for gemini [0]. My goal is
> to create a historical archive of geminispace and submit it somewhere for
> long-term preservation. This was inspired by Charles Lehner's recent email
> to the list [1]. I've spent a fair amount of time in gopher, and I am
> deeply saddened that so many of the important gopher servers from the '90s
> have been lost forever. Currently my crawler respects robots.txt files
> using the "mozz-archiver" user agent. I will probably standardize this to
> match my proxy (i.e. "archiver" and "archiver-mozz"). I am not 100% decided
> that I will even respect robots files for this project (archive.org
> doesn't [2]), but right now I'm leaning towards "yes". If you are currently
> mirroring a large website like wikipedia over gemini, I would greatly
> appreciate it if you set up a robots.txt to block one of the above
> user-agents from those paths to make my life a bit easier.
So I've added the following:
User-agent: archiver
User-agent: archiver-mozz
User-agent: mozz-archiver
Disallow: /boston
to my /robots.txt file. I assume this will catch your archiver and prevent
my blog from being archived. Given that my blog is primarily web-based (and
mirrored to Gemini and gopher), it is already archived by existing web
archives.
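For anyone curious how those stacked User-agent lines behave, here's a
quick sketch using Python's stock robots.txt parser. The gemini:// URL is
just a made-up example, and this is not mozz-archiver's actual code; it
only shows that all three agents fall into the same rule group:

  from urllib.robotparser import RobotFileParser

  # The rules from above: three User-agent lines share one rule group,
  # so each of the three agents is blocked from /boston.
  rules = """\
  User-agent: archiver
  User-agent: archiver-mozz
  User-agent: mozz-archiver
  Disallow: /boston
  """

  parser = RobotFileParser()
  parser.parse(rules.splitlines())

  # Hypothetical URL for illustration; only the path matters here.
  url = "gemini://example.com/boston/2020/10/12.1"
  for agent in ("archiver", "archiver-mozz", "mozz-archiver"):
      print(agent, parser.can_fetch(agent, url))   # all three print False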
Now a question: when people check the archive, how will the missing
portions of a Gemini site be presented? I'm blocking the above because it's
a mirror of an existing web site (and it might fit your "large" category,
what with 20 years of content there), but there's no indication of that in
the robots.txt file.
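The best I can do, as far as I know, is a comment (robots.txt allows "#" to
end of line), which a human reading the file could see but an archiver
would ignore; the wording here is just an example:

  # /boston is a mirror of an existing web site, already archived elsewhere
  User-agent: mozz-archiver
  Disallow: /boston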
-spc