crawling, proxies, and robots.txt

Hi all,

I have a few topics concerning crawling geminispace that I would like to bring
up.

First, I have recently updated my web proxy @ https://portal.mozz.us to respect
robots.txt files placed at the root of a gemini capsule. The proxy will follow
rules for either the "webproxy" or "webproxy-mozz" user agents. My feelings
have gone back and forth on this point because I'm not convinced that a web
proxy falls into the category of a "crawler"; it's a network client that makes
requests on behalf of real users. Nonetheless, geminispace has been growing and
I have started to see proxied links to gemlog articles show up on Twitter and
the typical tech discussion boards. This clashing of communities makes me
nervous on a personal level so I want to give folks a chance to opt-out of it,
at least for my proxy.
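
For example, a capsule owner who wants to opt out of the proxy entirely could
serve a robots.txt at the capsule root that disallows the "webproxy" user
agent (the paths here are just illustrative):

    # gemini://example.capsule/robots.txt
    User-agent: webproxy
    Disallow: /

Using the more specific "webproxy-mozz" user agent instead would block only my
proxy while leaving other web proxies unaffected.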

Second, I have *slowly* been building a crawler for gemini [0]. My goal is to
create a historical archive of geminispace and submit it somewhere for
long-term preservation. This was inspired by Charles Lehner's recent email to
the list [1]. I've spent a fair amount of time in gopher, and I am deeply
saddened that so many of the important gopher servers from the '90s have been
lost forever. Currently my crawler is respecting robots.txt files using the
"mozz-archiver" user agent. I will probably standardize this to match my proxy
(i.e. "archiver" and "archiver-mozz"). I am not 100% decided that I will even
respect robots files for this project (archive.org doesn't [2]), but right now
I'm leaning towards "yes". If you are currently mirroring a large website like
Wikipedia over gemini, I would greatly appreciate it if you set up a robots.txt
to block one of the above user agents from those paths to make my life a bit
easier.
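
As a sketch, a capsule serving such a mirror under a hypothetical /wikipedia/
path could block just the archiver (using the current "mozz-archiver" user
agent) while leaving the rest of the capsule crawlable:

    # robots.txt at the capsule root
    User-agent: mozz-archiver
    Disallow: /wikipedia/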

best,
mozz

[0] https://github.com/michael-lazar/mozz-archiver
[1] gemini://gemi.dev/gemini-mailing-list/messages/002544.gmi
[2] https://blog.archive.org/2017/04/17/robots-txt-meant-for-search-engines-dont-work-well-for-web-archives/
