Hi all,

I have a few topics concerning crawling geminispace that I would like to
bring up.

First, I have recently updated my web proxy @ https://portal.mozz.us to
respect robots.txt files placed at the root of a gemini capsule. The proxy
will follow rules for either the "webproxy" or "webproxy-mozz" user agents.
My feelings have gone back and forth on this point because I'm not convinced
that a web proxy falls into the category of a "crawler"; it's a network
client that makes requests on behalf of real users. Nonetheless, geminispace
has been growing and I have started to see links to gemlog articles via the
proxy show up on twitter and the typical tech discussion boards. This
clashing of communities makes me nervous on a personal level, so I want to
give folks a chance to opt out of it, at least for my proxy.

Second, I have *slowly* been building a crawler for gemini [0]. My goal is
to create a historical archive of geminispace and submit it somewhere for
long-term preservation. This was inspired by Charles Lehner's recent email
to the list [1]. I've spent a fair amount of time in gopher, and I am deeply
saddened that so many of the important gopher servers from the 90's have
been lost forever. Currently my crawler respects robots.txt files using the
"mozz-archiver" user agent. I will probably standardize this to match my
proxy (i.e. "archiver" and "archiver-mozz"). I am not 100% decided that I
will even respect robots files for this project (archive.org doesn't [2]),
but right now I'm leaning towards "yes". If you are currently mirroring a
large website like wikipedia over gemini, I would greatly appreciate it if
you set up a robots.txt to block one of the above user-agents from those
paths to make my life a bit easier (see the example at the end of this
message).

best,
mozz

[0] https://github.com/michael-lazar/mozz-archiver
[1] gemini://gemi.dev/gemini-mailing-list/messages/002544.gmi
[2] https://blog.archive.org/2017/04/17/robots-txt-meant-for-search-engines-dont-work-well-for-web-archives/
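For illustration, here is a sketch of what such a robots.txt could look like
(the paths are made up; substitute your own). It blocks the archiver from a
mirrored site and keeps a gemlog away from the web proxy:

User-agent: archiver
User-agent: archiver-mozz
Disallow: /wikipedia-mirror/

User-agent: webproxy
User-agent: webproxy-mozz
Disallow: /gemlog/

Each record names one or more user agents followed by the path prefixes that
are off-limits to them; anything not listed stays allowed.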
It was thus said that the Great Michael Lazar once stated:
> Hi all,
>
> Second, I have *slowly* been building a crawler for gemini [0]. My goal is
> to create a historical archive of geminispace and submit it somewhere for
> long-term preservation. This was inspired by Charles Lehner's recent email
> to the list [1]. I've spent a fair amount of time in gopher, and I am
> deeply saddened that so many of the important gopher servers from the 90's
> have been lost forever. Currently my crawler respects robots.txt files
> using the "mozz-archiver" user agent. I will probably standardize this to
> match my proxy (i.e. "archiver" and "archiver-mozz"). I am not 100% decided
> that I will even respect robots files for this project (archive.org doesn't
> [2]), but right now I'm leaning towards "yes". If you are currently
> mirroring a large website like wikipedia over gemini, I would greatly
> appreciate it if you set up a robots.txt to block one of the above
> user-agents from those paths to make my life a bit easier.

So I've added the following:

User-agent: archiver
User-agent: archiver-mozz
User-agent: mozz-archiver
Disallow: /boston

to my /robots.txt file. I assume this will catch your archiver and prevent
my blog from being archived. Given that my blog is primarily web-based (and
mirrored to Gemini and gopher), it is already archived by existing web
archives.

Now a question: when people check the archive, how will the missing portions
of a Gemini site be presented? I'm blocking the above because it's a mirror
of an existing web site (and might fit your "large" category, what with 20
years of content there), but there's no indication in the robots.txt file of
that.

-spc
On Mon, Oct 12, 2020 at 12:06 AM Sean Conner <sean at conman.org> wrote:
> So I've added the following:
>
> User-agent: archiver
> User-agent: archiver-mozz
> User-agent: mozz-archiver
> Disallow: /boston
>
> to my /robots.txt file. I assume this will catch your archiver and prevent
> my blog from being archived. Given that my blog is primarily web-based (and
> mirrored to Gemini and gopher), it is already archived by existing web
> archives.
>
> Now a question: when people check the archive, how will the missing
> portions of a Gemini site be presented? I'm blocking the above because it's
> a mirror of an existing web site (and might fit your "large" category, what
> with 20 years of content there), but there's no indication in the
> robots.txt file of that.

I plan on always archiving the /robots.txt file if it exists, even if it's
excluded by the parsing rules. So you could always add your comments there
using the "#" syntax, which is part of the 1994 robots.txt standard [0]:

> User-agent: archiver
> Disallow: /boston # Content is a mirror of https://...

I also plan on recording failed requests as part of the archive, so you can
tell the difference between a bad DNS lookup and a connection timeout, etc.
I'll see if I can add a robots-exclusion error type to make it transparent
why the resource is missing.

(By the way, something like your blog archive does not fit my definition of
large :D)

[0] https://www.robotstxt.org/orig.html

- mozz
Hello,

Both of these projects and ideas sound great!

Maybe it's obvious, but just in case: make sure you respect the user agent
of "*" as well, which signifies any robot (see the example below).

makeworld
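For instance (a sketch, with made-up paths), a record for a specific crawler
can sit alongside a catch-all record:

User-agent: archiver-mozz
Disallow: /boston

User-agent: *
Disallow: /private

A client named "archiver-mozz" obeys the first record; any robot without a
record of its own should fall back to the "*" record.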