crawling, proxies, and robots.txt

On Mon, Oct 12, 2020 at 12:06 AM Sean Conner <sean at conman.org> wrote:
> So I've added the following:
>
> User-agent: archiver
> User-agent: archiver-mozz
> User-agent: mozz-archiver
> Disallow: /boston
>
> to my /robots.txt file. I assume this will catch your archiver and
> prevent my blog from being archived. Given that my blog is primarily
> web-based (and mirrored to Gemini and gopher), it is already archived
> (by existing web archives).
>
>  Now a question: when people check the archive, how will the missing
> portions of a Gemini site be presented? I'm blocking the above because
> it's a mirror of an existing web site (and might fit your "large"
> category, what with 20 years of content there), but there's no
> indication of that in the robots.txt file.

I plan on always archiving the /robots.txt file if it exists, even if the
file's own rules would exclude it. So you could always add your comments
there using the "#" syntax, which is part of the 1994 robots.txt standard [0].

> User-agent: archiver
> Disallow: /boston  # Content is a mirror of https://...
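
For illustration, here's a minimal sketch of how that rule would be
evaluated, using Python's standard urllib.robotparser rather than the
archiver's actual code (the gemini URLs and the "mozz-archiver"
user-agent string are just placeholders; note that Python's parser does
loose substring matching on the agent token, which may differ from what
the archiver ends up doing):

    from urllib.robotparser import RobotFileParser

    # The rule from above, end-of-line comment included.
    robots_lines = [
        "User-agent: archiver",
        "Disallow: /boston  # Content is a mirror of https://...",
    ]

    parser = RobotFileParser()
    parser.parse(robots_lines)

    # The "#" comment is stripped before matching, and only the path
    # component of the URL is compared, so gemini:// URLs work here.
    print(parser.can_fetch("mozz-archiver", "gemini://example.com/boston/1.gmi"))  # False
    print(parser.can_fetch("mozz-archiver", "gemini://example.com/index.gmi"))     # True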

I also plan on recording failed requests as part of the archive, so you can
tell the difference between a bad DNS lookup and a connection timeout, etc.
I'll see if I can add a robots-exclusion error type to make it transparent
why the resource is missing.
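
To make that concrete, here's a hypothetical sketch of what a per-request
record with an explicit error category might look like; the field names
and categories are invented for illustration and aren't the actual
archive format:

    from dataclasses import dataclass
    from enum import Enum
    from typing import Optional

    class FetchError(Enum):
        DNS_FAILURE = "dns-failure"              # hostname did not resolve
        CONNECTION_TIMEOUT = "connection-timeout"
        ROBOTS_EXCLUDED = "robots-excluded"      # skipped due to a robots.txt rule

    @dataclass
    class FetchRecord:
        url: str
        error: Optional[FetchError] = None       # None means the fetch succeeded

    # A resource skipped because of the exclusion above is recorded with a
    # reason rather than being silently absent from the archive.
    record = FetchRecord("gemini://example.com/boston/", FetchError.ROBOTS_EXCLUDED)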

(By the way, something like your blog archive does not fit my definition of
large :D)

[0] https://www.robotstxt.org/orig.html

- mozz
