On Mon, Oct 12, 2020 at 12:06 AM Sean Conner <sean at conman.org> wrote:

> So I've added the following:
>
> User-agent: archiver
> User-agent: archiver-mozz
> User-agent: mozz-archiver
> Disallow: /boston
>
> to my /robots.txt file. I assume this will catch your archiver and prevent
> my blog from being archived, given that my blog is primarily web-based (and
> mirrored to Gemini and gopher) it is already archived (by existing web
> archives).
>
> Now a question: when people can check the archive, how will the missing
> portions of a Gemini site be presented? I'm blocking the above because it's
> a mirror of an existing web site (and might fit your "large" category, what
> with 20 years of content there), but there's no indication in the robots.txt
> file of that.

I plan on always archiving the /robots.txt file if it exists, even if it's
excluded by the parsing rules. So you could always add your comments there
using the "#" syntax, which is part of the 1994 robots.txt standard [0].

> User-agent: archiver
> Disallow: /boston # Content is a mirror of https://...

I also plan on recording failed requests as part of the archive, so you can
tell the difference between a bad DNS lookup and a connection timeout, etc.
I'll see if I can add a "robots exclusion" error type to make it transparent
why the resource is missing.

(By the way, something like your blog archive does not fit my definition of
large :D)

[0] https://www.robotstxt.org/orig.html

- mozz
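
For illustration only, here is one way a crawler might represent such
failure records internally. This is a sketch, not mozz-archiver's actual
format: the class names, field names, and error categories below are
assumptions.

    # Sketch: hypothetical record types for archiving failed requests.
    # Names and categories are illustrative, not mozz-archiver's format.
    from dataclasses import dataclass
    from enum import Enum


    class FetchError(Enum):
        DNS_FAILURE = "dns-failure"          # hostname did not resolve
        CONNECTION_TIMEOUT = "timeout"       # server never answered
        ROBOTS_EXCLUDED = "robots-excluded"  # skipped per /robots.txt rules


    @dataclass
    class FailedFetch:
        url: str
        error: FetchError
        detail: str = ""


    # Example: a record explaining why a resource is absent from the archive.
    record = FailedFetch(
        url="gemini://example.org/boston/",
        error=FetchError.ROBOTS_EXCLUDED,
        detail="Disallow: /boston (User-agent: archiver)",
    )
    print(record.error.value, record.url)

A reader of the archive could then see that a missing resource was
deliberately excluded by robots.txt rather than lost to a network error.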