Assuming disallow-all, and some research on robots.txt in Geminispace (Was: Re: robots.txt for Gemini formalised)

On Tue, 2020-11-24 at 18:44 -0500, John Cowan wrote:
> On Tue, Nov 24, 2020 at 3:25 PM Nick Thomas <gemini at ur.gs> wrote:
> 
> > Thanks for running the numbers on this. I agree with everything you
> > said based on them. That any change affects such a large proportion
> > of
> > existing geminispace is especially worth emphasising.
> > 
> 
> Why is that a Good Thing? 

I very intentionally *didn't* say it was a good thing :). There are
many ways to interpret the data, but I'm still glad we have it.

> It's another piece of bureaucracy: 90% of hosts
> were happy to be archived before

You're presuming consent here. We don't actually *know* that said 90%
of hosts are happy to be archived; we only know that 90% of hosts
haven't included a robots.txt file, which could be for any one of a
multitude of reasons.

*If* a significant proportion of the hosts without robots.txt files
would actually prefer not to be included in archives when asked, the
current situation is not serving their privacy well, and gemini is
supposed to be protective of user privacy. *If* an overwhelming
majority of them simply don't care, then sure, the argument for it
starts to look a bit niche. Talking on IRC earlier today, I hand-waved
a 5% threshold for the first condition and 1% for the second.

A personal example: *I* didn't have a robots.txt file on my capsule
until today, but I don't want to be included in archives for various
reasons. Presuming consent from the lack of a robots.txt file would
have incorrectly guessed my preference, and harmed my privacy. Who else
in that 90% is like me? We don't know.
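For reference, a disallow-all robots.txt like the one I describe above
can be as small as this (assuming the usual web robots.txt conventions
carry over to Gemini, as the proposal in this thread suggests):

```
# Ask all well-behaved crawlers to skip the entire capsule
User-agent: *
Disallow: /
```

Served as /robots.txt at the capsule root, that's the whole file.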

> so now they have to write a robots.txt
> file.  Although small for any one server operator, it is large when
> multiplied by the number of servers there *will be*.  "Small
> Internet" does
> not mean "Internet with only a few servers", AFAIK.

Yes, there is a convenience/privacy trade-off here. I interpret
gemini's mission to favour privacy over convenience when the two come
into conflict.

> Two things about the Internet Archive:
> 
> 1) It is a U.S. public library, which gives it special rights when it
> comes
> to making copies.

Certainly true, and there will be cases where, even when you do have a
wonderfully hand-crafted robots.txt file like the one I made today, an
archiver determines that it can legally scrape you anyway. Others will
scrape illegally, whether through malice or ignorance.

Meanwhile, Google, the Internet Archive, and a bunch of other people
respect robots.txt even where GDPR-like provisions don't legally
*require* them to. A control doesn't have to be perfect to be
desirable. This argument comes up a lot in the context of the "right to
be forgotten" ^^.
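The check a well-behaved archiver performs is simple enough to sketch.
This is a deliberately simplified illustration (it ignores User-agent
grouping and Allow rules, which a real parser would handle):

```python
def parse_disallows(robots_txt: str) -> list[str]:
    """Collect Disallow path prefixes, skipping comments and blank lines."""
    prefixes = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # strip trailing comments
        if line.lower().startswith("disallow:"):
            prefixes.append(line.split(":", 1)[1].strip())
    return prefixes

def may_fetch(path: str, robots_txt: str) -> bool:
    """True if no non-empty Disallow prefix matches the request path."""
    return not any(p and path.startswith(p)
                   for p in parse_disallows(robots_txt))
```

With a disallow-all file ("User-agent: *\nDisallow: /"), may_fetch
returns False for every path; an empty "Disallow:" line permits
everything, per the usual convention.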

> 2) Though it does not respect robots.txt, it is happy to make your
> content
> invisible to archive users by informal request (or, of course, by a
> DMCA
> takedown notice).

As I understand it, archive.org does respect robots.txt in general, but
makes exceptions for certain sites where it has identified a public-
interest justification. That includes the US military, but probably
doesn't include any currently-existing gemini site.

/Nick
