robots.txt for Gemini formalised

(I could be a lot better at using mailing lists. I think this message
was sent privately in error).

On Tue, 2020-11-24 at 08:15 -0500, A. E. Spencer-Reed wrote:
> Why do you dislike archival?

Thanks for weighing in!

In short, because the purposes to which the archive can be put, and the
motives of the archiver, are not clear at the time of robots.txt-
mediated archival.

For myself, I'm happy with some types of archival, and not happy with
others. Some people would be happy to be included in every archive
going; others, in none of them. Given this variability, we must take a
stance on what to assume if robots.txt isn't present. I also don't
think this variability is amenable to capture with more fine-grained
virtual agents.

The current internet-draft for robots.txt says, in 2.2.1:

>  If no group satisfies either condition, or no groups are present at
> all, no rules apply.

( https://tools.ietf.org/html/draft-koster-rep-00 )
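To make that default concrete, here is a minimal sketch in Python (the
Group type and the matching logic are my own simplification, not text
from the draft):

    from dataclasses import dataclass, field

    @dataclass
    class Group:
        """One robots.txt group: user-agent lines plus Disallow rules."""
        agents: list                                  # e.g. ["archiver"]
        disallow: list = field(default_factory=list)  # path prefixes

    def may_fetch_web(groups, agent, path):
        # Prefer a group naming this agent, else fall back to "*".
        matched = next((g for g in groups if agent in g.agents), None)
        if matched is None:
            matched = next((g for g in groups if "*" in g.agents), None)
        # 2.2.1: "If no group satisfies either condition, or no groups
        # are present at all, no rules apply" -- the fetch is allowed.
        if matched is None:
            return True
        return not any(path.startswith(p) for p in matched.disallow)

    # With no robots.txt at all, everything is fetchable:
    assert may_fetch_web([], "archiver", "/journal/") is True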

This is pretty standard on the Web and, entirely coincidentally, a huge
boon to Google et al. Importing robots.txt the way we do in the
companion specification also imports this line.

However, unlike the Web, Gemini "takes user privacy very seriously".
Archives *can* be injurious to user privacy - if you need convincing on
this point, there is a range of cases and examples around GDPR "right
to be forgotten" stuff. From my perspective, Gemini is importing a line
from the internet-draft that is directly contrary to its mission.

Combining Gemini's mission with that realisation means that if no
statement has been made about whether the given user (server operator
in this specific case) is OK with their content being archived, the
presumption should be that they are not OK with it. We should value
user privacy above archiver convenience.

In effect, we add a second exception to the protocol, amending 2.2.1
to end "if no rules are specified, this robots.txt file MUST be
assumed".

On a practical level, being excluded from search engines by default
drives the discoverability of robots.txt, and server software could
easily include flags like --permit-indexing or --permit-archival to
streamline opting in. I don't think that opt-in rates would be similar
to current opt-out rates on the Web.
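Neither flag exists in any server I'm aware of; a sketch of how one
might be wired up:

    import argparse

    # Hypothetical flags -- the point is only that opt-in is cheap
    # for server authors to support.
    parser = argparse.ArgumentParser()
    parser.add_argument("--permit-indexing", action="store_true",
                        help="allow search-engine crawlers (indexer)")
    parser.add_argument("--permit-archival", action="store_true",
                        help="allow archiving crawlers (archiver)")
    args = parser.parse_args()

    # Build the robots.txt served at /robots.txt: under opt-in,
    # each virtual agent stays disallowed unless its flag is given.
    lines = []
    for agent, permitted in [("indexer", args.permit_indexing),
                             ("archiver", args.permit_archival)]:
        if not permitted:
            lines += ["User-agent: " + agent, "Disallow: /", ""]
    robots_txt = "\n".join(lines)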

/Nick
