robots.txt for Gemini

You can go to literally any website and append "/robots.txt" to see what 
they use. I've already seen a couple that use "Allow".


Christian Seibold

Sent with ProtonMail Secure Email.

------- Original Message -------
On Sunday, March 22, 2020 1:13 PM, solderpunk <solderpunk at SDF.ORG> wrote:

> Howdy all,
>
> As the first and perhaps most important push toward getting some clear
> guidelines in place for well-behaved non-human Gemini clients (e.g.
> Gemini to web proxies, search engine spiders, feed aggregators, etc.),
> let's get to work on adapting robots.txt to Gemini.
>
> My current thinking is that this doesn't belong in the Gemini spec
> itself, much like robots.txt does not belong in the HTTP spec. That
> said, this feels like it warrants something more than just being put in
> the Best Practices doc. Maybe we need to start working on official
> "side specs", too. Not sure what these should be called.
>
> Anyway, I've refamiliarised myself with robots.txt. Turns out it is
> still only a de facto standard without an official RFC. My
> understanding is based on:
>
> -   The https://www.robotstxt.org/ website (which Wikipedia calls the
>     "Official website" at
>     https://en.wikipedia.org/wiki/Robots_exclusion_standard - it's not
>     clear to me what "official" means for a de facto standard), and in
>     particular:
>
> -   An old draft RFC from 1996 which that site hosts at
>     https://www.robotstxt.org/norobots-rfc.txt
>
> -   A new draft RFC from 2019 which appears to have gotten further than
>     the first, considering it is hosted by the IETF at
>     https://tools.ietf.org/html/draft-rep-wg-topic-00
>
> While the 1996 draft is web-specific, I was pleasantly surprised to
> see that the 2019 version is not. Section 2.3 says:
>
>
> > As per RFC3986 [1], the URI of the robots.txt is:
> > "scheme:[//authority]/robots.txt"
> > For example, in the context of HTTP or FTP, the URI is:
> > http://www.example.com/robots.txt
> > https://www.example.com/robots.txt
> > ftp://ftp.example.com/robots.txt
>
> So, why not Gemini too?
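>
> To make that concrete, here is a rough sketch (Python; the host name
> is purely illustrative) of what fetching a host's robots.txt over
> Gemini looks like: one TLS connection to port 1965, the request URL
> followed by CRLF, then a status line and the body.
>
>     import socket
>     import ssl
>
>     def fetch_robots_txt(host, port=1965):
>         """Fetch gemini://<host>/robots.txt; return (status, body)."""
>         context = ssl.create_default_context()
>         # Most Gemini servers use self-signed certificates (TOFU),
>         # so this sketch skips CA verification entirely.
>         context.check_hostname = False
>         context.verify_mode = ssl.CERT_NONE
>         sock = socket.create_connection((host, port))
>         tls = context.wrap_socket(sock, server_hostname=host)
>         tls.sendall(("gemini://%s/robots.txt\r\n" % host).encode("utf-8"))
>         response = b""
>         while True:
>             chunk = tls.recv(4096)
>             if not chunk:
>                 break
>             response += chunk
>         tls.close()
>         header, _, body = response.partition(b"\r\n")
>         status = header.decode("utf-8").split()[0]
>         return status, body.decode("utf-8", errors="replace")
>
> A "20" status means the file exists; a "51" (not found) should
> presumably be treated the same way as a 404 on the web, i.e. no
> restrictions.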
>
> Regarding the first practical question which was raised by Sean's recent
> post, it seems a no-brainer to me that Gemini should retain the
> convention of there being a single /robots.txt URL rather than having
> them per-directory or anything like that. Which it now seems was the
> intended behaviour of GUS all along, so I'm guessing nobody will find
> this controversial (but speak up if you do).
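>
> In code terms, a bot should derive the robots.txt location from the
> URL's authority alone and ignore the path entirely - a trivial sketch
> (URL hypothetical):
>
>     from urllib.parse import urlsplit, urlunsplit
>
>     def robots_url(url):
>         """Map any Gemini URL to its host's single robots.txt URL."""
>         parts = urlsplit(url)
>         return urlunsplit((parts.scheme, parts.netloc,
>                            "/robots.txt", "", ""))
>
>     print(robots_url("gemini://example.org/docs/spec.gmi"))
>     # -> gemini://example.org/robots.txt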
>
> https://www.robotstxt.org/robotstxt.html claims that excluding all files
> except one from robot access is "currently a bit awkward, as there is no
> "Allow" field". However, both the old and new RFC drafts clearly
> mention one. I am not sure exactly what the ground truth is here, in
> terms of how often Allow is used in the wild or to what extent it is
> obeyed even by well-intentioned bots. I would be very happy in principle
> to just declare that Allow lines are valid for Gemini robots.txt files,
> but if it turns out that popular programming languages have standard
> library tools for parsing robots.txt which don't choke on gemini:// URLs
> but don't recognise "Allow", this could quickly lead to unintended
> consequences, so perhaps it is best to be conservative here.
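>
> As a quick data point from one language at least, Python's standard
> library robots.txt parser can be poked at like this (whether other
> languages' parsers behave the same is exactly the open question):
>
>     from urllib.robotparser import RobotFileParser
>
>     # Per the 1996 draft the first matching rule wins, so the Allow
>     # line has to come before the broader Disallow.
>     rules = [
>         "User-agent: *",
>         "Allow: /public/",
>         "Disallow: /",
>     ]
>     parser = RobotFileParser()
>     parser.parse(rules)
>
>     # A parser which honours Allow prints True then False; one which
>     # ignores Allow lines would print False both times.
>     print(parser.can_fetch("*", "gemini://example.org/public/a.gmi"))
>     print(parser.can_fetch("*", "gemini://example.org/secret.gmi"))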
>
> If anybody happens to be familiar with current practice on the web with
> regard to Allow, please chime in.
>
> There is the question of caching. Both RFC drafts for robots.txt make
> it clear that standard HTTP caching mechanisms apply to robots.txt, but
> Gemini doesn't have an equivalent and I'm not interested in adding one
> yet, especially not for the purposes of robots.txt. And yet, obviously,
> some caching needs to take place. A spider requesting /robots.txt
> again and again for every document at a host is generating a lot of
> needless traffic. The 1996 RFC recommends "If no cache-control
> directives are present robots should default to an expiry of 7 days",
> while the 2019 one says "Crawlers SHOULD NOT use the cached version for
> more than 24 hours, unless the robots.txt is unreachable". My gut tells
> me most Gemini robots.txt files will change very infrequently and 7 days
> is more appropriate than 24 hours, but I'm happy for us to discuss this.
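>
> Whichever figure we pick, the bot-side logic is not much work. A rough
> sketch (the TTL value here is just for illustration):
>
>     import time
>
>     ROBOTS_TTL = 7 * 24 * 60 * 60      # 7 days, in seconds
>     _robots_cache = {}                 # host -> (fetched_at, rules)
>
>     def get_robots(host, fetch):
>         """Return robots.txt rules for host, re-fetching only once
>         the cached copy is older than ROBOTS_TTL.  `fetch` stands in
>         for whatever the bot uses to actually retrieve the file."""
>         now = time.time()
>         cached = _robots_cache.get(host)
>         if cached is not None and now - cached[0] < ROBOTS_TTL:
>             return cached[1]
>         rules = fetch(host)
>         _robots_cache[host] = (now, rules)
>         return rules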
>
> The biggest question, in my mind, is what to do about user-agents, which
> Gemini lacks (by design, as they are a component of the browser
> fingerprinting problem, and because they encourage content developers to
> serve browser-specific content, which is a bad thing IMHO). The 2019 RFC
> says "The product token SHOULD be part of the identification string that
> the crawler sends to the service" (where "product token" is bizarre and
> disappointingly commercial alternative terminology for "user-agent" in
> this document), so the fact that Gemini doesn't send one is not
> technically a violation.
>
> Of course, a robot doesn't need to send its user-agent in order to
> know its user-agent and interpret robots.txt accordingly. But it's
> much harder for Gemini server admins than their web counterparts to know
> exactly which bot is engaging in undesired behaviour and how to address
> it. Currently, the only thing that seems achievable in Gemini is to use
> the wildcard user-agent "*" to allow/disallow access by all bots to
> particular resources.
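>
> That is, the most an admin can currently express is a host-wide rule
> along these lines (paths purely illustrative), which every bot,
> whatever its purpose, is expected to obey:
>
>     User-agent: *
>     Disallow: /private/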
>
> But not all bots are equal. I'm willing to bet there are people using
> Gemini who are perfectly happy with e.g. the GUS search engine spider
> crawling their site to make it searchable via a service which is offered
> exclusively within Geminispace, but who are not happy with Gemini to web
> proxies accessing their content because they are concerned that
> poorly-written proxies will not disallow Google from crawling them so
> that Gemini content ends up being searchable within webspace. This is a
> perfectly reasonable stance to take and I think we should try to
> facilitate it.
>
> With no Gemini-specific changes to the de facto robots.txt spec, this
> would require admins to manually maintain either a whitelist of
> Gemini-only search engine spiders in their robots.txt or a blacklist
> of web proxies. This is easy today, when you can count the number of
> either kind on one hand, but it does not scale well and is not a
> reasonable thing to expect admins to do in order to enforce a reasonable
> stance.
>
> (and, really, this isn't a Gemini-specific problem and I'm surprised
> that what I'm about to propose isn't a thing for the web)
>
> I have mentioned previously on this list (quickly, in passing) the idea
> of "meta user-agents" (I didn't use that term when I first mentioned
> it). But since there is no way for Gemini server admins to learn the
> user-agent of arbitrary bots, we could define a small (I'm thinking ~5
> would suffice, surely 10 at most) number of pre-defined user-agents
> which all bots of a given kind MUST respect (in addition to optionally
> having their own individual user-agent). A very rough sketch of some
> possibilities, not meant to be exhaustive or even very good, just to
> give the flavour:
>
> -   A user-agent of "webproxy" which must be respected by all web proxies.
>     Possibly this could have sub-types for proxies which do and don't
>     forbid web search engines?
>
> -   A user-agent of "search" which must be respected by all search engine
>     spiders
>
> -   A user-agent of "research" for bots which crawl a site without making
>     specific results of their crawl publicly available (I've thought of
>     writing something like this to study the growth of Geminispace and the
>     structure of links between documents)
>
> Enumerating actual use cases is probably the wrong way to go about it;
> rather, we should think of broad classes of behaviour which differ with
> regard to privacy implications - e.g. bots which don't make the results
> of their crawling public, bots which make their results public over
> Gemini only, bots which breach the Gemini-web barrier, etc.
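>
> To make it concrete, an admin who is happy with in-Gemini search but
> wants no web proxies or research crawls could then publish something
> like this (the meta user-agent names are only the rough sketch above,
> nothing here is decided):
>
>     # Gemini-only search engine spiders may index everything
>     User-agent: search
>     Disallow:
>
>     # Web proxies and research crawlers may not fetch anything
>     User-agent: webproxy
>     Disallow: /
>
>     User-agent: research
>     Disallow: /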
>
> Do people think this is a good idea?
>
> Can anybody think of other things to consider in adapting robots.txt to
> Gemini?
>
> Cheers,
> Solderpunk
>
