You can go to literally any website and append "/robots.txt" to see what
they use. I've already seen a couple that use "Allow".

Christian Seibold

Sent with ProtonMail Secure Email.

------- Original Message -------
On Sunday, March 22, 2020 1:13 PM, solderpunk <solderpunk at SDF.ORG> wrote:

> Howdy all,
>
> As the first and perhaps most important push toward getting some clear
> guidelines in place for well-behaved non-human Gemini clients (e.g.
> Gemini to web proxies, search engine spiders, feed aggregators, etc.),
> let's get to work on adapting robots.txt to Gemini.
>
> My current thinking is that this doesn't belong in the Gemini spec
> itself, much like robots.txt does not belong in the HTTP spec. That
> said, this feels like it warrants something more than just being put in
> the Best Practices doc. Maybe we need to start working on official
> "side specs", too. Not sure what these should be called.
>
> Anyway, I've refamiliarised myself with robots.txt. Turns out it is
> still only a de facto standard without an official RFC. My
> understanding is based on:
>
> - The https://www.robotstxt.org/ website (which Wikipedia calls the
> "Official website" at
> https://en.wikipedia.org/wiki/Robots_exclusion_standard - it's not
> clear to me what "official" means for a de facto standard), and in
> particular:
>
> - An old draft RFC from 1996 which that site hosts at
> https://www.robotstxt.org/norobots-rfc.txt
>
> - A new draft RFC from 2019 which appears to have gotten further than
> the first, considering it is hosted by the IETF at
> https://tools.ietf.org/html/draft-rep-wg-topic-00
>
> While the 1996 draft is web-specific, I was pleasantly surprised to
> see that the 2019 version is not. Section 2.3 says:
>
> > As per RFC3986 [1], the URI of the robots.txt is:
> > "scheme:[//authority]/robots.txt"
> > For example, in the context of HTTP or FTP, the URI is:
> > http://www.example.com/robots.txt
> > https://www.example.com/robots.txt
> > ftp://ftp.example.com/robots.txt
>
> So, why not Gemini too?
>
> Regarding the first practical question which was raised by Sean's recent
> post, it seems a no-brainer to me that Gemini should retain the
> convention of there being a single /robots.txt URL rather than having
> them per-directory or anything like that. Which it now seems was the
> intended behaviour of GUS all along, so I'm guessing nobody will find
> this controversial (but speak up if you do).
>
> https://www.robotstxt.org/robotstxt.html claims that excluding all files
> except one from robot access is "currently a bit awkward, as there is no
> "Allow" field". However, both the old and new RFC drafts clearly
> mention one. I am not sure exactly what the ground truth is here, in
> terms of how often Allow is used in the wild or to what extent it is
> obeyed even by well-intentioned bots. I would be very happy in principle
> to just declare that Allow lines are valid for Gemini robots.txt files,
> but if it turns out that popular programming languages have standard
> library tools for parsing robots.txt which don't choke on gemini:// URLs
> but don't recognise "Allow", this could quickly lead to unintended
> consequences, so perhaps it is best to be conservative here.
>
> If anybody happens to be familiar with current practice on the web with
> regard to Allow, please chime in.
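
For what it's worth, Python's standard-library parser does recognise
"Allow" lines, and since can_fetch() only looks at the path component of
the URL it seems not to mind gemini:// either. A quick check along these
lines (the robots.txt body is made up for illustration):

    # Quick check (Python 3 standard library) that urllib.robotparser
    # copes with both "Allow" lines and gemini:// URLs.  The robots.txt
    # body below is hypothetical.
    from urllib import robotparser

    body = [
        "User-agent: *",
        "Allow: /private/public-index.gmi",
        "Disallow: /private/",
    ]

    rp = robotparser.RobotFileParser()
    rp.parse(body)  # feed the lines directly; rp.read() only speaks HTTP

    print(rp.can_fetch("*", "gemini://example.org/index.gmi"))
    # expected: True (no rule matches)
    print(rp.can_fetch("*", "gemini://example.org/private/secret.gmi"))
    # expected: False (Disallow: /private/)
    print(rp.can_fetch("*", "gemini://example.org/private/public-index.gmi"))
    # expected: True (the Allow line matches)

Note that this parser applies rules in file order, so the Allow line has
to come before the broader Disallow; a parser following the 2019 draft's
longest-match rule would reach the same answers here. Whether other
languages' libraries behave as well is exactly the open question.
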
> There is the question of caching. Both RFC drafts for robots.txt make
> it clear that standard HTTP caching mechanisms apply to robots.txt, but
> Gemini doesn't have an equivalent and I'm not interested in adding one
> yet, especially not for the purposes of robots.txt. And yet, obviously,
> some caching needs to take place. A spider requesting /robots.txt
> again and again for every document at a host is generating a lot of
> needless traffic. The 1996 RFC recommends "If no cache-control
> directives are present robots should default to an expiry of 7 days",
> while the 2019 one says "Crawlers SHOULD NOT use the cached version for
> more than 24 hours, unless the robots.txt is unreachable". My gut tells
> me most Gemini robots.txt files will change very infrequently and 7 days
> is more appropriate than 24 hours, but I'm happy for us to discuss this.
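
Nothing protocol-level seems necessary for this; a bot can just keep what
it fetched (or the fact that there was nothing to fetch) per host. A
rough sketch, assuming the 7-day figure, with fetch_gemini() standing in
for whatever the crawler uses to make requests:

    # Per-host robots.txt cache with a fixed expiry.  fetch_gemini() is a
    # placeholder for the crawler's own request machinery; it should
    # return the robots.txt body as a string, or None if there isn't one.
    import time
    from urllib.parse import urlparse

    ROBOTS_TTL = 7 * 24 * 60 * 60   # 7 days, per the 1996 draft's default
    _robots_cache = {}              # authority -> (fetched_at, body_or_None)

    def robots_txt_for(url, fetch_gemini):
        authority = urlparse(url).netloc
        cached = _robots_cache.get(authority)
        if cached and time.time() - cached[0] < ROBOTS_TTL:
            return cached[1]
        body = fetch_gemini("gemini://%s/robots.txt" % authority)
        _robots_cache[authority] = (time.time(), body)
        return body

Caching the absence of a robots.txt is just as important as caching its
contents, otherwise the needless traffic comes straight back.
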
> The biggest question, in my mind, is what to do about user-agents, which
> Gemini lacks (by design, as they are a component of the browser
> fingerprinting problem, and because they encourage content developers to
> serve browser-specific content which is a bad thing IMHO). The 2019 RFC
> says "The product token SHOULD be part of the identification string that
> the crawler sends to the service" (where "product token" is a bizarre
> and disappointingly commercial alternative term for "user-agent" in this
> document), so the fact that Gemini doesn't send one is not technically a
> violation.
>
> Of course, a robot doesn't need to send its user-agent in order to
> know its user-agent and interpret robots.txt accordingly. But it's
> much harder for Gemini server admins than their web counterparts to know
> exactly which bot is engaging in undesired behaviour and how to address
> it. Currently, the only thing that seems achievable in Gemini is to use
> the wildcard user-agent "*" to allow/disallow access by all bots to
> particular resources.
>
> But not all bots are equal. I'm willing to bet there are people using
> Gemini who are perfectly happy with e.g. the GUS search engine spider
> crawling their site to make it searchable via a service which is offered
> exclusively within Geminispace, but who are not happy with Gemini to web
> proxies accessing their content, because they are concerned that
> poorly-written proxies will not disallow Google from crawling them, so
> that Gemini content ends up being searchable within webspace. This is a
> perfectly reasonable stance to take and I think we should try to
> facilitate it.
>
> With no Gemini-specific changes to the de facto robots.txt spec, this
> would require admins to either manually maintain a whitelist of
> Gemini-only search engine spiders in their robots.txt or a blacklist
> of web proxies. This is easy today, when you can count the number of
> either of these things on one hand, but it does not scale well and is
> not a reasonable thing to expect admins to do in order to enforce a
> reasonable stance.
>
> (and, really, this isn't a Gemini-specific problem and I'm surprised
> that what I'm about to propose isn't a thing for the web)
>
> I have mentioned previously on this list (quickly, in passing), the idea
> of "meta user-agents" (I didn't use that term when I first mentioned
> it). But since there is no way for Gemini server admins to learn the
> user-agent of arbitrary bots, we could define a small (I'm thinking ~5
> would suffice, surely 10 at most) number of pre-defined user-agents
> which all bots of a given kind MUST respect (in addition to optionally
> having their own individual user-agent). A very rough sketch of some
> possibilities, not meant to be exhaustive or even very good, just to
> give the flavour:
>
> - A user-agent of "webproxy" which must be respected by all web proxies.
> Possibly this could have sub-types for proxies which do and don't
> forbid web search engines?
>
> - A user-agent of "search" which must be respected by all search engine
> spiders
>
> - A user-agent of "research" for bots which crawl a site without making
> specific results of their crawl publicly available (I've thought of
> writing something like this to study the growth of Geminispace and the
> structure of links between documents)
>
> Enumerating actual use cases is probably the wrong way to go about it;
> rather, we should think of broad classes of behaviour which differ with
> regard to privacy implications - e.g. bots which don't make the results
> of their crawling public, bots which make their results public over
> Gemini only, bots which breach the Gemini-web barrier, etc.
>
> Do people think this is a good idea?
>
> Can anybody think of other things to consider in adapting robots.txt to
> Gemini?
>
> Cheers,
> Solderpunk
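
To make the meta user-agent idea concrete: a capsule that is happy to be
indexed by Gemini-only search engines, but wants to keep web proxies and
research crawlers out, might publish something like the following. (The
"webproxy", "search" and "research" names are only the rough sketch
above, nothing agreed on yet.)

    # Hypothetical gemini://example.org/robots.txt using the proposed
    # meta user-agents (names are illustrative only)
    User-agent: webproxy
    Disallow: /

    User-agent: research
    Disallow: /

    # everything else, including "search" spiders, remains allowed
    User-agent: *
    Disallow:

Under the wildcard-only status quo, the same admin would instead have to
list every known web proxy by name and keep that list up to date.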