Howdy all,

As the first and perhaps most important push toward getting some clear guidelines in place for well-behaved non-human Gemini clients (e.g. Gemini to web proxies, search engine spiders, feed aggregators, etc.), let's get to work on adapting robots.txt to Gemini.

My current thinking is that this doesn't belong in the Gemini spec itself, much like robots.txt does not belong in the HTTP spec. That said, this feels like it warrants something more than just being put in the Best Practices doc. Maybe we need to start working on official "side specs", too. Not sure what these should be called.

Anyway, I've refamiliarised myself with robots.txt. Turns out it is still only a de facto standard without an official RFC. My understanding is based on:
You can go to literally any website and append "/robots.txt" to see what they use. I've already seen a couple that use "Allow".

Christian Seibold

Sent with ProtonMail Secure Email.

------- Original Message -------
On Sunday, March 22, 2020 1:13 PM, solderpunk <solderpunk at SDF.ORG> wrote:

> Howdy all,
>
> As the first and perhaps most important push toward getting some clear
> guidelines in place for well-behaved non-human Gemini clients (e.g.
> Gemini to web proxies, search engine spiders, feed aggregators, etc.),
> let's get to work on adapting robots.txt to Gemini.
>
> My current thinking is that this doesn't belong in the Gemini spec
> itself, much like robots.txt does not belong in the HTTP spec. That
> said, this feels like it warrants something more than just being put in
> the Best Practices doc. Maybe we need to start working on official
> "side specs", too. Not sure what these should be called.
>
> Anyway, I've refamiliarised myself with robots.txt. Turns out it is
> still only a de facto standard without an official RFC. My
> understanding is based on:
>
> - The https://www.robotstxt.org/ website (which Wikipedia calls the
>   "Official website" at
>   https://en.wikipedia.org/wiki/Robots_exclusion_standard - it's not
>   clear to me what "official" means for a de facto standard), and in
>   particular:
>
> - An old draft RFC from 1996 which that site hosts at
>   https://www.robotstxt.org/norobots-rfc.txt
>
> - A new draft RFC from 2019 which appears to have gotten further than
>   the first, considering it is hosted by the IETF at
>   https://tools.ietf.org/html/draft-rep-wg-topic-00
>
> While the 1996 draft is web-specific, I was pleasantly surprised to
> see that the 2019 version is not.
> Section 2.3 says:
>
> > As per RFC3986 [1], the URI of the robots.txt is:
> > "scheme:[//authority]/robots.txt"
> > For example, in the context of HTTP or FTP, the URI is:
> > http://www.example.com/robots.txt
> > https://www.example.com/robots.txt
> > ftp://ftp.example.com/robots.txt
>
> So, why not Gemini too?
>
> Regarding the first practical question which was raised by Sean's recent
> post, it seems a no-brainer to me that Gemini should retain the
> convention of there being a single /robots.txt URL rather than having
> them per-directory or anything like that. Which it now seems was the
> intended behaviour of GUS all along, so I'm guessing nobody will find
> this controversial (but speak up if you do).
>
> https://www.robotstxt.org/robotstxt.html claims that excluding all files
> except one from robot access is "currently a bit awkward, as there is no
> "Allow" field". However, both the old and new RFC drafts clearly
> mention one. I am not sure exactly what the ground truth is here, in
> terms of how often Allow is used in the wild or to what extent it is
> obeyed even by well-intentioned bots. I would be very happy in principle
> to just declare that Allow lines are valid for Gemini robots.txt files,
> but if it turns out that popular programming languages have standard
> library tools for parsing robots.txt which don't choke on gemini:// URLs
> but don't recognise "Allow", this could quickly lead to unintended
> consequences, so perhaps it is best to be conservative here.
>
> If anybody happens to be familiar with current practice on the web with
> regard to Allow, please chime in.
>
> There is the question of caching. Both RFC drafts for robots.txt make
> it clear that standard HTTP caching mechanisms apply to robots.txt, but
> Gemini doesn't have an equivalent and I'm not interested in adding one
> yet, especially not for the purposes of robots.txt. And yet, obviously,
> some caching needs to take place.
> A spider requesting /robots.txt
> again and again for every document at a host is generating a lot of
> needless traffic. The 1996 RFC recommends "If no cache-control
> directives are present robots should default to an expiry of 7 days",
> while the 2019 one says "Crawlers SHOULD NOT use the cached version for
> more than 24 hours, unless the robots.txt is unreachable". My gut tells
> me most Gemini robots.txt files will change very infrequently and 7 days
> is more appropriate than 24 hours, but I'm happy for us to discuss this.
>
> The biggest question, in my mind, is what to do about user-agents, which
> Gemini lacks (by design, as they are a component of the browser
> fingerprinting problem, and because they encourage content developers to
> serve browser-specific content, which is a bad thing IMHO). The 2019 RFC
> says "The product token SHOULD be part of the identification string that
> the crawler sends to the service" (where "product token" is bizarre and
> disappointingly commercial alternative terminology for "user-agent" in
> this document), so the fact that Gemini doesn't send one is not
> technically a violation.
>
> Of course, a robot doesn't need to send its user-agent in order to
> know its user-agent and interpret robots.txt accordingly. But it's
> much harder for Gemini server admins than their web counterparts to know
> exactly which bot is engaging in undesired behaviour and how to address
> it. Currently, the only thing that seems achievable in Gemini is to use
> the wildcard user-agent "*" to allow/disallow access by all bots to
> particular resources.
>
> But not all bots are equal. I'm willing to bet there are people using
> Gemini who are perfectly happy with e.g.
> the GUS search engine spider
> crawling their site to make it searchable via a service which is offered
> exclusively within Geminispace, but who are not happy with Gemini to web
> proxies accessing their content because they are concerned that
> poorly-written proxies will not disallow Google from crawling them, so
> that Gemini content ends up being searchable within webspace. This is a
> perfectly reasonable stance to take and I think we should try to
> facilitate it.
>
> With no Gemini-specific changes to the de facto robots.txt spec, this
> would require admins to either manually maintain a whitelist of
> Gemini-only search engine spiders in their robots.txt or a blacklist
> of web proxies. This is easy today, when you can count the number of
> either of those things on one hand, but it does not scale well and is
> not a reasonable thing to expect admins to do in order to enforce a
> reasonable stance.
>
> (and, really, this isn't a Gemini-specific problem and I'm surprised
> that what I'm about to propose isn't a thing for the web)
>
> I have mentioned previously on this list (quickly, in passing) the idea
> of "meta user-agents" (I didn't use that term when I first mentioned
> it). Since there is no way for Gemini server admins to learn the
> user-agent of arbitrary bots, we could define a small (I'm thinking ~5
> would suffice, surely 10 at most) number of pre-defined user-agents
> which all bots of a given kind MUST respect (in addition to optionally
> having their own individual user-agent). A very rough sketch of some
> possibilities, not meant to be exhaustive or even very good, just to
> give the flavour:
>
> - A user-agent of "webproxy" which must be respected by all web proxies.
>   Possibly this could have sub-types for proxies which do and don't
>   forbid web search engines?
> - A user-agent of "search" which must be respected by all search engine
>   spiders
>
> - A user-agent of "research" for bots which crawl a site without making
>   specific results of their crawl publicly available (I've thought of
>   writing something like this to study the growth of Geminispace and the
>   structure of links between documents)
>
> Enumerating actual use cases is probably the wrong way to go about it;
> rather, we should think of broad classes of behaviour which differ with
> regard to privacy implications - e.g. bots which don't make the results
> of their crawling public, bots which make their results public over
> Gemini only, bots which breach the Gemini-web barrier, etc.
>
> Do people think this is a good idea?
>
> Can anybody think of other things to consider in adapting robots.txt to
> Gemini?
>
> Cheers,
> Solderpunk
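[The caching behaviour discussed above - reuse a fetched robots.txt for 7 days rather than re-requesting it per document - can be sketched as follows. This is an editorial illustration, not part of any spec; the `fetch` callback and function names are invented for the example.]

```python
import time

# 7-day expiry, following the 1996 draft's recommendation; the 2019
# draft would use 24 hours instead.
ROBOTS_TTL = 7 * 24 * 60 * 60

_cache = {}  # host -> (fetched_at, robots_txt_text)

def get_robots_txt(host, fetch):
    """Return robots.txt for `host`, refetching only after the TTL expires.

    `fetch` is a caller-supplied function (host -> text) standing in for
    an actual Gemini request for gemini://host/robots.txt.
    """
    now = time.monotonic()
    cached = _cache.get(host)
    if cached is None or now - cached[0] > ROBOTS_TTL:
        _cache[host] = (now, fetch(host))
    return _cache[host][1]
```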
solderpunk writes:

> But since there is no way for Gemini server admins to learn the
> user-agent of arbitrary bots, we could define a small (I'm thinking ~5
> would suffice, surely 10 at most) number of pre-defined user-agents
> which all bots of a given kind MUST respect (in addition to optionally
> having their own individual user-agent). A very rough sketch of some
> possibilities, not meant to be exhaustive or even very good, just to
> give the flavour:

I think this is probably the right approach, since it doesn't require adding user-agents to the protocol.

> * A user-agent of "webproxy" which must be respected by all web
>   proxies. Possibly this could have sub-types for proxies which do and
>   don't forbid web search engines?

webproxy-bot and webproxy-nobot, perhaps.

> * A user-agent of "search" which must be respected by all search
>   engine spiders
>
> * A user-agent of "research" for bots which crawl a site without
>   making specific results of their crawl publicly available (I've
>   thought of writing something like this to study the growth of
>   Geminispace and the structure of links between documents)

Another type I can think of is "archive", for things that rehost existing gemini content elsewhere on gemini. Besides being another use case, this category also has the implication that it may make deleted content available (a la the Wayback Machine).

-- 
+-----------------------------------------------------------+
| Jason F. McBrayer               jmcbray at carcosa.net     |
| If someone conquers a thousand times a thousand others in |
| battle, and someone else conquers himself, the latter one |
| is the greatest of all conquerors. --- The Dhammapada     |
+-----------------------------------------------------------+
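[Putting the proposals in this thread together, a server admin's /robots.txt might hypothetically look like the sketch below. All of the meta user-agent names here - webproxy, search, research, archive - are proposals under discussion in this thread, not settled names.]

```
# Hypothetical Gemini /robots.txt using the proposed meta user-agents

# Web proxies may not fetch anything
User-agent: webproxy
Disallow: /

# Archiving crawlers must stay out of the journal
User-agent: archive
Disallow: /journal/

# Gemini-only search spiders and research crawlers may fetch everything
User-agent: search
User-agent: research
Disallow:
```

(An empty Disallow field conventionally means "nothing is disallowed".)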
On Sun, Mar 22, 2020 at 10:39:05PM +0000, Krixano wrote:

> You can go to literally any website and append "/robots.txt" to see
> what they use. I've already seen a couple that use "Allow".

Thanks for that. I've also verified that Python's stdlib function for parsing robots.txt recognises "Allow", so it seems that this is not as obscure an option as some sources suggest. I'd say we may as well explicitly support this for Gemini.

Cheers,
Solderpunk
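[solderpunk's observation can be demonstrated directly: Python's urllib.robotparser accepts "Allow" lines and does not choke on gemini:// URLs. The robots.txt contents and agent names below are invented for illustration. One caveat worth noting: Python's parser applies the first matching rule, so Allow lines must precede a broader Disallow.]

```python
import urllib.robotparser

# Invented robots.txt: the hypothetical "GUS" spider may fetch anything;
# all other robots may only fetch /public/.  The Allow line comes before
# the broad Disallow because Python applies the first matching rule.
ROBOTS_TXT = """\
User-agent: GUS
Allow: /

User-agent: *
Allow: /public/
Disallow: /
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

print(parser.can_fetch("GUS", "gemini://example.com/secret/diary.gmi"))      # True
print(parser.can_fetch("webproxy", "gemini://example.com/public/page.gmi"))  # True
print(parser.can_fetch("webproxy", "gemini://example.com/secret/diary.gmi")) # False
```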
On Tue, Mar 24, 2020 at 08:47:53AM -0400, Jason McBrayer wrote:

> Another type I can think of is "archive", for things that rehost
> existing gemini content elsewhere on gemini. Besides being another use
> case, this category also has the implication that it may make deleted
> content available (a la the Wayback Machine).

Yes, certainly! This is actually quite an important use case. Many folks are unhappy that archive.org no longer respects robots.txt, while on the other hand the archive.org folks argue that people were writing robots.txt rules based on how they wanted search engine robots to act rather than considering archive bots. Making it easy for Gemini server admins to explicitly set different policies for the two kinds of bots (if they want to!) seems a substantial improvement.

Cheers,
Solderpunk
It was thus said that the Great solderpunk once stated:

> The biggest question, in my mind, is what to do about user-agents, which
> Gemini lacks (by design, as they are a component of the browser
> fingerprinting problem, and because they encourage content developers to
> serve browser-specific content which is a bad thing IMHO). The 2019 RFC
> says "The product token SHOULD be part of the identification string that
> the crawler sends to the service" (where "product token" is bizarre and
> disappointingly commercial alternative terminology for "user-agent" in
> this document), so the fact that Gemini doesn't send one is not
> technically a violation.

Two possible solutions for robot identification:

1) Allow IP addresses to be used where a user-agent would be specified. Some examples:

	User-agent: 172.16.89.3
	User-agent: 172.17.24.0/27
	User-agent: fde7:a680:47d3::/48

Yes, I'm including CIDR (Classless Inter-Domain Routing) notation to specify a range of IP addresses. For a robot, if its IP address matches one of these addresses (or ranges), then it needs to follow the rules given for it.

2) Use the fragment portion of a URL to designate a robot. The fragment portion of a URL has no meaning for a server (it does for a client). A robot could use this fact to slip in its identifier when making a request. The server MUST NOT use this information, but the logs could show it. For example, a robot could request:

	gemini://example.com/robots.txt#GUS

A review of the logs would reveal that GUS is a robot, and the text "GUS" could be placed in the User-agent: field to control it. It SHOULD be the text the robot would recognize in robots.txt.

One clarification: a request like

	gemini://example.com/robots.txt#foo%20bot

would correspond to

	User-agent: foo bot

but a robot ID SHOULD NOT contain spaces---it SHOULD be one word.

Anyway, those are my ideas.

  -spc
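[Sean's first suggestion is straightforward to implement with Python's ipaddress module. A sketch of the matching logic, with an invented function name; note the original message's "fde7:a680:47d3/48" needs a "::" to be a valid IPv6 prefix, as used here.]

```python
import ipaddress

def ip_agent_matches(agent_field, client_ip):
    """Return True if a User-agent field written as an IP address or
    CIDR range (per suggestion 1 above) covers client_ip; False if the
    field is an ordinary named user-agent or does not match."""
    try:
        network = ipaddress.ip_network(agent_field, strict=False)
    except ValueError:
        return False  # not an IP-style field, e.g. "GUS" or "webproxy"
    try:
        # Membership across IP versions (v4 client vs v6 rule) is False.
        return ipaddress.ip_address(client_ip) in network
    except ValueError:
        return False  # client_ip is not a valid address

print(ip_agent_matches("172.16.89.3", "172.16.89.3"))    # True
print(ip_agent_matches("172.17.24.0/27", "172.17.24.9")) # True
print(ip_agent_matches("172.17.24.0/27", "172.17.25.9")) # False
print(ip_agent_matches("GUS", "172.17.24.9"))            # False
```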
On Tue, Mar 24, 2020 at 05:35:08PM -0400, Sean Conner wrote:

> Two possible solutions for robot identification:
>
> 1) Allow IP addresses to be used where a user-agent would be specified.
> Some examples:
>
> 	User-agent: 172.16.89.3
> 	User-agent: 172.17.24.0/27
> 	User-agent: fde7:a680:47d3::/48
>
> Yes, I'm including CIDR (Classless Inter-Domain Routing) notation to
> specify a range of IP addresses. For a robot, if its IP address matches
> one of these addresses (or ranges), then it needs to follow the rules
> given for it.

Hmm, I'm not a huge fan of this idea (although I recognise it as a valid technical solution to the problem at hand, which is perhaps all you meant it to be). Mostly because I don't like to encourage people to think of IP addresses as permanently mapping to, well, just anything. The address of a VPN running an abusive bot today might be handed out to a different customer running a well-behaved bot next year.

> 2) Use the fragment portion of a URL to designate a robot. The fragment
> portion of a URL has no meaning for a server (it does for a client). A
> robot could use this fact to slip in its identifier when making a
> request. The server MUST NOT use this information, but the logs could
> show it. For example, a robot could request:
>
> 	gemini://example.com/robots.txt#GUS
>
> A review of the logs would reveal that GUS is a robot, and the text
> "GUS" could be placed in the User-agent: field to control it. It SHOULD
> be the text the robot would recognize in robots.txt.

Hmm, nice out-of-the-box thinking. Since the suggestion has come from you I will assume it does not violate the letter of any RFCs, even though I can't shake a strange feeling that this is "abusing" the fragment concept a little...

Cheers,
Solderpunk
It was thus said that the Great solderpunk once stated:

> On Tue, Mar 24, 2020 at 05:35:08PM -0400, Sean Conner wrote:
>
> > 1) Allow IP addresses to be used where a user-agent would be
> > specified. [...]
>
> Hmm, I'm not a huge fan of this idea (although I recognise it as a valid
> technical solution to the problem at hand, which is perhaps all you
> meant it to be).

Pretty much.

> Mostly because I don't like to encourage people to
> think of IP addresses as permanently mapping to, well, just anything.
> The address of a VPN running an abusive bot today might be handed out to
> a different customer running a well-behaved bot next year.

Fair enough. I'm just throwing out ideas here.

> > 2) Use the fragment portion of a URL to designate a robot. [...]
>
> Hmm, nice out-of-the-box thinking. Since the suggestion has come from
> you I will assume it does not violate the letter of any RFCs, even
> though I can't shake a strange feeling that this is "abusing" the
> fragment concept a little...

Well ...
it's skating right up to the line, and may be going over it a bit. RFC 3986 says this about fragments:

	The fragment identifier component of a URI allows indirect
	identification of a secondary resource by reference to a primary
	resource and additional identifying information. The identified
	secondary resource may be some portion or subset of the primary
	resource, some view on representations of the primary resource, or
	some other resource defined or described by those representations.

... and so on. An argument could be made that a request like:

	gemini://example.com/robots.txt#Foobot

could apply, as it is "referencing" the "Foobot" section of robots.txt, but such a claim would only be applicable to /robots.txt and not to other resources on the server. Perhaps this could just be limited to references to /robots.txt?

So yes, on the line here. And yes, it's "abusing" the fragment concept a little ... but other than these two methods, how else would one identify a robot on Gemini?

  -spc
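[Sean's fragment trick can be seen from the client side with Python's urllib.parse: the fragment rides along in the URL a robot sends, but splitting it off recovers the resource actually being identified. The robot name here is, as throughout, hypothetical.]

```python
from urllib.parse import urldefrag

# A hypothetical robot tags its robots.txt request with its own name.
tagged = "gemini://example.com/robots.txt#GUS"

# Splitting off the fragment recovers the actual resource; the fragment
# itself is the robot's self-identification, visible in server logs.
resource, robot_id = urldefrag(tagged)
print(resource)  # gemini://example.com/robots.txt
print(robot_id)  # GUS
```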