robots.txt for Gemini

1. solderpunk (solderpunk (a) SDF.ORG)

Howdy all,

As the first and perhaps most important push toward getting some clear
guidelines in place for well-behaved non-human Gemini clients (e.g.
Gemini to web proxies, search engine spiders, feed aggregators, etc.),
let's get to work on adapting robots.txt to Gemini.

My current thinking is that this doesn't belong in the Gemini spec
itself, much like robots.txt does not belong in the HTTP spec.  That
said, this feels like it warrants something more than just being put in
the Best Practices doc.  Maybe we need to start working on official
"side specs", too.  Not sure what these should be called.

Anyway, I've refamiliarised myself with robots.txt.  Turns out it is
still only a de facto standard without an official RFC.  My
understanding is based on:


* The https://www.robotstxt.org/ website (which Wikipedia calls the
  "Official website" at
  https://en.wikipedia.org/wiki/Robots_exclusion_standard - it's not
  clear to me what "official" means for a de facto standard), and in
  particular:

* An old draft RFC from 1996 which that site hosts at
  https://www.robotstxt.org/norobots-rfc.txt

* A new draft RFC from 2019 which appears to have gotten further than
  the first, considering it is hosted by the IETF at
  https://tools.ietf.org/html/draft-rep-wg-topic-00

While the 1996 draft is web-specific, I was pleasantly surprised to
see that the 2019 version is not.  Section 2.3 says:

> As per RFC3986 [1], the URI of the robots.txt is:
> 
> "scheme:[//authority]/robots.txt"
> 
> For example, in the context of HTTP or FTP, the URI is:
> 
> http://www.example.com/robots.txt
> 
> https://www.example.com/robots.txt
> 
> ftp://ftp.example.com/robots.txt

So, why not Gemini too?

Regarding the first practical question which was raised by Sean's recent
post, it seems a no-brainer to me that Gemini should retain the
convention of there being a single /robots.txt URL rather than having
them per-directory or anything like that.  Which it now seems was the
intended behaviour of GUS all along, so I'm guessing nobody will find
this controversial (but speak up if you do).

https://www.robotstxt.org/robotstxt.html claims that excluding all files
except one from robot access is "currently a bit awkward, as there is no
"Allow" field".  However, both the old and new RFC drafts clearly
mention one.  I am not sure exactly what the ground truth is here, in
terms of how often Allow is used in the wild or to what extent it is
obeyed even by well-intentioned bots.  I would be very happy in principle
to just declare that Allow lines are valid for Gemini robots.txt files,
but if it turns out that popular programming languages have standard
library tools for parsing robots.txt which don't choke on gemini:// URLs
but don't recognise "Allow", this could quickly lead to unintended
consequences, so perhaps it is best to be conservative here.

If anybody happens to be familiar with current practice on the web with
regard to Allow, please chime in.
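For what it's worth, the "exclude everything except one file" case that
robotstxt.org calls awkward is straightforward if Allow is honoured.  A
hypothetical example (file name invented; note that under the drafts'
first-match rule the Allow line must come before the blanket Disallow):

```
User-agent: *
Allow: /open.gmi
Disallow: /
```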

There is the question of caching.  Both RFC drafts for robots.txt make
it clear that standard HTTP caching mechanisms apply to robots.txt, but
Gemini doesn't have an equivalent and I'm not interested in adding one
yet, especially not for the purposes of robots.txt.  And yet, obviously,
some caching needs to take place.  A spider requesting /robots.txt
again and again for every document at a host is generating a lot of
needless traffic.  The 1996 RFC recommends "If no cache-control
directives are present robots should default to an expiry of 7 days",
while the 2019 one says "Crawlers SHOULD NOT use the cached version for
more than 24 hours, unless the robots.txt is unreachable".  My gut tells
me most Gemini robots.txt files will change very infrequently and 7 days
is more appropriate than 24 hours, but I'm happy for us to discuss this.
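A crawler-side cache along those lines is simple enough; here is a rough
sketch in Python (the names and structure are mine, not any existing
bot's), using the 1996 draft's 7-day default expiry:

```python
import time

ROBOTS_TTL = 7 * 24 * 60 * 60  # 7-day default expiry, per the 1996 draft

_robots_cache = {}  # host -> (fetched_at, robots_txt_body)

def get_robots_txt(host, fetch):
    """Return the cached robots.txt body for host, refetching only after
    the TTL has lapsed.  `fetch` is a caller-supplied function that makes
    the actual Gemini request for gemini://host/robots.txt."""
    now = time.time()
    cached = _robots_cache.get(host)
    if cached is None or now - cached[0] > ROBOTS_TTL:
        _robots_cache[host] = (now, fetch(host))
        cached = _robots_cache[host]
    return cached[1]
```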

The biggest question, in my mind, is what to do about user-agents, which
Gemini lacks (by design, as they are a component of the browser
fingerprinting problem, and because they encourage content developers to
serve browser-specific content which is a bad thing IMHO).  The 2019 RFC
says "The product token SHOULD be part of the identification string that
the crawler sends to the service" (where "product token" is a bizarre and
disappointingly commercial alternative term for "user-agent" in
this document), so the fact that Gemini doesn't send one is not
technically a violation.

Of course, a robot doesn't need to *send* its user-agent in order to
*know* its user-agent and interpret robots.txt accordingly.  But it's
much harder for Gemini server admins than their web counterparts to know
exactly which bot is engaging in undesired behaviour and how to address
it.  Currently, the only thing that seems achievable in Gemini is to use
the wildcard user-agent "*" to allow/disallow access by *all* bots to
particular resources.

But not all bots are equal.  I'm willing to bet there are people using
Gemini who are perfectly happy with e.g. the GUS search engine spider
crawling their site to make it searchable via a service which is offered
exclusively within Geminispace, but who are not happy with Gemini to web
proxies accessing their content because they are concerned that
poorly-written proxies will not disallow Google from crawling them so
that Gemini content ends up being searchable within webspace.  This is a
perfectly reasonable stance to take and I think we should try to
facilitate it.

With no Gemini-specific changes to the de facto robots.txt spec, this
would require admins to either manually maintain a whitelist of
Gemini-only search engine spiders in their robots.txt *or* a blacklist
of web proxies.  This is easy today when you can count the number of
either kind on one hand, but it does not scale well and is not a
reasonable thing to expect admins to do in order to enforce a reasonable
stance.

(and, really, this isn't a Gemini-specific problem and I'm surprised
that what I'm about to propose isn't a thing for the web)

I have mentioned previously on this list (quickly, in passing), the idea
of "meta user-agents" (I didn't use that term when I first mentioned
it).  But since there is no way for Gemini server admins to learn the
user-agent of arbitrary bots, we could define a small (I'm thinking ~5
would suffice, surely 10 at most) number of pre-defined user-agents
which all bots of a given kind MUST respect (in addition to optionally
having their own individual user-agent).  A very rough sketch of some
possibilities, not meant to be exhaustive or even very good, just to
give the flavour:


* A user-agent of "webproxy" which must be respected by all web proxies.
  Possibly this could have sub-types for proxies which do and don't
  forbid web search engines?

* A user-agent of "search" which must be respected by all search engine
  spiders

* A user-agent of "research" for bots which crawl a site without making
  specific results of their crawl publicly available (I've thought of
  writing something like this to study the growth of Geminispace and the
  structure of links between documents)

Enumerating actual use cases is probably the wrong way to go about it;
rather, we should think of broad classes of behaviour which differ with
regard to privacy implications - e.g. bots which don't make the results
of their crawling public, bots which make their results public over
Gemini only, bots which breach the Gemini-web barrier, etc.
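Under such a scheme, a server admin with the stance described above
could write something like this (a hypothetical example using the
sketched meta user-agent names):

```
# Allow Gemini-only search engines, block web proxies,
# and keep research crawlers out of a private area.
User-agent: webproxy
Disallow: /

User-agent: search
Allow: /

User-agent: research
Disallow: /private/
```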

Do people think this is a good idea?

Can anybody think of other things to consider in adapting robots.txt to
Gemini?

Cheers,
Solderpunk


2. Krixano (krixano (a) protonmail.com)

You can go to literally any website and append "/robots.txt" to see what 
they use. I've already seen a couple that use "Allow".


Christian Seibold

Sent with ProtonMail Secure Email.



3. Jason McBrayer (jmcbray (a) dorothy.carcosa.net)


solderpunk writes:

> But since there is no way for Gemini server admins to learn the
> user-agent of arbitrary bots, we could define a small (I'm thinking ~5
> would suffice, surely 10 at most) number of pre-defined user-agents
> which all bots of a given kind MUST respect (in addition to optionally
> having their own individual user-agent). A very rough sketch of some
> possibilities, not meant to be exhaustive or even very good, just to
> give the flavour:

I think this is probably the right approach, since it doesn't require
adding user-agents to the protocol.

> * A user-agent of "webproxy" which must be respected by all web
> proxies. Possibly this could have sub-types for proxies which do and
> don't forbid web search engines?

webproxy-bot and webproxy-nobot, perhaps.

> * A user-agent of "search" which must be respected by all search
> engine spiders

> * A user-agent of "research" for bots which crawl a site without
> making specific results of their crawl publically available (I've
> thought of writing something like this to study the growth of
> Geminispace and the structure of links between documents)

Another type I can think of is "archive", for things that rehost
existing gemini content elsewhere on gemini. Besides being another use
case, this category also has the implication that it may make deleted
content available (a la the Wayback Machine).

--
+-----------------------------------------------------------+
| Jason F. McBrayer                    jmcbray at carcosa.net  |
| If someone conquers a thousand times a thousand others in |
| battle, and someone else conquers himself, the latter one |
| is the greatest of all conquerors.  --- The Dhammapada    |
+-----------------------------------------------------------+


4. solderpunk (solderpunk (a) SDF.ORG)

On Sun, Mar 22, 2020 at 10:39:05PM +0000, Krixano wrote:
> You can go to literally any website and append "/robots.txt" to see what 
> they use. I've already seen a couple that use "Allow".

Thanks for that. I've also verified that Python's stdlib function for
parsing robots.txt recognises "Allow", so it seems that this is not as
obscure an option as some sources suggest.  I'd say we may as well
explicitly support this for Gemini.
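To illustrate, Python's urllib.robotparser accepts Allow lines and,
because it only looks at the path component of a URL, is indifferent to
the scheme.  A quick demonstration (the rules themselves are an invented
example; note the parser uses first-match ordering, so Allow must
precede the blanket Disallow):

```python
from urllib.robotparser import RobotFileParser

# Parse an in-memory robots.txt; the parser only cares about paths,
# so the same rules work for gemini:// URLs.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Allow: /public/",
    "Disallow: /",
])

print(rp.can_fetch("*", "gemini://example.com/public/index.gmi"))  # True
print(rp.can_fetch("*", "gemini://example.com/secret.gmi"))        # False
```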

Cheers,
Solderpunk




5. solderpunk (solderpunk (a) SDF.ORG)

On Tue, Mar 24, 2020 at 08:47:53AM -0400, Jason McBrayer wrote:
 
> Another type I can think of is "archive", for things that rehost
> existing gemini content elsewhere on gemini. Besides being another use
> case, this category also has the implication that it may make deleted
> content available (a la the Wayback Machine).

Yes, certainly!  This is actually quite an important use case.  Many
folks are unhappy that archive.org no longer respects robots.txt, while
on the other hand the archive.org folks argue that people were writing
robots.txt rules based on how they wanted search engine robots to act
rather than considering archive bots.  Making it easy for Gemini server
admins to explicitly set different policies for the two kinds of bots
(if they want to!) seems a substantial improvement.

Cheers,
Solderpunk


6. Sean Conner (sean (a) conman.org)

It was thus said that the Great solderpunk once stated:
> The biggest question, in my mind, is what to do about user-agents, which
> Gemini lacks (by design, as they are a component of the browser
> fingerprinting problem, and because they encourage content developers to
> serve browser-specific content which is a bad thing IMHO).  The 2019 RFC
> says "The product token SHOULD be part of the identification string that
> the crawler sends to the service" (where "product token" is bizarre and
> disappointingly commercial alternative terminology for "user-agent" in
> this document), so the fact that Gemini doesn't send one is not
> technically a violation.

  Two possible solutions for robot identification:

1) Allow IP addresses to be used where a user-agent would be specified. 
Some examples:

	User-agent: 172.16.89.3
	User-agent: 172.17.24.0/27
	User-agent: fde7:a680:47d3/48

Yes, I'm including CIDR (Classless Inter-Domain Routing) notation to specify
a range of IP addresses.  And for a robot, if your IP address matches an IP
address (or range), then you need to follow the associated rules.
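Matching a bot's own address against such entries is easy with Python's
ipaddress module; a minimal sketch of the idea (the function name is
mine, and this is Sean's proposal, not standard robots.txt):

```python
import ipaddress

def user_agent_covers_ip(ua_field, bot_ip):
    """Return True if a User-agent field holding an IP address or CIDR
    range covers the bot's address.  Ordinary (non-IP) user-agent
    strings simply never match."""
    try:
        network = ipaddress.ip_network(ua_field, strict=False)
    except ValueError:
        return False  # not an IP-style user-agent
    return ipaddress.ip_address(bot_ip) in network
```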

2) Use the fragment portion of a URL to designate a robot.  The fragment
portion of a URL has no meaning for a server (it does for a client).  A
robot could use this fact to slip in its identifier when making a request. 
The server MUST NOT use this information, but the logs could show it.  For
example, a robot could request:

	gemini://example.com/robots.txt#GUS

A review of the logs would reveal that GUS is a robot, and the text "GUS"
could be placed in the User-agent: field to control it.  It SHOULD be the
text the robot would recognize in robots.txt.  One clarification, this:

	gemini://example.com/robots.txt#foo%20bot

would be 

	User-agent: foo bot

but a robot ID SHOULD NOT contain spaces---it SHOULD be one word.
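Mechanically, a robot adopting this scheme would just append its
identifier as a fragment when building the request URL; a sketch (the
function name is mine):

```python
from urllib.parse import quote

def robots_request_url(host, robot_id):
    """Build the robots.txt request URL with the robot's identifier
    tucked into the fragment, per Sean's proposal.  robot_id SHOULD be
    a single word; anything unsafe is percent-encoded."""
    return f"gemini://{host}/robots.txt#{quote(robot_id)}"
```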

  Anyway, that's my ideas.

  -spc


7. solderpunk (solderpunk (a) SDF.ORG)

On Tue, Mar 24, 2020 at 05:35:08PM -0400, Sean Conner wrote:
 
>   Two possible solutions for robot identification:
> 
> 1) Allow IP addresses to be used where a user-agent would be specified. 
> Some examples:
> 
> 	User-agent: 172.16.89.3
> 	User-agent: 172.17.24.0/27
> 	User-agent: fde7:a680:47d3/48
> 
> Yes, I'm including CIDR (Classless Inter-Domain Routing) notation to specify
> a range of IP addresses.  And for a robot, if your IP address matches an IP
> address (or range), then you need to follow the associated rules.

Hmm, I'm not a huge fan of this idea (although I recognise it as a valid
technical solution to the problem at hand, which is perhaps all you
meant it to be).  Mostly because I don't like to encourage people to
think of IP addresses as permanently mapping to, well, just anything.
The address of a VPN running an abusive bot today might be handed out to
a different customer running a well-behaved bot next year.
 
> 2) Use the fragment portion of a URL to designate a robot.  The fragment
> portion of a URL has no meaning for a server (it does for a client).  A
> robot could use this fact to slip in its identifier when making a request. 
> The server MUST NOT use this information, but the logs could show it.  For
> example, a robot could request:
> 
> 	gemini://example.com/robots.txt#GUS
> 
> A review of the logs would reveal that GUS is a robot, and the text "GUS"
> could be placed in the User-agent: field to control it.  It SHOULD be the
> text the robot would recognize in robots.txt.

Hmm, nice out-of-the-box thinking.  Since the suggestion has come from
you I will assume it does not violate the letter of any RFCs, even
though I can't shake a strange feeling that this is "abusing" the
fragment concept a little...

Cheers,
Solderpunk


8. Sean Conner (sean (a) conman.org)

It was thus said that the Great solderpunk once stated:
> On Tue, Mar 24, 2020 at 05:35:08PM -0400, Sean Conner wrote:
>  
> >   Two possible solutions for robot identification:
> > 
> > 1) Allow IP addresses to be used where a user-agent would be specified. 
> > Some examples:
> > 
> > 	User-agent: 172.16.89.3
> > 	User-agent: 172.17.24.0/27
> > 	User-agent: fde7:a680:47d3/48
> > 
> > Yes, I'm including CIDR (Classless Inter-Domain Routing) notation to specify
> > a range of IP addresses.  And for a robot, if your IP address matches an IP
> > address (or range), then you need to follow the associated rules.
> 
> Hmm, I'm not a huge fan of this idea (although I recognise it as a valid
> technical solution to the problem at hand, which is perhaps all you
> meant it to be). 

  Pretty much.  

> Mostly because I don't like to encourage people to
> think of IP addresses as permanently mapping to, well, just anything.
> The address of a VPN running an abusive bot today might be handed out to
> a different customer running a well-behaved bot next year.

  Fair enough.  I'm just throwing out ideas here.

> > 2) Use the fragment portion of a URL to designate a robot.  The fragment
> > portion of a URL has no meaning for a server (it does for a client).  A
> > robot could use this fact to slip in its identifier when making a request. 
> > The server MUST NOT use this information, but the logs could show it.  For
> > example, a robot could request:
> > 
> > 	gemini://example.com/robots.txt#GUS
> > 
> > A review of the logs would reveal that GUS is a robot, and the text "GUS"
> > could be placed in the User-agent: field to control it.  It SHOULD be the
> > text the robot would recognize in robots.txt.
> 
> Hmm, nice out-of-the-box thinking.  Since the suggestion has come from
> you I will assume it does not violate the letter of any RFCs, even
> though I can't shake a strange feeling that this is "abusing" the
> fragment concept a little...

  Well ... it's skating right up to the line, and may be going over it a
bit.  RFC-3986 says this about fragments:

	The fragment identifier component of a URI allows indirect
	identification of a secondary resource by reference to a primary
	resource and additional identifying information.  The identified
	secondary resource may be some portion or subset of the primary
	resource, some view on representations of the primary resource, or
	some other resource defined or described by those representations.

... and so on.  An argument could be made that a request like:

	gemini://example.com/robots.txt#Foobot

could apply, as it is "referencing" the "Foobot" section of robots.txt, but
such a claim would only be applicable to /robots.txt and not other resources
on the server.  Perhaps this could be just limited to references to
/robots.txt?

  So yes, on the line here.  And yes, it's "abusing" the fragment concept a
little ... but other than these two methods, how else would one identify a
robot on Gemini?

  -spc


---

Previous Thread: Requests for robots.txt

Next Thread: Announcing `spacewalk`