
[SPEC] Encouraging HTTP Proxies to support Gemini hosts self-blacklisting

Sean Conner sean at conman.org

Mon Feb 22 01:43:58 GMT 2021

- - - - - - - - - - - - - - - - - - - 

It was thus said that the Great Mansfield once stated:

I must admit, I'm woefully lacking skill or background with robots.txt. It
seems like it could be a great answer.
A few questions to help me educate myself:
1. How often should that file be referenced by the proxy? It feels like
an answer might be, to check that URL before every request, but that goes
in the direction of some of the negative feedback about the favicon. One
user action - one gemini request and more.

I would say once per "visit" would be good enough (say you have 50 requests to make to a site---check before doing all 50). Checking robots.txt for *every* request is a bit too much.
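Something like the sketch below (Python, with a hypothetical fetch_robots() helper standing in for the actual Gemini request) is roughly what "once per visit" caching could look like:

import time

ROBOTS_TTL = 3600            # seconds before a cached copy is considered stale (arbitrary choice)
_robots_cache = {}           # host -> (fetched_at, parsed_rules)

def robots_for(host):
    # Return the cached rules for this host, fetching robots.txt only
    # when we have no copy or the copy has gone stale.
    now = time.time()
    cached = _robots_cache.get(host)
    if cached is not None and now - cached[0] <= ROBOTS_TTL:
        return cached[1]
    rules = fetch_robots(host)   # hypothetical helper: fetches gemini://host/robots.txt
    _robots_cache[host] = (now, rules)
    return rules

The TTL is an arbitrary choice; the point is just that all 50 requests in the example above reuse one cached copy.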

2. Is 'webproxy' a standard reference to any proxy, or is that something
left to us to decide?

The guide for Gemini [1] says:

Below are definitions of various "virtual user agents", each of which corresponds to a common category of bot. Gemini bots should respect directives aimed at any virtual user agent which matches their activity. Obviously, it is impossible to come up with perfect definitions for these user agents which allow unambiguous categorisation of bots. Bot authors are encouraged to err on the side of caution and attempt to follow the "spirit" of this system, rather than the "letter". If a bot meets the definition of multiple virtual user agents and is not able to adapt its behaviour in a fine grained manner, it should obey the most restrictive set of directives arising from the combination of all applicable virtual user agents.

...

# Web Proxies

Gemini bots which fetch content in order to translate said content into HTML and publicly serve the result over HTTP(S) (in order to make Geminispace accessible from within a standard web browser) should respect robots.txt directives aimed at a User-agent of "webproxy".

So for example, if you are writing a gopher proxy (user makes a gopher request to get to a Gemini site), then you might want to check for "webproxy", even though you aren't actually behind a website but a gopher site. This is kind of a judgement call.
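For instance, a Gemini host that wants to opt out of being mirrored over HTTP(S) entirely could publish a robots.txt like this (a made-up example, not taken from any real site):

User-agent: webproxy
Disallow: /

Other agents are unaffected, since no record matches them.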

3. Are there globbing-like syntax rules for the Disallow field?

No. But it's not a complete literal match either.

Disallow:

will allow *all* requests.

Disallow: /

will not allow any requests at all.

Disallow: /foo

will only disallow paths that *start* with the string '/foo', so '/foo', '/foobar', '/foo/bar/baz/' will all be disallowed.
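In code, the check is nothing more than a prefix test. A minimal sketch (Python, function name is mine):

def disallowed(path, pattern):
    # An empty Disallow matches nothing; anything else is a plain prefix match.
    return pattern != "" and path.startswith(pattern)

disallowed("/foobar", "/foo")    # True  -- starts with /foo
disallowed("/bar/foo", "/foo")   # False -- /foo is not a prefix here
disallowed("/anything", "")      # False -- "Disallow:" allows everything
disallowed("/anything", "/")     # True  -- "Disallow: /" blocks everything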

4. I'm assuming there could be multiple rules that need to be mixed. Is
there a standard algorithm for that process? E.g.:
User-agent: webproxy
Disallow: /a
Allow: /a/b
Disallow: /a/b/c

Allow: isn't in the standard per se, but many crawlers do accept it. And the rules for a user agent are applied in the order they're listed. First match wins.
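A sketch of that ordering rule (Python again, names are mine, not from any standard):

# rules: (kind, prefix) pairs in the order they appear under the user agent,
# where kind is "allow" or "disallow"
def allowed(path, rules):
    for kind, prefix in rules:
        if prefix != "" and path.startswith(prefix):
            return kind == "allow"   # first matching rule decides
    return True                      # nothing matched, so the path is allowed

rules = [("disallow", "/a"), ("allow", "/a/b"), ("disallow", "/a/b/c")]
allowed("/x", rules)      # True  -- no rule matches
allowed("/a/q", rules)    # False -- "Disallow: /a" matches first
allowed("/a/b/q", rules)  # False -- "Disallow: /a" still matches first

Note that under that first-match reading, the Allow: /a/b line in your example never fires, since Disallow: /a matches first; if you want the allow to win, list it before the broader disallow.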

Again - it seems like this could work out really well.
Thanks for helping me learn a bit more!

More about robots.txt in general can be read here [2].

-spc

[1] https://portal.mozz.us/gemini/gemini.circumlunar.space/docs/companion/robots.gmi

[2] http://www.robotstxt.org/robotstxt.html