robots.txt for Gemini formalised

On Sun, Nov 22, 2020 at 6:03 PM Drew DeVault <sir at cmpwn.com> wrote:


> A web portal is a regular user agent, not a robot.
>

Agreed.  However, the spec says "publicly serve the result", and a *public*
proxy can pound a Gemini server if a lot of Web clients are accessing it
concurrently.  It should be able to find out whether the server is robust
to such operations or not.

By the same token, a public Gopher proxy (if there are any) should respect
"Disallow: gopherproxy".

Other points:
+1 for Allow:
+1 for Virtual-Agent
+1 for ignoring unknown lines
Unsure what the difference is between Crawl-Delay: and Check:, but having a
retry delay is a Good Thing
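
To make the "ignore unknown lines" point concrete, here is a minimal sketch
of a tolerant parser.  The field names are the ones floated in this thread;
the exact syntax ("Field: value", case-insensitive field names) and the
helper name parse_robots are my assumptions, not anything the draft fixes.

```python
# Fields discussed in this thread; the set itself is an assumption.
KNOWN_FIELDS = {"agent", "virtual-agent", "allow", "disallow", "crawl-delay"}


def parse_robots(text):
    """Return (field, value) pairs, silently skipping unrecognised lines."""
    rules = []
    for raw in text.splitlines():
        line = raw.strip()
        if ":" not in line:
            continue  # blank or malformed line: ignore it
        field, value = line.split(":", 1)
        field = field.strip().lower()
        if field not in KNOWN_FIELDS:
            continue  # unknown field: ignore it rather than fail
        rules.append((field, value.strip()))
    return rules


sample = """\
Agent: *
Disallow: /private
Allow: /private/readme.gmi
Crawl-Delay: 10
X-Future-Extension: whatever
this line has no colon
"""
rules = parse_robots(sample)
```

The last two sample lines are dropped, so a crawler written against today's
fields keeps working when new ones are added later.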

Additionally:  "Agent:" should specify a SHA-256 hash of the client cert
used by particular crawlers rather than a random easy-to-forge name.  Thus
GUS should crawl using a cert and publicly post the hash of this cert.
Then callers with that cert are necessarily GUS, since the cert itself is
not published.  (Of course it's still possible for a server to steal GUS's
client cert.)
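
A rough sketch of what the hash check could look like on the server side.
I'm assuming the hash is taken over the DER encoding of the whole cert and
published as lowercase hex (optionally colon-separated); none of that is
settled, and the function names are made up for illustration.

```python
import hashlib


def cert_fingerprint(der_bytes):
    """SHA-256 of the DER-encoded certificate, as lowercase hex."""
    return hashlib.sha256(der_bytes).hexdigest()


def agent_matches(der_bytes, agent_value):
    """True if the presented cert hashes to the value on an Agent: line."""
    # Accept "AB:CD:..." as well as "abcd..." (assumed formats).
    normalised = agent_value.replace(":", "").strip().lower()
    return cert_fingerprint(der_bytes) == normalised


# Placeholder bytes standing in for a real DER certificate.
fake_cert = b"placeholder DER bytes, not a real certificate"
published = cert_fingerprint(fake_cert)

same_client = agent_matches(fake_cert, published)
other_client = agent_matches(b"some other certificate", published)
```

A match only means something if the TLS handshake has already shown that
the client holds the private key for the cert it presented.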


> Maybe we could normalize robots fetching robots.txt with the query
> string set to some useful identifying information? This would allow
> gemini administrators to make bot-specific rules, understand the
> behavior of their logs, and get in touch with the operator if
> necessary.
>

The trouble is that completely different pages can be returned with
different query strings that are entirely unrelated to actual searching, so
it's inappropriate to usurp the query string for this purpose.  That's not
to say that agent control can't rely on the query string.



John Cowan          http://vrici.lojban.org/~cowan        cowan at ccil.org
Gules six bars argent on a canton azure 50 mullets argent
six five six five six five six five and six
   --blazoning the U.S. flag <http://web.meson.org/blazonserver>
