robots.txt for Gemini formalised

Hi

> How the server produces responses to robots.txt requests is an
> implementation detail. robots.txt can easily be implemented such that
> the server responds with access information provided by files in
> subdirectories. For example: a system directory corresponding to
> /~somebody/ contains a file named ".disallow" containing
> "personal.gmi". When the server builds a response to /robots.txt, it
> considers the content of all ".disallow" files and includes Disallow
> lines corresponding to their content. This way, individual users on a
> multi-user system can decide for themselves the access policy for their
> content without shared access to a canonical robots.txt.

Note that the Apache people worry about the cost of doing just a
stat() for .htaccess along the request path. This proposal requires an
opendir() for *every* directory in the exported hierarchy.

I concede that this isn't impossible - it is just potentially
expensive, messy or nonstandard (yes, there are inotify tricks, or
you can serve the entire site out of a database, but neither is a
common setup).
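
To make the cost concrete, here is a minimal sketch (Python, purely
illustrative - the ".disallow" file name and the one-path-per-line
format are taken from the proposal above, everything else is my
assumption) of a server assembling /robots.txt from per-directory
files. Note the walk over every directory below the document root:

    # Sketch: build a robots.txt body from ".disallow" files scattered
    # through the served hierarchy. Each ".disallow" lists paths
    # relative to the directory that contains it.
    import os

    def build_robots_txt(doc_root):
        lines = ["User-agent: *"]
        # os.walk() performs the opendir()/readdir() for every directory
        for dirpath, _dirnames, filenames in os.walk(doc_root):
            if ".disallow" not in filenames:
                continue
            rel = os.path.relpath(dirpath, doc_root)
            prefix = "/" if rel == "." else "/" + rel.replace(os.sep, "/") + "/"
            with open(os.path.join(dirpath, ".disallow")) as f:
                for entry in f:
                    entry = entry.strip()
                    if entry:
                        lines.append("Disallow: " + prefix + entry)
        return "\n".join(lines) + "\n"

Caching the result helps, but the server still has to notice when any
".disallow" file changes - which is exactly where the inotify tricks
come in.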

> > I have pitched this idea before: I think a footer containing
> > the license/rules under which a page can be distributed/cached
> > is more sensible than robots.txt. This approach is:
> > 
> > * local to the page (no global /robots.txt)
> > * persistent (survives being copied, mirrored & re-exported)
> > * sound (one knows the conditions under which this can be redistributed)
> 
> What if my document is a binary file of some sort that I can not add a
> footer to? The only ways to address this consistently for all document
> types are to
> 
> a) Include the information in the response, *distinct* from its body
> b) Provide the information in a sidecar file or sideband communication
>    channel

So I think this is the interesting bit of the discussion -
the tradeoff between keeping this information inside the file and
keeping it in a sidechannel. You are of course correct that not every
file format permits embedding such information; that
is one side of the tradeoff. The other side is
the argument for persistence - having the data in another
file (or in a protocol header) means it is likely to be
lost.

And my view is that caching, archiving, aggregating and protocol
translation all involve making copies, where a careless or
inconsiderate intermediary is likely to discard information
not embedded in the file. For instance, if a web frontend
serves gemini://example.org/private.gmi as
https://example.com/gemini/example.org/private.gmi,
how good are the odds that this frontend fetches
gemini://example.org/robots.txt, rewrites the URLs in there
from /private.gmi to /gemini/example.org/private.gmi and
merges them into its own /robots.txt? And does so before
any crawler request is made...
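
For the sake of argument, here is roughly what a conscientious
frontend would have to do (a Python sketch; the /gemini/<host>/ path
layout mirrors the example above, and fetching the upstream file plus
the merge into the frontend's own /robots.txt are left out):

    # Sketch: rewrite an upstream Gemini robots.txt so its Disallow
    # paths are valid under a web frontend that mounts the capsule
    # at /gemini/<host>/ .
    def rewrite_robots(upstream_body, host):
        prefix = "/gemini/" + host
        out = []
        for line in upstream_body.splitlines():
            key, _, value = line.partition(":")
            if key.strip().lower() == "disallow" and value.strip().startswith("/"):
                out.append("Disallow: " + prefix + value.strip())
            else:
                out.append(line)
        return "\n".join(out)

    print(rewrite_robots("User-agent: *\nDisallow: /private.gmi",
                         "example.org"))
    # User-agent: *
    # Disallow: /gemini/example.org/private.gmi

And the frontend would have to redo this whenever the upstream
robots.txt changes, before any crawler asks.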

A pragmatist's argument: the web and geminispace are a graph
of links, and all the interior nodes have to be markup, so those
are covered, and they control reachability - without
a link you can't get to a terminal/leaf node. And even if
that is bypassed (robots.txt isn't really a defence against hotlinking
either), most other terminal nodes are images or video, which typically
have ways of embedding meta information (EXIF, etc).
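
As an illustration of that last point, something like the following
stamps a distribution note into a JPEG's EXIF Copyright tag (a sketch
using the Pillow library; the choice of tag and the wording of the
notice are my assumptions, not an established convention):

    # Sketch: embed a licence/distribution note in a JPEG so it
    # survives copying, mirroring and re-exporting. 0x8298 is the
    # standard EXIF/TIFF "Copyright" tag.
    from PIL import Image

    def stamp_licence(src, dst, notice):
        img = Image.open(src)
        exif = img.getexif()
        exif[0x8298] = notice
        img.save(dst, exif=exif.tobytes())

    stamp_licence("photo.jpg", "photo-tagged.jpg",
                  "CC BY-SA 4.0 - caching and mirroring permitted")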

regards

marc
