Hi

> How the server produces responses to robots.txt requests is an
> implementation detail. robots.txt can easily be implemented such that
> the server responds with access information provided by files in
> subdirectories. For example: a system directory corresponding to
> /~somebody/ contains a file named ".disallow" containing
> "personal.gmi". When the server builds a response to /robots.txt, it
> considers the content of all ".disallow" files and includes Disallow
> lines corresponding to their content. This way, individual users on a
> multi-user system can decide for themselves the access policy for
> their content without shared access to a canonical robots.txt.

Note that the Apache people worry about just doing a stat() for
.htaccess along a path. This proposal requires an opendir() for *every*
directory in the exported hierarchy. I concede that this isn't
impossible - but it is potentially expensive, messy or nonstandard (and
yes, there are inotify tricks, or you could serve the entire site out
of a database, but neither is a common thing). A rough sketch of such
an aggregator is in the P.S. below.

> > I have pitched this idea before: I think a footer containing
> > the license/rules under which a page can be distributed/cached
> > is more sensible than robots.txt. This approach is:
> >
> > * local to the page (no global /robots.txt)
> > * persistent (survives being copied, mirrored & re-exported)
> > * sound (one knows the conditions under which this can be
> >   redistributed)
>
> What if my document is a binary file of some sort that I cannot add
> a footer to? The only ways to address this consistently for all
> document types are to
>
> a) Include the information in the response, *distinct* from its body
> b) Provide the information in a sidecar file or sideband
>    communication channel

So I think this is the interesting bit of the discussion - the
tradeoff between keeping this information inside the file and keeping
it in a sidechannel.

You are of course correct that not every file format permits embedding
such information, and that is one side of the tradeoff... the other
side is the argument for persistence: having the data in another file
(or in a protocol header) means that it is likely to be lost.

And my view is that caching/archiving/aggregating/protocol translation
all involve making copies, where a careless or inconsiderate
intermediary is likely to discard any information not embedded in the
file itself.

For instance, if a web frontend serves

  gemini://example.org/private.gmi

as

  https://example.com/gemini/example.org/private.gmi

how good are the odds that this frontend fetches
gemini://example.org/robots.txt, rewrites the URLs in there from
/private.gmi to /gemini/example.org/private.gmi, and merges them into
its own /robots.txt? And does it before any crawler request is made...
(The second sketch in the P.S. shows what that rewriting would take.)

A pragmatist's argument: the web and geminispace are a graph of links,
and all the interior nodes have to be markup, so those are covered -
and they control reachability, since without a link you can't reach a
terminal/leaf node. Even if this is bypassed (robots.txt isn't really
a defence against hotlinking either), most other terminal nodes are
images or video, which typically have ways of carrying meta
information (EXIF, etc.).

regards

marc
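
P.S. For concreteness, a minimal sketch in Python of the ".disallow"
aggregation quoted above. The content root, the file name and the
output format are my assumptions, not anything an existing server
does; the point is mostly to make the cost visible - every /robots.txt
request walks (and hence opendir()s) the whole exported hierarchy
unless the result is cached.

    #!/usr/bin/env python3
    # Sketch: build a /robots.txt body by collecting ".disallow" files
    # from every directory under a hypothetical exported content root.
    import os

    CONTENT_ROOT = "/var/gemini"   # assumption, not a real server's path

    def build_robots_txt(root=CONTENT_ROOT):
        lines = ["User-agent: *"]
        # os.walk reads every directory in the hierarchy on each call,
        # which is exactly the expense discussed above.
        for dirpath, dirnames, filenames in os.walk(root):
            if ".disallow" not in filenames:
                continue
            rel = os.path.relpath(dirpath, root)
            prefix = "" if rel == "." else "/" + rel.replace(os.sep, "/")
            with open(os.path.join(dirpath, ".disallow")) as f:
                for entry in f:
                    entry = entry.strip()
                    if entry:
                        lines.append("Disallow: %s/%s" % (prefix, entry))
        return "\n".join(lines) + "\n"

    if __name__ == "__main__":
        print(build_robots_txt())

With the quoted example (/~somebody/.disallow containing
"personal.gmi") this would emit "Disallow: /~somebody/personal.gmi".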
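
P.P.S. And a sketch of the rewriting a considerate web frontend would
have to do before merging an origin's robots.txt into its own. The
/gemini/example.org prefix matches the hypothetical proxy layout
above; fetching the origin file and merging the result are left out.

    # Sketch: prefix Allow/Disallow paths from an origin robots.txt so
    # they are valid under the proxy's own URL namespace.
    def rewrite_robots(origin_robots: str,
                       prefix: str = "/gemini/example.org") -> str:
        out = []
        for line in origin_robots.splitlines():
            key, _, value = line.partition(":")
            if key.strip().lower() in ("allow", "disallow") and value.strip():
                out.append("%s: %s%s" % (key.strip(), prefix, value.strip()))
            else:
                out.append(line)
        return "\n".join(out) + "\n"

    # "Disallow: /private.gmi" becomes
    # "Disallow: /gemini/example.org/private.gmi"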