Hi folks, There is now (finally!) an official reference on the use of robots.txt files in Geminispace. Please see: gemini://gemini.circumlunar.space/docs/companion/robots.gmi I attempted to take into account previous discussions on the mailing list and the currently declared practices of various well-known Gemini bots (broadly construed). I don't consider this "companion spec" to necessarily be finalised at this point, but I am primarily interested in hearing suggestions for change from either authors of software which tries to respect robots.txt who are having problems caused by the current specification, or from server admins who are having bot problems who feel that the current specification is not working for them. The biggest gap that I can currently see is that there is no advice on how often bots should re-query robots.txt to check for policy changes. I could find no clear advice on this for the web, either. I would be happy to hear from people who've written software that uses robots.txt with details on what their current practices are in this respect. Cheers, Solderpunk
It was thus said that the Great Solderpunk once stated:
> Hi folks,
>
> There is now (finally!) an official reference on the use of robots.txt
> files in Geminispace. Please see:
>
> gemini://gemini.circumlunar.space/docs/companion/robots.gmi

Nice.

> I attempted to take into account previous discussions on the mailing
> list and the currently declared practices of various well-known Gemini
> bots (broadly construed).
>
> I don't consider this "companion spec" to necessarily be finalised at
> this point, but I am primarily interested in hearing suggestions for
> change from either authors of software which tries to respect robots.txt
> who are having problems caused by the current specification, or from
> server admins who are having bot problems who feel that the current
> specification is not working for them.

Right now, there are two things I would change.

1. Add "allow". While the initial spec [1] did not have an allow rule, a subsequent draft proposal [2] did, which Google is pushing (as of 2019) to become an RFC [3].

2. I would specify virtual agents as:

    Virtual-agent: archiver
    Virtual-agent: indexer

This makes it easier to add new virtual agents, separates the namespace of agents from the namespace of virtual agents, and is allowed by all current and proposed standards [4].

The rule I would follow is:

Definitions:
    specific user agent is one that is not '*'
    specific virtual agent is one that is not '*'
    generic user agent is one that is specified as '*'
    generic virtual agent is one that is '*'

A crawler should use a block of rules:
    if it finds a specific user agent (most targeted)
    or it finds a specific virtual agent
    or it finds a generic virtual agent
    or it finds a generic user agent (least targeted)

I'm wavering on the generic virtual agent bit, so if you think that makes this too complicated, fine, I think it can go.
> The biggest gap that I can currently see is that there is no advice on
> how often bots should re-query robots.txt to check for policy changes.
> I could find no clear advice on this for the web, either. I would be
> happy to hear from people who've written software that uses robots.txt
> with details on what their current practices are in this respect.

The Wikipedia page [5] lists a non-standard extension "Crawl-delay" which informs a crawler how often they should make requests. It might be easy to add a field saying how often to fetch a resource. A sample file:

    # The GUS agent, plus any agent that identifies as an "indexer" is allowed
    # one path in an otherwise disallowed place, and only fetch items in 10
    # second increments.
    User-agent: GUS
    Virtual-agent: indexer
    Allow: /private/butpublic
    Disallow: /private
    Crawl-delay: 10

    # Agents that fetch feeds, should only grab every 6 hours. "Check" is
    # allowed as agents should ignore fields they don't understand.
    Virtual-agent: feed
    Disallow: /private
    Check: 21600

    # And a fallback. Here we don't allow any old crawler into the private
    # space, and we force them to use 20 seconds between fetches.
    User-agent: *
    Disallow: /private
    Crawl-delay: 20

-spc

[1] gemini://gemini.circumlunar.space/docs/companion/robots.gmi
[2] http://www.robotstxt.org/norobots-rfc.txt
[3] https://developers.google.com/search/reference/robots_txt
[4] Any field not understood by a crawler should be ignored.
[5] https://en.wikipedia.org/wiki/Robots_exclusion_standard
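The block-selection order proposed above can be sketched as follows. This is only an illustration, not code from any existing crawler: the `Virtual-agent` field is the proposal from this message rather than part of any published standard, and the names `Block`, `parse` and `select_block` are hypothetical. Unknown fields such as `Check` are kept in the parsed block and simply ignored by consumers that do not understand them.

```python
from dataclasses import dataclass, field

SAMPLE = """\
# GUS, plus any agent that identifies as an "indexer"
User-agent: GUS
Virtual-agent: indexer
Allow: /private/butpublic
Disallow: /private
Crawl-delay: 10

Virtual-agent: feed
Disallow: /private
Check: 21600

User-agent: *
Disallow: /private
Crawl-delay: 20
"""


@dataclass
class Block:
    user_agents: list = field(default_factory=list)
    virtual_agents: list = field(default_factory=list)
    rules: list = field(default_factory=list)  # (lowercased field, value) pairs


def parse(text):
    """Split robots.txt text into blocks; agent lines open a new block."""
    blocks, current = [], None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        name, value = (part.strip() for part in line.split(":", 1))
        name = name.lower()
        if name in ("user-agent", "virtual-agent"):
            # Consecutive agent lines share one block; an agent line that
            # follows rule lines starts a new one.
            if current is None or current.rules:
                current = Block()
                blocks.append(current)
            if name == "user-agent":
                current.user_agents.append(value.lower())
            else:
                current.virtual_agents.append(value.lower())
        elif current is not None:
            current.rules.append((name, value))
    return blocks


def select_block(blocks, user_agent, virtual_agent):
    """Pick the most targeted block, in the order proposed in the thread."""
    ua, va = user_agent.lower(), virtual_agent.lower()
    tests = (
        lambda b: ua in b.user_agents,      # specific user agent
        lambda b: va in b.virtual_agents,   # specific virtual agent
        lambda b: "*" in b.virtual_agents,  # generic virtual agent
        lambda b: "*" in b.user_agents,     # generic user agent
    )
    for test in tests:
        for block in blocks:
            if test(block):
                return block
    return None
```

With the sample above, `select_block(parse(SAMPLE), "GUS", "archiver")` picks the first block because the specific user agent matches before any virtual agent is considered.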
Feedback: A web portal is a regular user agent, not a robot. Maybe we could normalize robots fetching robots.txt with the query string set to some useful identifying information? This would allow gemini administrators to make bot-specific rules, understand the behavior of their logs, and get in touch with the operator if necessary.
On Sun, Nov 22, 2020 at 6:03 PM Drew DeVault <sir at cmpwn.com> wrote:

> A web portal is a regular user agent, not a robot.

Agreed. However, the spec says "publicly serve the result", and a *public* proxy can pound a Gemini server if a lot of Web clients are accessing it concurrently. It should be able to find out whether the server is robust to such operations or not.

By the same token, a public Gopher proxy (if there are any) should respect "Disallow: gopherproxy".

Other points:

+1 for Allow:
+1 for Virtual-Agent
+1 for ignoring unknown lines
Unsure what the difference is between Crawl-Delay: and Check:, but having a retry delay is a Good Thing

Additionally: "Agent:" should specify a SHA-256 hash of the client cert used by particular crawlers rather than a random easy-to-forge name. Thus GUS should crawl using a cert and publicly post the hash of this cert. Then callers with that cert are necessarily GUS, since the cert itself is not published. (Of course it's still possible for a server to steal GUS's client cert.)

> Maybe we could normalize robots fetching robots.txt with the query
> string set to some useful identifying information? This would allow
> gemini administrators to make bot-specific rules, understand the
> behavior of their logs, and get in touch with the operator if
> necessary.

The trouble is that completely different pages can be returned with different query strings that are entirely unrelated to actual searching, so it's inappropriate to usurp the query string for this purpose. That's not to say that agent control can't rely on the query string.

John Cowan http://vrici.lojban.org/~cowan cowan at ccil.org
Gules six bars argent on a canton azure 50 mullets argent six five six five six five six five and six --blazoning the U.S. flag <http://web.meson.org/blazonserver>
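As a rough sketch of the certificate-hash idea above, a server could hash the DER bytes of the presented client certificate and compare the result against fingerprints that crawler operators have published. The function names and the `published` mapping are hypothetical; in Python the DER bytes would typically come from `ssl.SSLSocket.getpeercert(binary_form=True)`, and whether the hash should cover the whole certificate or only the public key is left open in the thread.

```python
import hashlib
import hmac
from typing import Optional


def fingerprint(der_cert: bytes) -> str:
    """SHA-256 of the certificate's DER encoding, as lowercase hex."""
    return hashlib.sha256(der_cert).hexdigest()


def identify(der_cert: bytes, published: dict) -> Optional[str]:
    """Return the crawler name whose published fingerprint matches, if any.

    `published` maps a crawler name to the hex fingerprint its operator
    has made public (an assumption of this sketch).
    """
    fp = fingerprint(der_cert)
    for name, known in published.items():
        if hmac.compare_digest(fp, known.lower()):
            return name
    return None
```

A match only shows that the client holds the private key for that certificate; as noted in the replies, it does nothing about robots that ignore robots.txt altogether.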
On Sun Nov 22, 2020 at 7:30 PM EST, John Cowan wrote: > Additionally: "Agent:" should specify a SHA-256 hash of the client cert > used by particular crawlers rather than a random easy-to-forge name. > Thus > GUS should crawl using a cert and publicly post the hash of this cert. > Then callers with that cert are necessarily GUS, since the cert itself > is > not published. (Of course it's still possible for a server to steal > GUS's > client cert.) This doesn't seem very useful, as bad robots can simply ignore the rules in robots.txt.
Of course they can: that's always true, as the pre-spec already says. The idea is to give crawlers (etc.) that want to keep to the rules some way to clearly and uniquely identify themselves to servers.

On Sun, Nov 22, 2020 at 7:39 PM Adnan Maolood <me at adnano.co> wrote:

> On Sun Nov 22, 2020 at 7:30 PM EST, John Cowan wrote:
> > Additionally: "Agent:" should specify a SHA-256 hash of the client cert
> > used by particular crawlers rather than a random easy-to-forge name.
> > Thus GUS should crawl using a cert and publicly post the hash of this
> > cert. Then callers with that cert are necessarily GUS, since the cert
> > itself is not published. (Of course it's still possible for a server to
> > steal GUS's client cert.)
>
> This doesn't seem very useful, as bad robots can simply ignore the rules
> in robots.txt.
This looks great! I'm excited to see this companion spec become more formalized, and really like the categorical virtual agent design. One thing that stuck out to me after a first read was the `webproxy` user agent. What would you think of something like the following instead: `proxy` `proxy-web` `proxy-gopher` The prefixed design doesn't have any drawbacks as far as I can tell, and would allow for more intuitively designed blocking/allowing hierarchies. E.g., if there were 4 different types of proxies in use, and you only wanted to allow one, you could be more restrictive with `proxy` and less restrictive with the more precise, suffixed user agent of the type you are okay with (e.g., `proxy-gopher`). Emphasis on "more intuitively designed" - I realize you could technically accomplish this in the current design by simply adding `proxy` to the mix, but I think the prefix-based organization makes it clearer and a bit more intuitive. Warm regards, Natalie
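One way to read the prefixed naming scheme is "longest matching prefix wins". The sketch below assumes hyphen-delimited agent names and a plain dictionary of policies; both the function name and the `"allow"`/`"deny"` values are hypothetical and not part of the companion spec.

```python
def most_specific(agent: str, rules: dict):
    """Return the policy for the longest hyphen-delimited prefix of `agent`.

    For "proxy-gopher" this tries "proxy-gopher" first, then "proxy",
    and finally falls back to the "*" entry if one exists.
    """
    parts = agent.lower().split("-")
    for end in range(len(parts), 0, -1):
        candidate = "-".join(parts[:end])
        if candidate in rules:
            return rules[candidate]
    return rules.get("*")
```

With `rules = {"proxy": "deny", "proxy-gopher": "allow"}`, a Gopher proxy is allowed while every other kind of proxy falls through to the stricter `proxy` entry.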
November 22, 2020 6:02 PM, "Drew DeVault" <sir at cmpwn.com> wrote:

> Feedback:
>
> A web portal is a regular user agent, not a robot.

Just throwing in here for consideration that I agree with Drew, a proxy is not a robot by default. Are we implying that a browser must also follow robots.txt to be well-behaved? If so, I might just block AV-98 from reading my capsule. :)

What I would recommend in lieu of robots.txt proxy rules is normalizing using robots.txt on the web side of a proxy to prevent web spiders from inadvertently crawling gemspace. For instance, proxy.vulpes.one blocks every robot user agent from indexing any part of the site.

Is there any good use case for a proxy User-Agent in robots.txt, other than blocking web spiders from being able to crawl gemspace? If not, I would be in favor of dropping that part of the definition.

Just my two cents,
Robert "khuxkm" Miles
It was thus said that the Great Robert khuxkm Miles once stated:
>
> Is there any good use case for a proxy User-Agent in robots.txt, other than
> blocking web spiders from being able to crawl gemspace? If not, I would be
> in favor of dropping that part of the definition.

I'm in favor of dropping that part of the definition as it doesn't make sense at all. Given a web based proxy at <https://example.com/gemini>, web crawlers will check for <https://example.com/robots.txt> for guidance, not <https://example.com/gemini?gemini.conman.org/robots.txt>. Web crawlers will not be able to crawl gemini space for two main reasons:

1. Most server certificates are self-signed and opt out of the CA business. And even if a crawler were to accept self-signed (or non-standard CA signed) certificates, then---

2. The Gemini protocol is NOT HTTP, so all such HTTP requests will fail anyway.

-spc
November 22, 2020 9:05 PM, "Sean Conner" <sean at conman.org> wrote: > It was thus said that the Great Robert khuxkm Miles once stated: > >> Is there any good usecase for a proxy User-Agent in robots.txt, other than >> blocking web spiders from being able to crawl gemspace? If not, I would be >> in favor of dropping that part of the definition. > > I'm in favor of dropping that part of the definition as it doesn't make > sense at all. Given a web based proxy at <https://example.com/gemini>, web > crawlers will check for <https://example.com/robots.txt> for guidance, not > <https://example.com/gemini?gemini.conman.org/robots.txt>. Web crawlers > will not be able to crawl gemini space for two main reasons: > > 1. Most server certificates are self-signed and opt out of the CA > business. And even if a crawler where to accept self-signed > (or non-standard CA signed) certificates, then--- > > 2. The Gemini protocol is NOT HTTP, so all such HTTP requests will > fail anyway. > > -spc Well, the argument is that the crawler would access <https://example.com/gemini?gemini://gemini.conman.org/>, and from there it could access <https://example.com/gemini?gemini://zaibatsu.circumlunar.space/>, and then <https://example.com/gemini?gemini://gemini.circumlunar.space/>, and so on. However, I'd argue that the onus falls on example.com to set a robots.txt rule in <https://example.com/robots.txt> to prevent web crawlers from indexing anything with their proxy. Just my two cents, Robert "khuxkm" Miles
A web portal is a one-to-one mapping of a user request to a gemini request. It's not an automated process. It's a genuine user agent, an agent of a user. The level of traffic you'd receive from a web portal is similar to the amount of traffic you'd receive from any other user agent, and rate controls or access blocking don't make sense. As the maintainer of such a web portal, I officially NACK any suggestion that it should obey robots.txt, and will not introduce such a feature.
It was thus said that the Great Drew DeVault once stated: > A web portal is a one-to-one mapping of a user request to a gemini > request. It's not an automated process. It's a genuine user agent, an > agent of a user. The level of traffic you'd receive from a web portal is > similar to the amount of traffic you'd receive from any other user > agent, and rate controls or access blocking don't make sense. > > As the maintainer of such a web portal, I officially NACK any suggestion > that it should obey robots.txt, and will not introduce such a feature. What's the IP address of your web portal, so I can block it and prevent the various webbots that will go through your web portal and index the Gemini content without my consent? -spc
November 22, 2020 10:31 PM, "Sean Conner" <sean at conman.org> wrote: > It was thus said that the Great Drew DeVault once stated: > >> A web portal is a one-to-one mapping of a user request to a gemini >> request. It's not an automated process. It's a genuine user agent, an >> agent of a user. The level of traffic you'd receive from a web portal is >> similar to the amount of traffic you'd receive from any other user >> agent, and rate controls or access blocking don't make sense. >> >> As the maintainer of such a web portal, I officially NACK any suggestion >> that it should obey robots.txt, and will not introduce such a feature. > > What's the IP address of your web portal, so I can block it and prevent > the various webbots that will go through your web portal and index the > Gemini content without my consent? > > -spc I assume Drew's smart enough to block web bots from crawling his gemini portal. Just saying. Just my two cents, Robert "khuxkm" Miles
On Sun Nov 22, 2020 at 10:31 PM EST, Sean Conner wrote: > What's the IP address of your web portal, so I can block it and prevent > the various webbots that will go through your web portal and index the > Gemini content without my consent? It's not an indexer. It's a user agent. And its IP address is 173.195.146.137. Dick.
On Sun, Nov 22, 2020 at 10:07 PM Drew DeVault <sir at cmpwn.com> wrote:

> A web portal is a one-to-one mapping of a user request to a gemini
> request. It's not an automated process. It's a genuine user agent, an
> agent of a user.

It is the agent of an arbitrarily large number of users. That's the difference between, say, an email user agent and an email gateway to a non-Internet email system. There is no reason to impose even soft regulation on the former. There is every reason to allow regulation of the latter.

John Cowan http://vrici.lojban.org/~cowan cowan at ccil.org
The experiences of the past show that there has always been a discrepancy between plans and performance. --Emperor Hirohito, August 1945
It was thus said that the Great Robert khuxkm Miles once stated: > November 22, 2020 10:31 PM, "Sean Conner" <sean at conman.org> wrote: > > > It was thus said that the Great Drew DeVault once stated: > > > >> A web portal is a one-to-one mapping of a user request to a gemini > >> request. It's not an automated process. It's a genuine user agent, an > >> agent of a user. The level of traffic you'd receive from a web portal is > >> similar to the amount of traffic you'd receive from any other user > >> agent, and rate controls or access blocking don't make sense. > >> > >> As the maintainer of such a web portal, I officially NACK any suggestion > >> that it should obey robots.txt, and will not introduce such a feature. > > > > What's the IP address of your web portal, so I can block it and prevent > > the various webbots that will go through your web portal and index the > > Gemini content without my consent? > > > > -spc > > I assume Drew's smart enough to block web bots from crawling his gemini > portal. Just saying. > > Just my two cents, Drew's proxy is a webserver in its own right: https://git.sr.ht/~sircmpwn/kineto/tree/master/main.go It checks for a GET request for "/favicon.ico" but not to "/robots.txt". Every other GET request is immediately proxied to a gemini server. I think it was meant to run locally, but he made an instance available on the public Internet. -spc
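To illustrate the missing piece described above, here is a minimal sketch of a portal that answers `/robots.txt` itself, with a blanket disallow, before treating any other path as a Gemini request. It is written in Python for illustration and is not how kineto (a Go program) is implemented; `route` and `PortalHandler` are hypothetical names, and the proxying branch is deliberately left as a stub.

```python
from http.server import BaseHTTPRequestHandler

# Web-side policy: ask every web crawler to stay out of the whole portal.
ROBOTS_TXT = b"User-agent: *\nDisallow: /\n"


def route(path):
    """Decide whether a request is answered locally or forwarded to Gemini."""
    if path.split("?", 1)[0] == "/robots.txt":
        return "local"
    return "proxy"


class PortalHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if route(self.path) == "local":
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(ROBOTS_TXT)))
            self.end_headers()
            self.wfile.write(ROBOTS_TXT)
        else:
            # Forwarding to the Gemini server is out of scope for this sketch.
            self.send_error(501, "proxying not implemented in this sketch")
```

This only helps against web crawlers that honour robots.txt; crawlers that ignore it would still have to be blocked by other means.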
On Sun Nov 22, 2020 at 11:51 PM EST, John Cowan wrote: > It is the agent of an arbitrarily large number of users. So is every other user agent. It will never make more requests than there are users who are asking for content. It is not special.
On 11/23/20 2:30 AM, John Cowan wrote:
>
> By the same token, a public Gopher proxy (if there are any) should
> respect "Disallow: gopherproxy".
>
> Other points:
> +1 for Allow:
> +1 for Virtual-Agent
> +1 for ignoring unknown lines
> Unsure what the difference is between Crawl-Delay: and Check:, but
> having a retry delay is a Good Thing

A small nit-pick: if we use "Virtual-Agent" and "Crawl-Delay", we should at least use "gopher-proxy" instead of "gopherproxy".

--
Emilis Dambauskas
gemini://tilde.team/~emilis/
-1 to Virtual-Agent I think that this is best formalized as an addendum to the existing robots.txt conventions, which simply details a gemini-specific interpretation as such.
Hi

I suppose I am chipping in a bit too late here, but I think the robots.txt thing was always a rather ugly mechanism - a bit of an afterthought.

Consider the gemini://example.com/~somebody/personal.gmi - if somebody wishes to exclude personal.gmi from being crawled they need write access to example.com/robots.txt, and how do we go about making sure that ~somebodyelse, also on example.com, doesn't overwrite robots.txt with their own rules?

Then there is the problem of transitivity - if we have a portal, proxy or archive - how does it relay the information to its downstream users? See also the exchange between Sean and Drew...

So the way I remember it, robots.txt was a quick hack to prevent spiders getting trapped in a maze of cgi generated data, and so hammering the server. It wasn't designed to solve matters of privacy and redistribution.

I have pitched this idea before: I think a footer containing the license/rules under which a page can be distributed/cached is more sensible than robots.txt. This approach is:

* local to the page (no global /robots.txt)
* persistent (survives being copied, mirrored & re-exported)
* sound (one knows the conditions under which this can be redistributed)
On 24.11.2020, marc wrote:
> I suppose I am chipping in a bit too late here, but I think
> the robots.txt thing was always a rather ugly mechanism - a
> bit of an afterthought.

+1 that the robots.txt solution feels a lot like a hack.

> So the way I remember it, robots.txt was a quick hack
> to prevent spiders getting trapped in a maze of
> cgi generated data, and so hammering the server.
> It wasn't designed to solve matters of privacy
> and redistribution.

There is a more modern alternative to robots.txt which is the X-Robots-Tag HTTP header and sounds like what you are trying to do here. That said, there are probably people who will not want special headers to be added [1], although I personally think that something like you suggest would not be that "exploitable". Especially because it is just part of the document's text.

[1] See the first sentence of §2.4 of the Gemini FAQ
gemini://gemini.circumlunar.space/docs/faq.gmi
https://gemini.circumlunar.space/docs/faq.html
On Tue, 24 Nov 2020 11:29:02 +0100 marc <marcx2 at welz.org.za> wrote:

> Consider the gemini://example.com/~somebody/personal.gmi -
> if somebody wishes to exclude personal.gmi from being
> crawled they need write access to example.com/robots.txt,
> and how do we go about making sure that ~somebodyelse,
> also on example.com doesn't overwrite robots.txt with
> their own rules ?

How the server produces responses to robots.txt requests is an implementation detail. robots.txt can easily be implemented such that the server responds with access information provided by files in subdirectories. For example: a system directory corresponding to /~somebody/ contains a file named ".disallow" containing "personal.gmi". When the server builds a response to /robots.txt, it considers the content of all ".disallow" files and includes Disallow lines corresponding to their content. This way, individual users on a multi-user system can decide for themselves the access policy for their content without shared access to a canonical robots.txt.

> I have pitched this idea before: I think a footer containing
> the license/rules under which a page can be distributed/cached
> is more sensible than robots.txt. This approach is:
>
> * local to the page (no global /robots.txt)
> * persistent (survives being copied, mirrored & re-exported)
> * sound (one knows the conditions under which this can be redistributed)

What if my document is a binary file of some sort that I can not add a footer to? The only ways to address this consistently for all document types are to

a) Include the information in the response, *distinct* from its body
b) Provide the information in a sidecar file or sideband communication channel

--
Philip
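The per-directory ".disallow" idea could look roughly like the sketch below. It assumes one path per line in each ".disallow" file, relative to the directory that contains it, and emits a single `User-agent: *` block; the function name `build_robots_txt` is hypothetical and no existing Gemini server is claimed to work this way.

```python
import os


def build_robots_txt(root: str) -> str:
    """Assemble a robots.txt body from .disallow files under `root`."""
    lines = ["User-agent: *"]
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # deterministic traversal order
        if ".disallow" not in filenames:
            continue
        rel = os.path.relpath(dirpath, root)
        # URL prefix of this directory, always with a trailing slash.
        prefix = "/" if rel == "." else "/" + rel.replace(os.sep, "/") + "/"
        with open(os.path.join(dirpath, ".disallow"), encoding="utf-8") as fh:
            for entry in fh:
                entry = entry.strip()
                if entry:
                    lines.append("Disallow: " + prefix + entry.lstrip("/"))
    return "\n".join(lines) + "\n"
```

Because this walks the entire served tree, a real server would probably cache the result rather than rebuild it on every request for /robots.txt.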
Hi,

On Sun, 2020-11-22 at 17:31 +0100, Solderpunk wrote:
> Hi folks,
>
> There is now (finally!) an official reference on the use of
> robots.txt files in Geminispace. Please see:
>
> gemini://gemini.circumlunar.space/docs/companion/robots.gmi

Thanks for this. One change that I'd be interested in is adding a statement that if there is no `robots.txt` for the site, we assume an implicit disallow-all for all the virtual-agents except proxies.

Presumed consent, with opt-outs for the tiny minority of people who have the time and mental space to work out how to get those opt-outs to apply, is standard behaviour on the web, but it's not behaviour I like. GitHub recently dumped code of mine into an arctic vault, for instance; the archive.org snapshots of geminispace have similar dynamics. We can do better by asking people to opt *in* to these kinds of things if they want it, rather than to opt *out* if they don't.

I exclude Virtual-Agent: webproxy here because the likely use of such a proxy is transient, rather than persistent. It seems odd to me that it sits alongside indexing, archival, and research, all of which lead to durable artifacts on success. It does complicate things a little to treat it differently, though.

Thoughts? I appreciate this would impact on the ability of archivists or researchers to capture geminispace, but I see that as a feature, rather than an unfortunate side-effect :).

/Nick
On Tue, Nov 24, 2020 at 6:42 AM Nick Thomas <gemini at ur.gs> wrote: > Thoughts? I appreciate this would impact on the ability of archivists > or researchers to capture geminispace, but I see that as a feature, > rather than an unfortunate side-effect :). I don't agree with archiving being disallowed by default. archive.org and others have saved me so many times, I can't imagine why one would not want an archive. If there is a reason I would much prefer an opt-out system for it. Why do you dislike archival?
Just an FYI on the recent discussion around implied license for search engines and archival: These aren't rules baked into a spec, they're implications of the DMCA in the US and relevant case law, such as BLAKE A. FIELD vs GOOGLE (2006). The existence of a mechanism to disallow indexing was vital to that decision establishing implied license. Search engines, whether they be our lovely friend GUS or some future behemoth, can gather, index, and cache as they see fit because there is a mechanism for you to say no. That mechanism is the robots.txt and they have a strong case saying that the rules which govern it are already well established. As much as I'd love to wave a magic wand and say, "it's all opt-in here" we don't really have any legal footing to do so.
"Drew DeVault" <sir at cmpwn.com> writes: > A web portal is a one-to-one mapping of a user request to a gemini > request. It's not an automated process. It's a genuine user agent, an > agent of a user. I believe the concern is not that a web portal will archive pages, or run on its own as an automated process, but that it will be used by a third-party web bot (i.e., one not run by the owner of the portal) to crawl Gemini sites and index them on the web. > As the maintainer of such a web portal, I officially NACK any > suggestion that it should obey robots.txt, and will not introduce such > a feature. It seems to me that the correct thing is for people that run web portals to have a very strong robots.txt on /their/ web site, and additionally, to be proactive about blocking web bots that don't observe robots.txt. I think people want to block web portals in their Gemini robots.txt because they don't trust web portal authors to do those two things. I understand the feeling, but they're still trusting web portal authors to obey robots.txt, which is honestly more work. -- +-----------------------------------------------------------+ | Jason F. McBrayer jmcbray at carcosa.net | | A flower falls, even though we love it; and a weed grows, | | even though we do not love it. -- Dogen |
On Tue Nov 24, 2020 at 9:06 AM EST, Jason McBrayer wrote: > I believe the concern is not that a web portal will archive pages, or > run on its own as an automated process, but that it will be used by a > third-party web bot (i.e., one not run by the owner of the portal) to > crawl Gemini sites and index them on the web. Aha, this is a much better point. One which should probably be addressed in the robots.txt specification. > It seems to me that the correct thing is for people that run web portals > to have a very strong robots.txt on /their/ web site, and additionally, > to be proactive about blocking web bots that don't observe robots.txt. I > think people want to block web portals in their Gemini robots.txt > because they don't trust web portal authors to do those two things. I > understand the feeling, but they're still trusting web portal authors to > obey robots.txt, which is honestly more work. Web portals are users, plain and simple. Anyone who blocks a web portal is blocking legitimate users who are engaging in legitimate activity. This is a dick move and I won't stand up for anyone who does it. However, the issue of web crawlers hitting geminispace through a web portal is NOT that, and I'm glad you brought it up. I'm going to forbid web crawlers from crawling my gemini portal.
On 11/24/20 1:15 PM, A. E. Spencer-Reed wrote:
> On Tue, Nov 24, 2020 at 6:42 AM Nick Thomas <gemini at ur.gs> wrote:
>
>> Thoughts? I appreciate this would impact on the ability of archivists
>> or researchers to capture geminispace, but I see that as a feature,
>> rather than an unfortunate side-effect :).
> I don't agree with archiving being disallowed by default. archive.org
> and others have saved me so many times, I can't imagine why one would
> not want an archive. If there is a reason I would much prefer an
> opt-out system for it.
> Why do you dislike archival?

Denying archival is already possible with robots.txt in its present form. We don't need to edit the spec for that either. If you want to avoid the internet archive you can use:

    User-agent: ia_archiver
    Disallow: /
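For reference, this is how those two lines are interpreted by Python's standard-library `urllib.robotparser`, used here only as a convenient existing parser. The example.org URL is a placeholder, and "GUS" stands in for any agent not named in the file.

```python
from urllib.robotparser import RobotFileParser

rules = RobotFileParser()
rules.parse([
    "User-agent: ia_archiver",
    "Disallow: /",
])

# The named archiver is shut out of every path...
archiver_ok = rules.can_fetch("ia_archiver", "gemini://example.org/page.gmi")
# ...while an agent that no block applies to is unrestricted.
other_ok = rules.can_fetch("GUS", "gemini://example.org/page.gmi")
```

The second result reflects the web's opt-out default: with no matching block, no rules apply.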
Hi > How the server produces responses to robots.txt requests is an > implementation detail. robots.txt can easily be implemented such that > the server responds with access information provided by files in > subdirectories. For example: a system directory corresponding to > /~somebody/ contains a file named ".disallow" containing > "personal.gmi". When the server builds a response to /robots.txt, it > considers the content of all ".disallow" files and includes Disallow > lines corresponding to their content. This way, individual users on a > multi-user system can decide for themselves the access policy for their > content without shared access to a canonical robots.txt. Note that the apache people worry about just doing a stat() for .htaccess along a path. This proposal requires an opendir() for *every* directory in the exported hierarchy. I concede that this isn't impossible - it is potentially expensive, messy or nonstandard (and yes, there are inotify tricks or serving the entire site out of a database, but that isn't a common thing). > > I have pitched this idea before: I think a footer containing > > the license/rules under which a page can be distributed/cached > > is more sensible than robots.txt. This approach is: > > > > * local to the page (no global /robots.txt) > > * persistent (survives being copied, mirrored & re-exported) > > * sound (one knows the conditions under which this can be redistributed) > > What if my document is a binary file of some sort that I can not add a > footer to? The only ways to address this consistently for all document > types are to > > a) Include the information in the response, *distinct* from its body > b) Provide the information in a sidecar file or sideband communication > channel So I think this is the interesting bit of the discussion - the tradeoff of keeping this information inside the file or in a sidechannel. 
You are of course correct that not every file format permits embedding such information, and that is the one side of the tradeoff.... the other side is the argument for persistence - having the data in another file (or in a protocol header) means that is likely to be lost. And my view is that caching/archiving/aggregating/protocol translation all involve making copies, where a careless or inconsiderate intermediate is likely to discard information not embedded in the file. For instance, if a web frontend serves gemini://example.org/private.gmi as https://example.com/gemini/example.org/private.gmi how good are the odds that this frontend fetches gemini://example.org/robots.txt, rewrites the urls in there from /private.gmi to /gemini/example.org/private.gmi and merges it into its own /robots.txt ? And does it before any crawler request is made... A pragmatist's argument: The web and geminispace are a graph of links, and all the interior nodes have to be markup, so those are covered, and they control the reachability - without a link you can't get to the terminal/leaf node. And even if this is bypassed (robots.txt isn't really a defence against hotlinking either) most other terminal nodes are images or video, which typically have ways of adding meta information (exif, etc). regards marc
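The relaying step asked about above (rewriting a capsule's rules into the web frontend's own URL space) might be sketched like this. The mount point `/gemini` and the `<mount>/<host><path>` layout follow the example URL in the message; the function name is hypothetical, and merging the output of several capsules into one web-side robots.txt is left out.

```python
def rewrite_for_portal(gemini_robots: str, host: str, mount: str = "/gemini") -> str:
    """Re-root Allow/Disallow paths from a capsule's robots.txt under the portal."""
    out = []
    for raw in gemini_robots.splitlines():
        name, sep, value = raw.partition(":")
        path = value.strip()
        if sep and name.strip().lower() in ("allow", "disallow") and path:
            out.append(f"{name.strip()}: {mount}/{host}{path}")
        else:
            # Agent lines, comments, unknown fields and empty rules pass through.
            out.append(raw)
    return "\n".join(out) + "\n"
```

Even with such a rewrite, the frontend would have to fetch the capsule's robots.txt before the first crawler request arrives, which is exactly the timing problem raised in the message.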
(I could be a lot better at using mailing lists. I think this message was sent privately in error). On Tue, 2020-11-24 at 08:15 -0500, A. E. Spencer-Reed wrote: > Why do you dislike archival? Thanks for weighing in! In short, because the purposes to which the archive can be put, and the motives of the archiver, are not clear at time of robots.txt-mediated archival. For myself, I'm happy with some types of archival, and not happy with some other types. Some people would be happy to be included in every archive going; others, in none of them. Given this variability, we must take a stance on what to assume if robots.txt isn't present. I also don't think this variability is amenable to capture with more fine-grained virtual agents. The current internet-draft for robots.txt says, in 2.2.1: > If no group satisfies either condition, or no groups are present at > all, no rules apply. ( https://tools.ietf.org/html/draft-koster-rep-00 ) This is pretty standard on the Web and, entirely coincidentally, a huge boon to Google et al. Importing robots.txt the way we do in the companion specification also imports this line. However, unlike the Web, Gemini "takes user privacy very seriously". Archives *can* be injurious to user privacy - if you need convincing on this point, there are a range of cases and examples around GDPR "right to be forgotten" stuff. From my perspective, Gemini is importing a line from the internet-draft that is directly contrary to its mission. Combining Gemini's mission with that realisation means that if no statement has been made about whether the given user (server operator in this specific case) is OK with their content being archived, the presumption should be that they are not OK with it. We should value user privacy above archiver convenience. In effect, we add a second exception to the protocol that amends 2.2.1 to end "if no rules are specified, this robots.txt file MUST be assumed". 
On a practical level, being excluded from search engines by-default drives the discoverability of robots.txt, and server software could easily include flags like --permit-indexing or --permit-archival to streamline that discoverability. I don't think that opt-in rates would be similar to current opt-out rates on the Web. /Nick
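As a rough sketch of what the opt-in proposal would mean on the crawler side, the Python below substitutes an assumed deny-all file whenever a capsule serves no robots.txt. `DEFAULT_DENY` and `allowed` are hypothetical names, the deny-all content is an assumption about what "this robots.txt file" would contain, and `archiver` is the virtual agent name from the companion spec discussion. Matching is delegated to the standard library's `urllib.robotparser`, which implements web-style rules rather than anything Gemini-specific.

```python
from urllib.robotparser import RobotFileParser

# Assumed default applied when a capsule has no robots.txt (opt-in model).
DEFAULT_DENY = "User-agent: *\nDisallow: /\n"


def allowed(fetched, agent, path):
    """Return True if `agent` may fetch `path`.

    `fetched` is the capsule's robots.txt text, or None when the request
    for /robots.txt did not return one.
    """
    text = fetched if fetched is not None else DEFAULT_DENY
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    # No robots.txt at all: treated as "do not archive".
    print(allowed(None, "archiver", "/index.gmi"))
    # Explicit opt-in: an empty Disallow permits everything for that agent.
    print(allowed("User-agent: archiver\nDisallow:\n", "archiver", "/index.gmi"))
```

Under the current opt-out reading, the `None` case would instead fall through to "no rules apply".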
Nick Thomas wrote: > I don't think that opt-in rates would be similar to current opt-out rates > on the Web. This can probably be summed up with one question: Why do we want a robots.txt in the first place? After all, if there were no reasons against archival et al., we would not need a robots.txt at all. And IMHO this also is the reason why it should rather be an opt-in system.
On Tue, 2020-11-24 at 13:31 +0000, James Tomasino wrote: > > As much as I'd love to wave a magic wand and say, "it's all opt-in > here" we don't really have any legal footing to do so. > James and I talked a bit more about this one on IRC. Key to this argument, AIUI, is how robots.txt (or the lack of it) is treated for FTP, which lacks any mention of it in the spec but has apparently been given weight in DMCA-related rulings involving it. I'm not sure I agree with the reasoning, which goes something like "the robots.txt Internet-Draft is already de-jure part of Gemini, and we can't change that", but IANAL ^^. In particular, I've been thinking about this almost entirely in GDPR terms so far, and have a bunch of DMCA-related reading to do now. In the event that it *is* accurate, we talked about an alternative way to implement the functionality. Rather than having the gemini robots.txt spec say "if the client doesn't receive a robots.txt, it must assume this one", the *server* could be made to return a defined robots.txt response body if it would otherwise issue a 51 response to `/robots.txt` (51 may be too specific, it could be 5x, but I don't *think* it would be appropriate in response to 4x responses, which crawlers would be expected to retry). Of course, any server could do that already today, so the ask is to put a recommendation about it into "server best practice", perhaps incorporating the `--permit-indexing` and `--permit-archiving` flags I talked about in another post. Another advantage of this approach is that it becomes opaque to crawler authors whether the user has explicitly selected a preference or not. I'm also inclined to trust server implementors over crawler implementors. /Nick p.s. there was also some question as to whether someone hosting gemini content was a "gemini user", in the way we use that term on the project homepage. To me, it seems like a reasonable extrapolation, but perhaps it's a topic that deserves more debate or clarification.
On 11/24/20 5:12 PM, Nick Thomas wrote: > On Tue, 2020-11-24 at 13:31 +0000, James Tomasino wrote: >> As much as I'd love to wave a magic wand and say, "it's all opt-in >> here" we don't really have any legal footing to do so. >> > James and I talked a bit more about this one on IRC. Key to this > argument, AIUI, is how robots.txt (or the lack of it) is treated for > FTP, which lacks any mention of it in the spec but has apparently been > given weight in DMCA-related rulings involving it. > > I'm not sure I agree with the reasoning, which goes something like "the > robots.txt Internet-Draft is already de-jure part of Gemini, and we > can't change that", but IANAL ^^. In particular, I've been thinking > about this almost entirely in GDPR terms so far, and have a bunch of > DMCA-related reading to do now. In addition to FTP, gopher adopted the robots.txt standard almost immediately: https://groups.google.com/g/comp.internet.net-happenings/c/Iv8ylGxvoh8?pli=1 You can read the IETF spec for the Robots Exclusion Protocol here: https://tools.ietf.org/html/draft-rep-wg-topic-00 As you'll note in "2.3. Access method", their documentation isn't scheme specific and they even list FTP as a valid option. This is the document that will be used in court by anyone defending an indexer and any exclusion you want to obtain for Gemini would need to happen there. Having a contradictory statement in the Gemini spec will not stand up against the history and precedent of this one. If you want to implement stronger protections in Gemini then I'd suggest adding a note in the best-practices document for server creators to (as Nick suggested) serve a robots.txt if no such file exists with the contents:

    User-agent: *
    Disallow: /

That achieves your aim of block-by-default and the opt-in would be the creation of a robots.txt file of your own.
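A minimal sketch of that server-side fallback, in Python. The function name `robots_response`, the `(status, meta, body)` return shape and the `doc_root` layout are assumptions for illustration, not taken from any particular Gemini server; the fallback body is the deny-all file suggested above.

```python
from pathlib import Path

# Served when the capsule author has not created a robots.txt of their own.
DEFAULT_ROBOTS = "User-agent: *\nDisallow: /\n"


def robots_response(doc_root):
    """Return (status, meta, body) for a request to /robots.txt.

    If the capsule has its own robots.txt, serve it; otherwise respond with
    the block-by-default file instead of a 51 "not found".
    """
    robots = Path(doc_root) / "robots.txt"
    if robots.is_file():
        return 20, "text/plain", robots.read_text(encoding="utf-8")
    return 20, "text/plain", DEFAULT_ROBOTS
```

With this shape, a crawler cannot tell whether the deny-all response came from an explicit file or from the server default.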
On Tue Nov 24, 2020 at 3:07 PM CET, Drew DeVault wrote: > Web portals are users, plain and simple. Anyone who blocks a web portal > is blocking legitimate users who are engaging in legitimate activity. > This is a dick move and I won't stand up for anyone who does it. This has actually long been a bit of a contentious point in the Gopherverse, and we have inherited a bit of the controversy, if I remember much earlier discussions accurately. There are some people (a vocal minority? I'm not sure), who feel that public web proxies exposing their Gopherhole/capsule to the entire browser-using world are negating the agency they exercised in very deliberately putting some content up only on Gopher/Gemini and not the web. Web proxies force them to be visible in (and linkable from) a space that they are actively trying not to participate in. While I am aware of the ultimate futility of trying to control where publically served online content ends up, I have some sympathy for this perspective (perhaps even more so now that we have very nice tools like your own Kineto by which people who *do* want their content to be accessible from a browser can achieve this easily). When the first web portals for Gemini turned up, some people expressed interest in being able to opt out, to keep their Gemini-only content truly Gemini-only, and at least one of those early web portals (portal.mozz.us) agreed to respect those wishes. The webproxy user agent I put into the first robots.txt draft is actually just codifying what portal.mozz.us has already been doing for many months. I did not expect its inclusion to be so controversial. I *did* try to word it carefully so that personal webproxies which, e.g. run on a user's local machine and are not publically accessible need not abide by robots.txt, as those are really just roundabout Gemini clients. Cheers, Solderpunk
I am personally against this idea of forcing (or even normalizing) browsers giving special treatment to a request for a URL based on what the server would normally respond (I'm not even going to entertain the idea of pretending the internet draft doesn't apply to us). This is what I assume it would look like in spec (or best practices, or wherever you want to put it): > When a client makes a request for a URI with a path component of "/robots.txt", and the server would normally respond to such a request with a 51 Not Found status code, it should instead respond with a 20 status code, a MIME type of text/plain, and content of "User-Agent:
On Tue, 2020-11-24 at 19:08 +0000, Robert "khuxkm" Miles wrote: > Doesn't that just *feel* like a hack to you? It definitely feels hackish when worded like this :). The precise technical form is secondary to the outcome (as I see it) of protecting users from a privacy-hostile default in the robots.txt specification. I appreciate that you're currently an opt-out, rather than opt-in, advocate, but I'd still appreciate any ideas you have to make it nicer *if* gemini ends up going for opt-in. An alternative form that just came to mind is a server implementation recommendation like this:

```
Geminispace crawlers use the /robots.txt request path to determine
whether a capsule can be accessed for archival, indexing, research, and
other purposes. This can have privacy implications for the user, so
servers should not start unless they have an explicit signal on how to
handle requests to the /robots.txt path.

For example, this signal may be the availability of any content for the
/robots.txt path, a user-added database entry indicating that the path
should receive a 5x response, or a non-default configuration parameter
specifying that it's OK to skip the check.

If no such signal is present, the server should emit an error message
and either exit immediately, or allow the user to specify how the path
should be handled.
```

As a new server operator with no idea about `robots.txt`, I'd run, say:

```
$ agate [::]:1995 mysite cert.pem key.rsa ur.gs
No robots.txt file present!
Please create mysite/robots.txt, or re-run Agate with --permit-robots to
allow your content to be archived, indexed, or otherwise used by
automated crawlers of Geminispace
```
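A minimal Python sketch of the startup check described in that recommendation. `check_robots_signal` and its `permit_robots` parameter are hypothetical names, and `--permit-robots` is the flag proposed in this message, not an option any existing server is known to provide. Only two of the possible signals are modelled here: an existing robots.txt file and the explicit override.

```python
from pathlib import Path


def check_robots_signal(doc_root, permit_robots=False):
    """Return None if the server may start, otherwise an error message.

    The server has an explicit signal if the operator passed the
    (hypothetical) --permit-robots flag or the capsule has a robots.txt.
    """
    if permit_robots or (Path(doc_root) / "robots.txt").is_file():
        return None
    return (
        "No robots.txt file present!\n"
        f"Please create {doc_root}/robots.txt, or re-run with --permit-robots "
        "to allow your content to be archived, indexed, or otherwise used by "
        "automated crawlers of Geminispace"
    )
```

A server's entry point would call this before binding its socket and exit with the message if it is not `None`.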
On Tue, Nov 24, 2020 at 3:25 PM Nick Thomas <gemini at ur.gs> wrote: > > Of the 362 hosts known to GUS, only 36 have a robots.txt file, so > > any choice made as to what the default robots.txt should be will > > affect around 90% of Geminispace > > Thanks for running the numbers on this. I agree with everything you > said based on them. That any change affects such a large proportion of > existing geminispace is especially worth emphasising. > Why is that a Good Thing? It's another piece of bureaucracy: 90% of hosts were happy to be archived before, so now they have to write a robots.txt file. Although small for any one server operator, it is large when multiplied by the number of servers there *will be*. "Small Internet" does not mean "Internet with only a few servers", AFAIK. Two things about the Internet Archive: 1) It is a U.S. public library, which gives it special rights when it comes to making copies. 2) Though it does not respect robots.txt, it is happy to make your content invisible to archive users by informal request (or, of course, by a DMCA takedown notice). John Cowan http://vrici.lojban.org/~cowan cowan at ccil.org Gules six bars argent on a canton azure 50 mullets argent six five six five six five six five and six --blazoning the U.S. flag <http://web.meson.org/blazonserver>
On 11/24/20 11:44 PM, John Cowan wrote: > 2) Though it does not respect robots.txt, it is happy to make your > content invisible to archive users by informal request (or, of course, > by a DMCA takedown notice). The Internet Archive does respect robots.txt, though they're not happy about it and have written on the subject a few times. I included a snippet in an earlier email with their user-agent.
On Tue, 2020-11-24 at 18:44 -0500, John Cowan wrote: > On Tue, Nov 24, 2020 at 3:25 PM Nick Thomas <gemini at ur.gs> wrote: > > > Thanks for running the numbers on this. I agree with everything you > > said based on them. That any change affects such a large proportion > > of > > existing geminispace is especially worth emphasising. > > > > Why is that a Good Thing? I very intentionally *didn't* say it was a good thing :). There are many ways to interpret the data, but I'm still glad we have it. > It's another piece of bureaucracy: 90% of hosts > were happy to be archived before You're presuming consent here. We don't actually *know* that said 90% of hosts are happy to be archived; we only know that 90% of hosts haven't included a robots.txt file, which could be for any one of a multitude of reasons.
(Received off-list, but I assume it was *meant* for the list, so replying there) On Wed, 2020-11-25 at 00:36 -0500, John Cowan wrote: > > I understand "user privacy" to mean the privacy of people using > clients. > What privacy do server operators expect to have, unless they are > using > client certs, firewalls, or other such blockers? Barring those, they > are > serving content to all the world. Yes, I've got a p.s. somewhere on the list around this potential objection. I don't think that server operators (perhaps better: "capsule authors") have been explicitly ruled in when talking about user privacy in gemini so far; but I also don't think they've been explicitly ruled out - it just hadn't been a live issue until the first archiver showed up and (presumably in response to that) the robots.txt spec was published. I don't find it a stretch at all to see capsule authors as gemini users, but if we were to end up excluding them from the category for some reason, my proposal certainly looks a lot less interesting. Whatever the outcome of the opt-in vs opt-out part of this discussion, the robots.txt spec gives weight to the expressed preferences of capsule authors. Crawler authors are being asked to respect those preferences, and one of the possible motivations for that is a recognition that the privacy of capsule authors is harmed by not respecting their preferences. Saying "I want to be in search indexes but not archives" is likely to be motivated by privacy concerns, and an explicit robots.txt is one way that I, as a capsule author, can expect to have privacy from archives. If it's true for people with an explicit preference, it can also be true for people who haven't expressed a preference yet. Since Gemini has a higher standard for user privacy than the web, it can also have a higher standard for these preferences - one that does not rely on presumed consent - if we want it to. 
> > As I understand it, archive.org does respect robots.txt in general, > > Not since 2018. See < > https://help.archive.org/hc/en-us/articles/360004651732-Using-The-Wayback-Machine>;, > which was updated 5 days ago and says: The FAQ immediately above the one you quoted reads: > Why isn't the site I'm looking for in the archive?* > Some sites may not be included because the automated crawlers were > unaware of their existence at the time of the crawl. It's also > possible that some sites were not archived because they were > password protected, blocked by robots.txt, or otherwise inaccessible > to our automated systems. Site owners might have also requested that > their sites be excluded from the Wayback Machine. If archive.org didn't respect robots.txt at all, it would lend a lot of flavour to the "archiver" virtual user-agent idea in the companion spec, in addition to this discussion. Do you still have doubts after reading this section? /Nick
On Wed, Nov 25, 2020 at 6:32 AM Nick Thomas <gemini at ur.gs> wrote: > (Received off-list, but I assume it was *meant* for the list, so > replying there) > It was, so thanks. My private messages are labeled (Private message) at the top because I make this mistake a lot. > Whatever the outcome of the opt-in vs opt-out part of this discussion, > That's the only part that concerns me. A robots.txt spec is good and crawlers/archivers that respect it are fine too, though of course some won't. I once wrote to the author of a magazine article who had published a simple crawler that it would hammer whatever server it was crawling, since it did not delay between requests or intersperse them with requests to other servers, but simply walked the server's tree depth-first, and that it should respect robots.txt. He wrote back saying "That's the Internet today; deal with it." I could have answered (but I didn't) that hits are a cost to the server operator, and anyone running his dumb crawler was not only DDOSing, but spending my money for his own purposes. But I do think that once robots.txt support is in place, no robots.txt = no expressed preference. > If it's true for people with an explicit preference, it can also be > true for people who haven't expressed a preference yet. Since Gemini > has a higher standard for user privacy than the web, it can also have a > higher standard for these preferences - one that does not rely on > presumed consent - if we want it to. > By this logic, nobody should be able to access a Gemini server at all unless the capsule author has expressed a preference for them to do so. But to publish is to expose your work to the public. > The FAQ immediately above the one you quoted reads: > > > Why isn't the site I'm looking for in the archive?* > > > Some sites may not be included because the automated crawlers were > > unaware of their existence at the time of the crawl. 
It's also > > possible that some sites were not archived because they were > > password protected, blocked by robots.txt, or otherwise inaccessible > > to our automated systems. Site owners might have also requested that > > their sites be excluded from the Wayback Machine. > I interpret that to mean that some sites were not crawled during the period when the Archive was paying attention to robots.txt, and so their content as of that date is unavailable. Note the past tense: "were [...] protected by robots.txt" as opposed to "are protected". > If archive.org didn't respect robots.txt at all, it would lend a lot of > flavour to the "archiver" virtual user-agent idea in the companion > spec, in addition to this discussion. Do you still have doubts after > reading this section? > I have no doubt whatever that the crawler doesn't respect robots.txt. I could do a little experiment, though. John Cowan http://vrici.lojban.org/~cowan cowan at ccil.org The competent programmer is fully aware of the strictly limited size of his own skull; therefore he approaches the programming task in full humility, and among other things he avoids clever tricks like the plague. --Edsger Dijkstra
On Wed, 2020-11-25 at 10:10 -0500, John Cowan wrote: > On Wed, Nov 25, 2020 at 6:32 AM Nick Thomas <gemini at ur.gs> wrote: > > If it's true for people with an explicit preference, it can also be > > true for people who haven't expressed a preference yet. Since > > Gemini > > has a higher standard for user privacy than the web, it can also > > have a > > higher standard for these preferences - one that does not rely on > > presumed consent - if we want it to. > > > > By this logic, nobody should be able to access a Gemini server at all > unless the capsule author has expressed a preference for them to do > so. > But to publish is to expose your work to the public. Browsing, indexing, research crawling, and archiving, are all distinct things with distinct impacts on capsule author privacy. This is why my opening email proposed that we retain presumed consent for browsing via a proxy - it's a clear case of "one of these things is not like the others", and the same is true for individual browsing. This section was mostly aimed at establishing that capsule authors should be thought of as gemini users, so it took some shortcuts on presumed consent verbiage, which might not have been helpful. For clarity: I think it's fine to presume consent for browsing (whether through a proxy or not), and not fine to presume consent for archiving. If adopted, this represents a significant enhancement to capsule author privacy compared to web norms. Presumed consent for indexing, I'm actually fairly marginal about. I do think it's more appropriate to forbid than permit it, but not very strongly. > > > Why isn't the site I'm looking for in the archive?* > > > Some sites may not be included because the automated crawlers > > > were > > > unaware of their existence at the time of the crawl. It's also > > > possible that some sites were not archived because they were > > > password protected, blocked by robots.txt, or otherwise > > > inaccessible > > > to our automated systems. 
Site owners might have also requested > > > that > > > their sites be excluded from the Wayback Machine. > > I interpret that to mean that some sites were not crawled during the > period > when the Archive was paying attention to robots.txt, and so their > content > as of that date is unavailable. Note the past tense: "were [...] > protected by robots.txt" as opposed to "are protected". I don't see any space at all to read it like that, not least due to the references to "password protected" and "otherwise inaccessible" content in exactly the same tense. To me, it's crystal clear that the past tense is used here simply because the crawl happened in the past. I do have this blog post from April 2018, referencing archived blogs from December 2017, where robots.txt being respected is a plot point: https://blog.archive.org/2018/04/24/addressing-recent-claims-of-manipulated-blog-posts-in-the-wayback-machine/ The blog.archive.org rant about robots.txt not being suitable for archivers was April 2017. That's the one that mentions they may not respect robots.txt in-general in the future; I'd really very strongly expect a further blog post to appear if they start taking steps in that direction. It would definitely be interesting if you had an experiment or reference demonstrating that archive.org ignores robots.txt in general, but this page simply isn't it. /Nick
On Tue, 24 Nov 2020 16:16:49 +0100 marc <marcx2 at welz.org.za> wrote: > Note that the apache people worry about just doing a > stat() for .htaccess along a path. This proposal requires an > opendir() for *every* directory in the exported hierarchy. Apache is designed to be able to serve large enterprises with high request loads. The cause for their concern seems unlikely to apply to multi-user Gemini hosts. > I concede that this isn't impossible - it is potentially expensive, > messy or nonstandard (and yes, there are inotify tricks or > serving the entire site out of a database, but that isn't a > common thing). It's very much a matter of implementation. For example, if high performance is a concern you can regenerate the information once per minute rather than on a per-request basis, or on request from the users, via a Gemini endpoint. That's however a good argument for an Allow directive corresponding to Disallow, to be able to disallow by default and only allowing resources lower down in the hierarchy explicitly, which allows for a "better safe than sorry" approach to "prevent" a crawler from picking up resources before the new robot rules have been picked up. > So I think this is the interesting bit of the discussion - > the tradeoff of keeping this information inside the file or > in a sidechannel. You are of course correct that not every > file format permits embedding such information, and that > is the one side of the tradeoff.... the other side is > the argument for persistence - having the data in another > file (or in a protocol header) means that is likely to be > lost. What you're proposing is doubly effective in that data that isn't there
It was thus said that the Great Nick Thomas once stated: > > > For clarity: I think it's fine to presume consent for browsing (whether > through a proxy or not), and not fine to presume consent for archiving. > If adopted, this represents a significant enhancement to capsule author > privacy compared to web norms. The issue with proxying (especially via the web) is the web side. Using a webproxy that runs locally to browse Gemini sites via a browser is fine, but it becomes problematic if said proxy is listening on a public IP address. It's not a matter of *if* but *WHEN* webbots of all types start hitting it, and *those* are a mixture of indexer, archiver, research and other [1]. At the very least, any web proxy should respond to "/robots.txt" and either serve up a file, or have command line options to generate a response to "/robots.txt" or at the very least (or as a default), send this:

    User-agent: *
    Disallow: /

This is the crux of the disagreement between myself and Drew---I didn't explain my concerns very well, and he didn't pick up on the actual issue I had (so my fault here). A web proxy can inadvertently allow indexers, archivers, researchers and others access to Gemini content. -spc [1] Indexers, archivers and research bots tend to respect robots.txt. It's the "other" class that don't. These "other" bots are typically looking for exploits and there's not much you can do about these other than outright ban the IP they're coming from [2]. [2] And even then it's a game of "whack-a-mole", although if a web proxy sees a bunch of requests from a single IP address that result in a bunch of "not found" errors from Gemini (say, a threshold of 10 such results in a row) then that IP is automatically banned for a period of time (say, 48 hours---enough to let it finish its job, but not forever since the list of IPs will grow).
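The temporary-ban idea in footnote [2] can be sketched in Python as follows. The class name `NotFoundBanList` and its methods are hypothetical; the threshold of 10 consecutive "not found" results and the 48-hour ban are the example figures from the footnote, and 51 is the Gemini "not found" status. The clock is injectable only so that the behaviour can be exercised without waiting.

```python
import time


class NotFoundBanList:
    """Ban an IP for a while after a run of consecutive not-found results."""

    def __init__(self, threshold=10, ban_seconds=48 * 3600, clock=time.monotonic):
        self.threshold = threshold
        self.ban_seconds = ban_seconds
        self.clock = clock
        self.streaks = {}       # ip -> consecutive not-found count
        self.banned_until = {}  # ip -> time at which the ban expires

    def is_banned(self, ip):
        expiry = self.banned_until.get(ip)
        if expiry is None:
            return False
        if self.clock() >= expiry:
            # Ban has run out; forget it so the list does not grow forever.
            del self.banned_until[ip]
            return False
        return True

    def record(self, ip, gemini_status):
        """Call once per proxied request with the upstream Gemini status."""
        if gemini_status == 51:
            self.streaks[ip] = self.streaks.get(ip, 0) + 1
            if self.streaks[ip] >= self.threshold:
                self.banned_until[ip] = self.clock() + self.ban_seconds
                del self.streaks[ip]
        else:
            # Any other result breaks the run of "not found" responses.
            self.streaks.pop(ip, None)
```

A proxy would check `is_banned(ip)` before forwarding a request and call `record(ip, status)` after each upstream response.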
On Wed, Nov 25, 2020 at 12:01 PM Nick Thomas <gemini at ur.gs> wrote: > It would definitely be interesting if you had an experiment or > reference demonstrating that archive.org ignores robots.txt in general, > but this page simply isn't it. Okay, I rediscovered the page I was looking for: <https://webmasters.stackexchange.com/questions/71377/how-to-properly-disallow-the-archive-org-bot-did-things-change-if-so-when/>. Search on that page for "random item on eBay". This test shows that as of May 2017, the IA was supporting robots.txt. I tried this myself, and it shows three crawls, two in 2019 and one in 2020, that agree with what you see ("Unknown item") if you follow the direct link. This agrees with the claim that robots.txt was turned off in 2018 for all sites; however, apparently the IA is not announcing this. My guess (only a guess) is that IA thinks that people who don't want to be archived will start using less reliable mechanisms like blocking IP addresses. Now search farther down the page for "just did a quick test". This shows that as of March 2017 the IA was refusing to display pages put off-limits by robots.txt, consistently with the above. However, when the robots.txt entry was removed, crawls from 2010 through 2017 suddenly appeared! So even in the pre-2018 regime, the IA was crawling the pages but hiding them. John Cowan http://vrici.lojban.org/~cowan cowan at ccil.org If I have seen farther than others, it is because I was looking through a spyglass with my one good eye, with a parrot standing on my shoulder. --"Y"
Gaah, ok, I hit reply instead of reply all... so my messages were sent directly to the people, lol. I'll repost them here: I want to point out that making the assumption that a lack of robots.txt is because servers don't mind they're content being archived is a leap in logic that doesn't actually follow/make sense. A server/user could have just forgotten to put a robots.txt, *or* they could have just not known about it. > A personal example: *I* didn't have a robots.txt on my capsule file until today, but I don't want to be included in archives for various reasons. Presuming consent from the lack of a robots.txt file would have incorrectly guessed my preference, and harmed my privacy. Who else in that 90% is like me? We don't know. Exactly! When I first got my server up, I didn't have a robots.txt for the longest time. Some of my content was actually not supposed to be archived because it was dynamic stuff. And other stuff I didn't necessarily want archived. Christian Seibold Sent with [ProtonMail](https://protonmail.com/) Secure Email. ??????? Original Message ??????? On Wednesday, November 25th, 2020 at 9:10 AM, John Cowan <cowan at ccil.org> wrote: > On Wed, Nov 25, 2020 at 6:32 AM Nick Thomas <gemini at ur.gs> wrote: > >> (Received off-list, but I assume it was *meant* for the list, so >> replying there) > > It was, so thanks. My private messages are labeled (Private message) at the top because I make this mistake a lot. > >> Whatever the outcome of the opt-in vs opt-out part of this discussion, > > That's the only part that concerns me. A robots.txt spec is good and crawlers/archivers that respect it are fine too, though of course some won't. > > I once wrote to the author of a magazine article who had published a simple crawler that it would hammer whatever server it was crawling, since it did not delay between requests or intersperse them with requests to other servers, but simply walked the server's tree depth-first. and that it should respect robots.txt. 
He wrote back saying "That's the Internet today; deal with it." I could have answered (but I didn't) that hits are a cost to the server operator, and anyone running his dumb crawler was not only DDOSing, but spending my money for his own purposes. > > But I do think that once robots.txt support is in place, no robots.txt = no expressed preference. > >> If it's true for people with an explicit preference, it can also be >> true for people who haven't expressed a preference yet. Since Gemini >> has a higher standard for user privacy than the web, it can also have a >> higher standard for these preferences - one that does not rely on >> presumed consent - if we want it to. > > By this logic, nobody should be able to access a Gemini server at all unless the capsule author has expressed a preference for them to do so. But to publish is to expose your work to the public. > >> The FAQ immediately above the one you quoted reads: >> >>> Why isn't the site I'm looking for in the archive?* >> >>> Some sites may not be included because the automated crawlers were >>> unaware of their existence at the time of the crawl. It's also >>> possible that some sites were not archived because they were >>> password protected, blocked by robots.txt, or otherwise inaccessible >>> to our automated systems. Site owners might have also requested that >>> their sites be excluded from the Wayback Machine. > > I interpret that to mean that some sites were not crawled during the period when the Archive was paying attention to robots.txt, and so their content as of that date is unavailable. Note the past tense: "were [...] protected by robots.txt" as opposed to "are protected". > >> If archive.org didn't respect robots.txt at all, it would lend a lot of >> flavour to the "archiver" virtual user-agent idea in the companion >> spec, in addition to this discussion. Do you still have doubts after >> reading this section? > > I have no doubt whatever that the crawler doesn't respect robots.txt. 
I could do a little experiment, though. > > John Cowan http://vrici.lojban.org/~cowan cowan at ccil.org > The competent programmer is fully aware of the strictly limited size of his own > skull; therefore he approaches the programming task in full humility, and among > other things he avoids clever tricks like the plague. --Edsger Dijkstra
I'm not sure why Internet Archive matters here. Just because they do something doesn't mean it's the right thing to do. Seems like an appeal to authority to me. Christian Seibold Sent with [ProtonMail](https://protonmail.com/) Secure Email.
One more thing I want to point out... copyright law isn't opt-in. It's opt-out. If you don't have a copyright statement or any other licensing information, then "all rights reserved" is automatically assumed, afaik. You can't just copy something just because the author didn't explicitly disallow you from doing that. Christian Seibold Sent with [ProtonMail](https://protonmail.com/) Secure Email.
Hi Krixano, A few thoughts. First of all, please use plaintext email in future (https://useplaintext.email/#protonmail, as your signature clearly indicated you use protonmail I can give you the direct link). It's easier for a lot of us to access. Regarding your first email: November 26, 2020 1:09 AM, "Krixano" <krixano at protonmail.com> wrote: > I want to point out that making the assumption that a lack of robots.txt > is because servers don't mind they're content being archived is a leap in logic > that doesn't actually follow/make sense. A server/user could have just forgotten > to put a robots.txt, *or* they could have just not known about it. > >> A personal example: *I* didn't have a robots.txt on my capsule file until today, but I don't want > > to be included in archives for various reasons. Presuming consent from the lack of a robots.txt > file would have incorrectly guessed my preference, and harmed my privacy. Who else in that 90% is > like me? We don't know. > > Exactly! When I first got my server up, I didn't have a robots.txt for the longest time. Some of my > content was actually not supposed to be archived because it was dynamic stuff. And other stuff I > didn't necessarily want archived. It may be a leap of logic, sure, but it's the same leap of logic that has been all but codified as law (see court cases like Field v. Google, where judges have determined that not including a robots.txt or no-archive tag grants an implied license to archive). As stated in the case summary of Field v. 
Google: > Author granted operator of Internet search engine implied license to display "cached" links to web pages containing his copyrighted works when author consciously chose not to include no-archive meta-tag on pages of his website, despite knowing that including meta-tag would have informed operator not to display "cached" links to his pages and that absence of meta-tag would be interpreted by operator as permission to allow access to his web pages via "cached" links. Google's "cached" pages system is essentially an archive under a different name. Is it truly a leap of logic if even a court of law comes to the same decision? November 26, 2020 1:12 AM, "Krixano" <krixano at protonmail.com> wrote: > I'm not sure why Internet Archive matters here. Just because they do something doesn't mean > it's the right thing to do. Seems like an appeal to authority to me. It is an appeal to authority, but not a fallacious one. The Internet Archive is (as far as I know) the biggest archiving group on the Internet. If they do something, it's not entirely beyond reason to assume other people do the same. A wide variety of people who do archiving that I've spoken to have the same attitude: they'll still archive it, but they won't make the archive available to the public if they aren't supposed to. November 26, 2020 1:18 AM, "Krixano" <krixano at protonmail.com> wrote: > One more thing I want to point out... copyright law isn't opt-in. It's opt-out. > If you don't have a copyright statement or any other licensing information, > then "all rights reserved" is automatically assumed, afaik. You can't just copy > something just because the author didn't explicitly disallow you from doing that. Again, see above; the law is on the side of the person assuming robots.txt is a system for opting out of indexing/archiving/etc. Just my two cents, Robert "khuxkm" Miles
> Google's "cached" pages system is essentially an archive under a different name. Is it truly a leap of logic if even a court of law comes to the same decision? Yes, the court clearly made a leap in logic. Courts don't always follow logic, because it's not efficient to do so. Btw, the court case is only in the district of Nevada. And I'm honestly surprised by this, considering that you do *not* have to explicitly assert your copyright in order for copyright to apply. It seems this particular court thought caching was an exception, unfortunately. Pretty disgusting. Anyways, if I find any site archiving any of the stuff from my server, I'll be looking into DMCA takedowns, because I don't tolerate utter disrespect for users' content like that. It's disgusting. Christian Seibold Sent with ProtonMail Secure Email.
My emails, afaik, should already be in plaintext. Not sure if something got messed up with a previous email, but anyways.... It shows Plain Text as being selected, so yeah. Christian Seibold Sent with ProtonMail Secure Email.
> > Google's "cached" pages system is essentially an archive under a different name. Is it truly a leap of logic if even a court of law comes to the same decision? > > Yes, the court clearly made a leap in logic. Courts don't always follow logic, > because it's not efficient to do so. Courts don't always follow logic, but they often follow precedent. > Anyways, if I find any site archiving any of the stuff from my server, I'll be > looking into DMCA takedowns, because I don't tolerate utter disrespect for users' > content like that. It's disgusting. And a DMCA takedown notice is a legal measure, which needs a legal footing. And it would get that by using robots.txt files as established in precedent.
The court case (Field v. Google) was only in the district of Nevada. It doesn't apply to all of the US, and it doesn't apply to people outside of the US. Christian Seibold Sent with ProtonMail Secure Email. ------- Original Message ------- On Thursday, November 26th, 2020 at 1:32 AM, Björn Wärmedal <bjorn.warmedal at gmail.com> wrote: > > > Google's "cached" pages system is essentially an archive under a different name. Is it truly a leap of logic if even a court of law comes to the same decision? > > > > Yes, the court clearly made a leap in logic. Courts don't always follow logic, > > > > because it's not efficient to do so. > > Courts don't always follow logic, but they often follow precedent. > > > Anyways, if I find any site archiving any of the stuff from my server, I'll be > > > > looking into DMCA takedowns, because I don't tolerate utter disrespect for users' > > > > content like that. It's disgusting. > > And a DMCA takedown notice is a legal measure, which needs a legal > > footing. And it would get that by using robots.txt files as > > established in precedent.
November 26, 2020 2:24 AM, "Krixano" <krixano at protonmail.com> wrote: >> Google's "cached" pages system is essentially an archive under a different name. Is it truly a leap >> of logic if even a court of law comes to the same decision? > > Yes, the court clearly made a leap in logic. Courts don't always follow logic, > because it's not efficient to do so. > > Btw, the court case is only in the district of Nevada. And I'm honestly surprised > by this, considering that you do *not* have to explicitly assert your copyright > in order for copyright to apply. It seems this particular court thought caching > was an exception, unfortunately. Pretty disgusting. > > Anyways, if I find any site archiving any of the stuff from my server, I'll be > looking into DMCA takedowns, because I don't tolerate utter disrespect for users' > content like that. It's disgusting. > > Christian Seibold Obviously, Björn's counter-argument is correct. The courts follow precedent, and this precedent already exists. The only thing I want to add is: notice how the plaintiff Field didn't appeal. If he truly had a case, like you seem to believe he did, surely he would have appealed? Just my two cents, Robert "khuxkm" Miles
He didn't have a case because courts rule on multiple things, not just one thing. Stop trying to twist information. This is what the court ruled:
--------------------------------------
The District Court, Jones, J., held that:
1.) Operator did not directly infringe on author's copyrighted works;
2.) Author granted operator implied license to display "cached" links to web pages containing his copyrighted works;
3.) Author was estopped from asserting copyright infringement claim against operator;
4.) Fair use doctrine protected operator's use of author's works; and
5.) Search engine fell within protection of safe harbor provision of Digital Millennium Copyright Act (DMCA).
Summary judgment for operator.
The court held that "Field decided to manufacture a claim for copyright infringement against Google in the hopes of making money from Google's standard practice." The court then went on to rule in Google's favor on all of its defense theories.
--------------------------------------
What does this tell us? It tells us that even if he won the implied license, he would have lost the case anyways because Google had Fair Use. Anyways, you're the one who brought up this court case, not me. I don't agree with the court, and I don't have to agree with the court, and neither does any other gemini user. Mind you, the spec isn't for legality, it's for gemini users and what they think. The gemini spec won't affect any legal things at all. Christian Seibold Sent with ProtonMail Secure Email.
November 26, 2020 2:39 AM, "Krixano" <krixano at protonmail.com> wrote: > The court case (Field v. Google) was only in the district of Nevada. It doesn't apply > to all of the US, and it doesn't apply to people outside of the US. A precedent is a precedent is a precedent. A district court is, in fact, a federal court, meaning that any district court in the US could see Field as a precedent they should follow. I'm sure there are other cases like Field v Google that hold the same thing to be true, even in a European court; it's just that Field v Google was brought up earlier in this thread, so it's the one I know about. Just my two cents, Robert "khuxkm" Miles
I never argued it wasn't a precedent. However, it hasn't gone up to the Supreme Court yet, which is the final arbiter for federal concerns. Christian Seibold Sent with ProtonMail Secure Email.
November 26, 2020 2:47 AM, "Krixano" <krixano at protonmail.com> wrote: > He didn't have a case because courts rule on multiple things, not just one thing. > Stop trying to twist information. This is what the court ruled: I'm not trying to twist information. I feel like your argument hinges on him having been able to also successfully argue the fair use angle. > What does this tell us? It tells us that even if he won the implied license, > he would have lost the case anyways because Google had Fair Use. So an archive counts as fair use then. A non commercial archive can use Field as precedent: it's for archival purposes, the work is available for free online, it may be a complete archive but the full work is available for free online, and there's no market for someone's random prose that they make available for free. Ergo, anyone can make an archive of anything they aren't explicitly told not to via robots.txt (at least in the US) and get away with it. > Anyways, you're the one who brought up this court case, not me. I don't agree with > the court, and I don't have to agree with the court, and neither does any other > gemini user. Mind you, the spec isn't for legality, it's for gemini users and what > they think. The gemini spec won't affect any legal things at all. Okay, but "gemini users and what they think" won't matter. The only place to seek relief is a court of law, and the court of law is firmly against you here. While I was drafting this you responded to my other email, so I'll merge the two replies here: November 26, 2020 2:50 AM, "Krixano" <krixano at protonmail.com> wrote: > I never argued it wasn't a precedent. However, it hasn't gone up to the > supreme court yet, who is the final arbiter for federal concerns. Well, if the case never made it to the Supreme Court, then the lower court's ruling stands. Ergo, it's still a precedent and most courts in the US would still follow it. Just my two cents, Robert "khuxkm" Miles
It was thus said that the Great Krixano once stated: > > Exactly! When I first got my server up, I didn't have a robots.txt for the > longest time. Some of my content was actually not supposed to be archived > because it was dynamic stuff. And other stuff I didn't necessarily want > archived. It is weird to think of autonomous agents crawling the Internet, but they exist. They can make requests just as humans (using a program) can make requests. The server has no concept of who or what is behind any given request, and this is especially true for Gemini (as it has no concept of a user-agent identifier being sent). This was a problem with HTTP in the early days as well, and in 1994 (only five years after HTTP was created) an ad-hoc method was developed to help guide autonomous agents in avoiding particular areas that could lead to infinite holes of requests. Yes, it's sad that you had to learn about this the hard way. Yes, the Gemini spec should make mention of the robots.txt standard, and perhaps servers can issue a warning if a robots.txt file is missing. Or perhaps they can include a sample robots.txt file for the end user to modify. I just recently added a sample robots.txt file to my server source code [1]. I first learned of robots.txt in the 90s. I started seeing requests to "/robots.txt" in the logs, and curious about it, found it was an ad-hoc standard to control autonomous agents. I wonder if making an autonomous agent to *just* request /robots.txt, making it show up in logs [2], will do any good. This is how I also found out about /humans.txt [3] (and about a bazillion ways a web server can be exploited, but I digress). -spc [1] https://github.com/spc476/GLV-1.12556 It's under the share directory. But I can see that I should clarify one of the comments in that file, because it will only block autonomous agents that follow robots.txt, as it's advisory and not something that can be automatically enforced. 
[2] I know logging is also pretty contentious in Geminispace. [3] http://humanstxt.org/
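As a rough illustration of the advisory nature noted in footnote [1], here is a minimal sketch of how a well-behaved crawler might consult a robots.txt before making a request. The sample file contents, the agent names `archiver` and `indexer`, the paths, and the helper `is_allowed` are all assumptions made for illustration; this is not the sample file shipped with GLV-1.12556. Parsing relies on Python's standard `urllib.robotparser` module.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical sample robots.txt; the paths and agent names are
# illustrative assumptions, not taken from any real server.
SAMPLE_ROBOTS_TXT = """\
User-agent: archiver
Disallow: /

User-agent: *
Disallow: /cgi-bin/
"""


def is_allowed(robots_txt: str, agent: str, path: str) -> bool:
    """Return True if `agent` may fetch `path` under `robots_txt`.

    The check is purely advisory: it only restrains crawlers that
    choose to call it before making a request.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    for agent, path in [
        ("archiver", "/gemlog/post.gmi"),
        ("indexer", "/gemlog/post.gmi"),
        ("indexer", "/cgi-bin/search"),
    ]:
        allowed = is_allowed(SAMPLE_ROBOTS_TXT, agent, path)
        print(agent, path, "allowed" if allowed else "disallowed")
```

Nothing in this sketch stops a crawler that never calls `is_allowed`, which is exactly the limitation the footnote describes.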
First of all, let's not conflate a spec with law. The spec doesn't have to follow law. A spec is a guideline, it doesn't have to match law, and it doesn't have to be adhered to either.

Secondly, let's actually look at what the court ruled here, on the implied license front:

> consent to use the copyrighted work need not be manifested verbally and may be inferred based on silence where the copyright holder knows of the use and encourages it.

Notice the "where the copyright holder knows of the use and encourages it." That's not necessarily the case in this discussion. It was the case in that court case. That court case literally doesn't apply here. Especially since Field explicitly added code so that search engines would index *the URL* of the page. This is not the case in this discussion as the absence of robots.txt would *not* be explicitly allowing search engines to index the URL of the page, and each server that doesn't have a robots.txt would not "know of the use and encourage it".

Finally, precedents can be challenged by the Supreme Court. For example, the current Supreme Court case of Google v. Oracle dismissed everything the district courts and the Circuits had to say, because the Supreme Court looks at things freshly.

Christian Seibold

Sent with ProtonMail Secure Email.
This conversation is getting away from Gemini, so I'm going to wrap it up here and let us agree to disagree.

November 26, 2020 3:11 AM, "Krixano" <krixano at protonmail.com> wrote:

> First of all, lets not conflate a spec with law. The spec
> doesn't have to follow law. A spec is a guideline, it doesn't
> have to match law, and it doesn't have to be adhered to either.

Okay, but if you want relief for the law being broken (i.e., your copyright being infringed), you have to go in front of a court of law.

> Secondly, let's actually look at what the court ruled here, on the implied license front:
>
>> consent to use the copyrighted work need not be manifested verbally and may be inferred based on
>> silence where the copyright holder knows of the use and encourages it.
>
> Notice the "where the copyright holder knows of the use and encourages it."
> That's not necessarily the case in this discussion. It was the case in that court case.
> That court case literally doesn't apply here. Especially since Field explicitly added code
> so that search engines would index *the URL* of the page. This is not the case in this discussion
> as the absence of robots.txt would *not* be explicitly allowing search engines to index the URL
> of the page, and each server that doesn't have a robots.txt would not "know of the use and
> encourage it".

I don't know where you got the idea that Field added code to make the engine index the URL-- that's what a search engine does-- but I don't care at this point.

> Finally, precedents can be challenged by the Supreme Court. For example, the current Supreme Court
> case of Google v. Oracle dismissed everything the district courts and the Circuits had to say,
> because the Supreme Court looks at things freshly.

Google v Oracle is an *ongoing* case. No precedent was set, because the case never actually came to rest. See the EFF's page on it:

https://www.eff.org/cases/oracle-v-google

Just my two cents,
Robert "khuxkm" Miles
Yes, it's an ongoing case, but I actually read the whole case, and I'm almost 100% positive they are going to rule more in favor of Oracle, because Google made stupid claims, one of which is that software is patentable, btw.

If you want to learn more about this, I would suggest this video series: https://caseorcontroversy.com/

Btw, there *was* precedent set in the lower districts of this case. To say there was no precedent set is misinformation.

Christian Seibold

Sent with ProtonMail Secure Email.
Correction, Google made the case that APIs are patentable. Same difference, but still.

Christian Seibold

Sent with ProtonMail Secure Email.
On 25-Nov-2020 00:18, Nick Thomas wrote: > > You're presuming consent here. We don't actually *know* that said 90% > of hosts are happy to be archived; we only know that 90% of hosts > haven't included a robots.txt file, which could be for any one of a > multitude of reasons. > > *If* a not-insignificant proportion of those hosts without robots.txt > files would actually prefer not to be included in archives when asked, > the current situation is not serving their privacy well, and gemini is > suppose to be protective of user privacy. *If* an overwhelming majority > of them simply don't care, then sure, the argument for it starts to > look a bit niche. Talking in IRC earlier today, I hand-waved a 5% > threshold for the first condition and 1% for the second. > > A personal example: *I* didn't have a robots.txt on my capsule file > until today, but I don't want to be included in archives for various > reasons. Presuming consent from the lack of a robots.txt file would > have incorrectly guessed my preference, and harmed my privacy. Who else > in that 90% is like me? We don't know. > Hello all Personally, I'm not really that interested in the legal arguments back and forth about archiving and access. Yes there are some legal case precedents in this area in some jurisdictions, but I would say that by and large that ship has sailed. Sorry about that folks. The web is the de-facto baseline reference in this respect, whether we like it or not. If you *publish* information on the internet, there *will* be actors who will re-purpose it. Gemini is no different to the web in this. If any of us have information that is to be preserved as private, I cannot see how you can expect that to be achieved if you publish on the public internet (i.e. servers that do not require authentication). If you want to hide something, use authentication or a private channel. Yes there is robots.txt which is an opt-out mechanism, from general robot access to a server's content. 
It is established practice and good actors will respect it. But it cannot be a mechanism to preserve privacy.

My take on the whole "Gemini preserves privacy better" is that it is really about clients. We don't have extended headers, cookies or agent names in requests. So to that extent, client privacy is maintained better than on the web, where the expectation is of long-term, cross-session tracking. Thankfully, we don't have that.

I don't see it as Gemini's role to attempt to set a cultural/legal privacy framework for servers who are choosing to publish on Gemini. We cannot imagine we can break new ground in this respect. We can, however, make an effort to achieve this as a side effect of technical design in the protocol itself, and within the Gemini community we can look out for risks of exposing such personal information via the protocol.

If Gemini ever becomes interesting enough to the outside world that some case goes to court (what a publicity success that would be!), surely the existing infrastructure of public server hypertext systems, namely the web, will be the established precedent.

So I support use of robots.txt, but if none exists, the presumption - like the web - is that access and usage is allowed. If some actor doesn't follow a server's robots.txt, I'm sad about it, but we should ultimately expect it.

- Luke
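The presumption described above (no robots.txt means access is allowed, as on the web) can be sketched as a simple policy switch. This is only a minimal sketch under stated assumptions: the function name `may_crawl`, the `default_allow` parameter, and the convention of passing `None` for a missing file are hypothetical and not part of any Gemini specification. Setting `default_allow=False` models the opt-in alternative argued for elsewhere in this thread.

```python
from typing import Optional
from urllib.robotparser import RobotFileParser


def may_crawl(robots_txt: Optional[str], agent: str, path: str,
              default_allow: bool = True) -> bool:
    """Decide whether `agent` may fetch `path`.

    `robots_txt` is the file's text, or None when the server has no
    robots.txt (a hypothetical convention used by this sketch).
    `default_allow=True` is the web-style opt-out presumption;
    `default_allow=False` would make crawling opt-in.
    """
    if robots_txt is None:
        # No stated policy: fall back to the chosen presumption.
        return default_allow
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)
```

Either default is a one-line change in the crawler; which one is appropriate is a question of community norms, not of code.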
Hello Christian > One more thing I want to point out... copyright law isn't opt-in. It's opt-out. > If you don't have a copyright statement or any other licensing information, > then "all rights reserved" is automatically assumed, afaik. You can't just copy > something just because the author didn't explicitly disallow you from doing that. Yes - copyright legislation hasn't been repealed :-)
My arguments weren't just about privacy. They were also about copyright. Sharing on the internet is fine, but copyright still applies.

Secondly, you can share something for free online for a short period of time, and then remove it after that time limit. This was done with a lot of books during a portion of the Covid pandemic we are in. To say that archives should be able to permanently cache this without explicit permission makes no logical sense.

Anyways, back to my original argument, caching should be opt-in. It makes the most sense.
> *But* by putting things on the web, the creator has granted the world some implied license.

This is not true. The only implied license is to view the thing put online. Redistributing it is not implied by putting something online, and neither is modifying, unless it's under Fair Use (a transformative work).

Christian Seibold

Sent with ProtonMail Secure Email.

------- Original Message -------
On Thursday, November 26th, 2020 at 4:18 AM, marc <marcx2 at welz.org.za> wrote:

> Hello Christian
>
> > One more thing I want to point out... copyright law isn't opt-in. It's opt-out.
> > If you don't have a copyright statement or any other licensing information,
> > then "all rights reserved" is automatically assumed, afaik. You can't just copy
> > something just because the author didn't explicitly disallow you from doing that.
>
> Yes - copyright legislation hasn't been repealed :-)
>
> But by putting things on the web, the creator has granted the
> world some implied license. The convention which has evolved for
> the web is that without a robots.txt forbidding it, crawlers
> are free to index and cache, and some other things too. The
> boundaries of this are fuzzy, because the conditions weren't
> stated at the outset.
>
> But gemini isn't the web, and gemini is new, so maybe we can
> do better and not rely on an implied license (all humans may
> visit this capsule), and then a robots.txt for just one single
> bit of extra information (autonomous software can crawl it too,
> if not forbidden).
>
> So many thoughtful people are hesitant to put their data
> online - they fear that this may disadvantage them in
> future - maybe they worry about employer discrimination, doxxing
> or biometric harvesting (from facial detail to writing style)
> or things not yet invented.
>
> Given that everybody has different tolerances, a mechanism
> whereby people can state their preferences would be a good
> thing.
>
> Blindly copying the web robots.txt mechanism seems to be too
> coarse/too vague, and too easily decoupled.
>
> regards
>
> marc
>
> -- CC-SA
On 11/26/20 10:27 AM, Krixano wrote:
>> *But* by putting things on the web, the creator has granted the
>> world some implied license.
>
> This is not true. The only implied license is to view
> the thing put online. Redistributing it is not implied by putting
> something online, and neither is modifying, unless it's
> under Fair Use (a transformative work).
>
> Christian Seibold

Wow this thread blew up overnight. Anyway, I was the one that first posted about Field v. Google as one example case about litigation related to search engines and copyright. In an effort to avoid more "someone is wrong on the internet" arguments, here's the crux:

- If you as a copyright holder want to deny your content being cached and served by a 3rd party (for instance a search engine) you have a well known mechanism to do so in robots.txt.
- If your content is archived or cached against your desires your means of remediation are legal ones. Taking the issue to court will result in a court deciding if you are within your rights to protect your content or if the searcher/archiver/indexer is under fair use.
- The rules around copyright and media protections are established in each country, but are nearly universally applied worldwide via the Berne Convention and/or agreements like the Electronic Commerce Directive.
- Existing legal precedent suggests you can expect a ruling in favor of implied consent if you do not have a robots.txt.

All of this is to suggest we save ourselves the trouble down the road and just use robots.txt as-is.

Finally, and completely unrelated to everything: it was Oracle who tried to claim their APIs via patent rather than the other way around. See:

https://en.wikipedia.org/wiki/Google_LLC_v._Oracle_America,_Inc.#First_phase:_API_copyrightability_and_patents
Hi > I don't see it as Gemini's role to attempt to set a cultural/legal privacy > framework for servers who are choosing to publish on Gemini. We cannot > imagine we can break new ground in this respect. That seems ... rather defeatist. Alasdair Gray provides an inspirational quote for a situation like this: "Work as if you live in the early days of a better nation" (apparently later he wanted to say world, but nation had stuck...) Gemini is still a young project, where a different culture and nicer norms could be established... Long ago, before the web, when the internet was young somebody grabbed the jokes from rec.humor.funny (I think, might have been another newsgroup) and published them in book form. Some posters were outraged at the copyright violation, others flattered. Had the individual posters just had a way of telling us how their material could have been re-used, there would have been no controversy, and maybe this would have laid the groundwork for a different way of aggregating online material, with internet editors neatly assembling "best-ofs" or "my conversations-with-..." and people optimising their comments for quotability or adding footnotes and expansions to posts they were keen to improve... instead of just feeble likes. TLDR: I can imagine it. regards marc
On 26-Nov-2020 16:24, marc wrote:
>> I don't see it as Gemini's role to attempt to set a cultural/legal privacy
>> framework for servers who are choosing to publish on Gemini. We cannot
>> imagine we can break new ground in this respect.
> That seems ... rather defeatist.
>
> Alasdair Gray provides an inspirational quote for a situation like this:
>
> "Work as if you live in the early days of a better nation"

Well, I wasn't expecting to have my Utopian credentials questioned ;-) After all, I am a proponent of Gemini like everyone else here, pushing against the flow.

But it's true I'm probably towards the pragmatic end of the scale, and I like to see people discussing subjects I find to be productive. Trying to establish alternative IPR legal precedents, contrary to the flow of what happens on the web, seems like a lot of work to me and we can spin a lot of cycles doing so. But if it rings your bell, by all means continue.

I'm all for building a nice culture among the gemini-folk, but wider cultural changes happen slowly in my experience.

- Luke
---