robots.txt for Gemini formalised

1. Solderpunk (solderpunk (a) posteo.net)

Hi folks,

There is now (finally!) an official reference on the use of robots.txt
files in Geminispace.  Please see:

gemini://gemini.circumlunar.space/docs/companion/robots.gmi

I attempted to take into account previous discussions on the mailing
list and the currently declared practices of various well-known Gemini
bots (broadly construed).

I don't consider this "companion spec" to necessarily be finalised at
this point, but I am primarily interested in hearing suggestions for
change from either authors of software which tries to respect robots.txt
who are having problems caused by the current specification, or from
server admins who are having bot problems who feel that the current
specification is not working for them.

The biggest gap that I can currently see is that there is no advice on
how often bots should re-query robots.txt to check for policy changes.
I could find no clear advice on this for the web, either.  I would be
happy to hear from people who've written software that uses robots.txt
with details on what their current practices are in this respect.
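
As an illustration of the re-query question only, here is a minimal Python sketch of a crawler-side cache that re-fetches robots.txt once a time-to-live has passed. The class name `RobotsCache`, the `fetch` callback, and the 24-hour default are all assumptions made for the example; the companion spec gives no figure.

```python
import time


class RobotsCache:
    """Cache robots.txt bodies per host and re-fetch after a TTL.

    The 24 hour default is an arbitrary placeholder, not a recommendation
    taken from any specification.
    """

    def __init__(self, fetch, ttl_seconds=24 * 60 * 60, clock=time.monotonic):
        self._fetch = fetch      # hypothetical callable: host -> robots.txt text
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so the TTL logic can be tested
        self._entries = {}       # host -> (fetched_at, body)

    def get(self, host):
        now = self._clock()
        cached = self._entries.get(host)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]     # still fresh, no network request
        body = self._fetch(host)
        self._entries[host] = (now, body)
        return body
```

A long-running crawler would call `get(host)` before each batch of requests, so a policy change is picked up at most one TTL late.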

Cheers,
Solderpunk

2. Sean Conner (sean (a) conman.org)

It was thus said that the Great Solderpunk once stated:
> Hi folks,
> 
> There is now (finally!) an official reference on the use of robots.txt
> files in Geminispace.  Please see:
> 
> gemini://gemini.circumlunar.space/docs/companion/robots.gmi

  Nice.

> I attempted to take into account previous discussions on the mailing
> list and the currently declared practices of various well-known Gemini
> bots (broadly construed).
> 
> I don't consider this "companion spec" to necessarily be finalised at
> this point, but I am primarily interested in hearing suggestions for
> change from either authors of software which tries to respect robots.txt
> who are having problems caused by the current specification, or from
> server admins who are having bot problems who feel that the current
> specification is not working for them.

  Right now, there are two things I would change.

	1. Add "allow".  While the initial spec [1] did not have an allow
	   rule, a subsequent draft proposal [2] did, which Google is
	   pushing (as of 2019) to become an RFC [3].

	2. I would specify virtual agents as:

		Virtual-agent: archiver
		Virtual-agent: indexer

	   This makes it easier to add new virtual agents, separates the
	   namespace of agents from the namespace of virtual agents, and is
	   allowed by all current and proposed standards [4].

	   The rule I would follow is:

		Definitions:  
			specific user agent is one that is not '*'
			specific virtual agent is one that is not '*'
			generic user agent is one that is specified as '*'
			generic virtual agent is one that is '*'

		A crawler should use a block of rules:

			if it finds a specific user agent (most targeted)
			or it finds a specific virtual agent
			or it finds a generic virtual agent
			or it finds a generic user agent (least targeted)

	   I'm wavering on the generic virtual agent bit, so if you think
	   that makes this too complicated, fine, I think it can go.
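
The precedence rule above can be sketched in Python roughly as follows. This is only one reading of the proposal: the block layout (dicts with `user_agents`, `virtual_agents` and `rules` keys) and the function name `select_block` are assumptions made for the example, and matching is case-sensitive for simplicity.

```python
def select_block(blocks, user_agent, virtual_agents):
    """Return the most targeted block for a crawler, or None.

    blocks         -- list of dicts with "user_agents", "virtual_agents", "rules"
    user_agent     -- the crawler's own name, e.g. "GUS" (not "*")
    virtual_agents -- set of categories the crawler belongs to, e.g. {"indexer"}
    """
    def first(predicate):
        return next((block for block in blocks if predicate(block)), None)

    return (
        # 1. specific user agent (most targeted)
        first(lambda b: user_agent in b["user_agents"])
        # 2. specific virtual agent
        or first(lambda b: virtual_agents & set(b["virtual_agents"]))
        # 3. generic virtual agent
        or first(lambda b: "*" in b["virtual_agents"])
        # 4. generic user agent (least targeted)
        or first(lambda b: "*" in b["user_agents"])
    )
```

Dropping step 3 would remove the generic virtual agent case without touching the other three.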

> The biggest gap that I can currently see is that there is no advice on
> how often bots should re-query robots.txt to check for policy changes.
> I could find no clear advice on this for the web, either.  I would be
> happy to hear from people who've written software that uses robots.txt
> with details on what their current practices are in this respect.

  The Wikipedia page [5] lists a non-standard extension "Crawl-delay" which
informs a crawler how often they should make requests.  It might be easy to
add a field saying how often to fetch a resource.  A sample file:

# The GUS agent, plus any agent that identifies as an "indexer", is allowed
# one path in an otherwise disallowed place, and may only fetch items at
# 10-second intervals.

User-agent: GUS
Virtual-agent: indexer
Allow: /private/butpublic
Disallow: /private
Crawl-delay: 10

# Agents that fetch feeds should only grab every 6 hours.  "Check" is
# allowed as agents should ignore fields they don't understand.

Virtual-agent: feed
Disallow: /private
Check: 21600

# And a fallback.  Here we don't allow any old crawler into the private
# space, and we force them to use 20 seconds between fetches.

User-agent: *
Disallow: /private
Crawl-delay: 20
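
A crawler that understands only some of these fields could read the sample file along the following lines. This is a minimal Python sketch, not a reference parser: the `KNOWN` set, the function name `parse_robots`, and the returned block layout are assumptions made for the example. A crawler that implemented "Check" would simply add it to `KNOWN`.

```python
# Fields this hypothetical crawler understands; everything else is ignored.
KNOWN = {"user-agent", "virtual-agent", "allow", "disallow", "crawl-delay"}


def parse_robots(text):
    """Split robots.txt text into blocks, skipping comments and unknown fields."""
    blocks, current, in_rules = [], None, False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if ":" not in line:
            continue                          # blank or malformed line
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field not in KNOWN:
            continue                          # e.g. "Check" in this sketch
        if field in ("user-agent", "virtual-agent"):
            # An agent line after rule lines starts a new block.
            if current is None or in_rules:
                current = {"user_agents": [], "virtual_agents": [], "rules": []}
                blocks.append(current)
                in_rules = False
            key = "user_agents" if field == "user-agent" else "virtual_agents"
            current[key].append(value)
        elif current is not None:
            current["rules"].append((field, value))
            in_rules = True
    return blocks
```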

  -spc

[1]	gemini://gemini.circumlunar.space/docs/companion/robots.gmi

[2]	http://www.robotstxt.org/norobots-rfc.txt

[3]	https://developers.google.com/search/reference/robots_txt

[4]	Any field not understood by a crawler should be ignored.

[5]	https://en.wikipedia.org/wiki/Robots_exclusion_standard

3. Drew DeVault (sir (a) cmpwn.com)

Feedback:

A web portal is a regular user agent, not a robot.

Maybe we could normalize robots fetching robots.txt with the query
string set to some useful identifying information? This would allow
gemini administrators to make bot-specific rules, understand the
behavior of their logs, and get in touch with the operator if
necessary.

4. John Cowan (cowan (a) ccil.org)

On Sun, Nov 22, 2020 at 6:03 PM Drew DeVault <sir at cmpwn.com> wrote:


> A web portal is a regular user agent, not a robot.
>

Agreed.  However, the spec says "publicly serve the result", and a *public*
proxy can pound a Gemini server if a lot of Web clients are accessing it
concurrently.  It should be able to find out whether the server is robust
to such operations or not.

By the same token, a public Gopher proxy (if there are any) should respect
"Disallow: gopherproxy".

Other points:
+1 for Allow:
+1 for Virtual-Agent
+1 for ignoring unknown lines
Unsure what the difference is between Crawl-Delay: and Check:, but having a
retry delay is a Good Thing

Additionally:  "Agent:" should specify a SHA-256 hash of the client cert
used by particular crawlers rather than a random easy-to-forge name.  Thus
GUS should crawl using a cert and publicly post the hash of this cert.
Then callers with that cert are necessarily GUS, since the cert itself is
not published.  (Of course it's still possible for a server to steal GUS's
client cert.)
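
The fingerprint check described above might look like this on the server side. This is a Python sketch under assumptions: the function names are made up for the example, the certificate is taken to be available as DER bytes (in Python's `ssl` module, `SSLSocket.getpeercert(binary_form=True)` returns that form), and the set of published hashes is whatever the crawler operators have announced.

```python
import hashlib


def cert_fingerprint(der_bytes: bytes) -> str:
    """Return the lowercase hex SHA-256 digest of a DER-encoded certificate."""
    return hashlib.sha256(der_bytes).hexdigest()


def is_known_crawler(der_bytes: bytes, published: set) -> bool:
    """True if the presented client cert matches a published fingerprint."""
    return cert_fingerprint(der_bytes) in published
```

The TLS handshake is what proves the client holds the private key; comparing the hash only tells the server which published identity that key belongs to.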


> Maybe we could normalize robots fetching robots.txt with the query
> string set to some useful identifying information? This would allow
> gemini administrators to make bot-specific rules, understand the
> behavior of their logs, and get in touch with the operator if
> necessary.
>

The trouble is that completely different pages can be returned with
different query strings that are entirely unrelated to actual searching, so
it's inappropriate to usurp the query string for this purpose.  That's not
to say that agent control can't rely on the query string.



John Cowan          http://vrici.lojban.org/~cowan        cowan at ccil.org
Gules six bars argent on a canton azure 50 mullets argent
six five six five six five six five and six
   --blazoning the U.S. flag <http://web.meson.org/blazonserver>

5. Adnan Maolood (me (a) adnano.co)

On Sun Nov 22, 2020 at 7:30 PM EST, John Cowan wrote:
> Additionally: "Agent:" should specify a SHA-256 hash of the client cert
> used by particular crawlers rather than a random easy-to-forge name.
> Thus
> GUS should crawl using a cert and publicly post the hash of this cert.
> Then callers with that cert are necessarily GUS, since the cert itself
> is
> not published. (Of course it's still possible for a server to steal
> GUS's
> client cert.)

This doesn't seem very useful, as bad robots can simply ignore the rules
in robots.txt.

6. John Cowan (cowan (a) ccil.org)

Of course they can: that's always true, as the pre-spec already says.   The
idea is to give crawlers (etc.) that want to keep to the rules some way to
clearly and uniquely identify themselves to servers.

On Sun, Nov 22, 2020 at 7:39 PM Adnan Maolood <me at adnano.co> wrote:

> On Sun Nov 22, 2020 at 7:30 PM EST, John Cowan wrote:
> > Additionally: "Agent:" should specify a SHA-256 hash of the client cert
> > used by particular crawlers rather than a random easy-to-forge name.
> > Thus
> > GUS should crawl using a cert and publicly post the hash of this cert.
> > Then callers with that cert are necessarily GUS, since the cert itself
> > is
> > not published. (Of course it's still possible for a server to steal
> > GUS's
> > client cert.)
>
> This doesn't seem very useful, as bad robots can simply ignore the rules
> in robots.txt.
>

7. Natalie Pendragon (natpen (a) natpen.net)

This looks great! I'm excited to see this companion spec become more
formalized, and really like the categorical virtual agent design. One
thing that stuck out to me after a first read was the `webproxy` user
agent. What would you think of something like the following instead:

`proxy`
`proxy-web`
`proxy-gopher`

The prefixed design doesn't have any drawbacks as far as I can tell,
and would allow for more intuitively designed blocking/allowing
hierarchies. E.g., if there were 4 different types of proxies in use,
and you only wanted to allow one, you could be more restrictive with
`proxy` and less restrictive with the more precise, suffixed user
agent of the type you are okay with (e.g., `proxy-gopher`).

Emphasis on "more intuitively designed" - I realize you could
technically accomplish this in the current design by simply adding
`proxy` to the mix, but I think the prefix-based organization makes it
clearer and a bit more intuitive.

Warm regards,
Natalie

8. Robert "khuxkm" Miles (khuxkm (a) tilde.team)

November 22, 2020 6:02 PM, "Drew DeVault" <sir at cmpwn.com> wrote:

> Feedback:
> 
> A web portal is a regular user agent, not a robot.

Just throwing in here for consideration that I agree with Drew, a proxy is 
not a robot by default. Are we implying that a browser must also follow 
robots.txt to be well-behaved? If so, I might just block AV-98 from 
reading my capsule. :)

What I would recommend in lieu of robots.txt proxy rules is normalizing 
using robots.txt on the web side of a proxy to prevent web spiders from 
inadvertently crawling gemspace. For instance, proxy.vulpes.one blocks 
every robot user agent from indexing any part of the site.

Is there any good usecase for a proxy User-Agent in robots.txt, other than 
blocking web spiders from being able to crawl gemspace? If not, I would be 
in favor of dropping that part of the definition.

Just my two cents,
Robert "khuxkm" Miles

9. Sean Conner (sean (a) conman.org)

It was thus said that the Great Robert khuxkm Miles once stated:
> 
> Is there any good usecase for a proxy User-Agent in robots.txt, other than
> blocking web spiders from being able to crawl gemspace? If not, I would be
> in favor of dropping that part of the definition.

  I'm in favor of dropping that part of the definition as it doesn't make
sense at all.  Given a web based proxy at <https://example.com/gemini>, web
crawlers will check for <https://example.com/robots.txt> for guidance, not
<https://example.com/gemini?gemini.conman.org/robots.txt>.  Web crawlers
will not be able to crawl gemini space for two main reasons:

        1. Most server certificates are self-signed and opt out of the CA
           business.  And even if a crawler were to accept self-signed
           (or non-standard CA signed) certificates, then---

        2. The Gemini protocol is NOT HTTP, so all such HTTP requests will
           fail anyway.

  -spc

10. Robert "khuxkm" Miles (khuxkm (a) tilde.team)

November 22, 2020 9:05 PM, "Sean Conner" <sean at conman.org> wrote:

> It was thus said that the Great Robert khuxkm Miles once stated:
> 
>> Is there any good usecase for a proxy User-Agent in robots.txt, other than
>> blocking web spiders from being able to crawl gemspace? If not, I would be
>> in favor of dropping that part of the definition.
> 
> I'm in favor of dropping that part of the definition as it doesn't make
> sense at all. Given a web based proxy at <https://example.com/gemini>, web
> crawlers will check for <https://example.com/robots.txt> for guidance, not
> <https://example.com/gemini?gemini.conman.org/robots.txt>. Web crawlers
> will not be able to crawl gemini space for two main reasons:
> 
> 1. Most server certificates are self-signed and opt out of the CA
> business. And even if a crawler were to accept self-signed
> (or non-standard CA signed) certificates, then---
> 
> 2. The Gemini protocol is NOT HTTP, so all such HTTP requests will
> fail anyway.
> 
> -spc

Well, the argument is that the crawler would access 
<https://example.com/gemini?gemini://gemini.conman.org/>, and from there 
it could access 
<https://example.com/gemini?gemini://zaibatsu.circumlunar.space/>, and 
then <https://example.com/gemini?gemini://gemini.circumlunar.space/>, and 
so on. However, I'd argue that the onus falls on example.com to set a 
robots.txt rule in <https://example.com/robots.txt> to prevent web 
crawlers from indexing anything with their proxy.

Just my two cents,
Robert "khuxkm" Miles

11. Drew DeVault (sir (a) cmpwn.com)

A web portal is a one-to-one mapping of a user request to a gemini
request. It's not an automated process. It's a genuine user agent, an
agent of a user. The level of traffic you'd receive from a web portal is
similar to the amount of traffic you'd receive from any other user
agent, and rate controls or access blocking don't make sense.

As the maintainer of such a web portal, I officially NACK any suggestion
that it should obey robots.txt, and will not introduce such a feature.

12. Sean Conner (sean (a) conman.org)

It was thus said that the Great Drew DeVault once stated:
> A web portal is a one-to-one mapping of a user request to a gemini
> request. It's not an automated process. It's a genuine user agent, an
> agent of a user. The level of traffic you'd receive from a web portal is
> similar to the amount of traffic you'd receive from any other user
> agent, and rate controls or access blocking don't make sense.
> 
> As the maintainer of such a web portal, I officially NACK any suggestion
> that it should obey robots.txt, and will not introduce such a feature.

  What's the IP address of your web portal, so I can block it and prevent
the various webbots that will go through your web portal and index the
Gemini content without my consent?

  -spc

13. Robert "khuxkm" Miles (khuxkm (a) tilde.team)

November 22, 2020 10:31 PM, "Sean Conner" <sean at conman.org> wrote:

> It was thus said that the Great Drew DeVault once stated:
> 
>> A web portal is a one-to-one mapping of a user request to a gemini
>> request. It's not an automated process. It's a genuine user agent, an
>> agent of a user. The level of traffic you'd receive from a web portal is
>> similar to the amount of traffic you'd receive from any other user
>> agent, and rate controls or access blocking don't make sense.
>> 
>> As the maintainer of such a web portal, I officially NACK any suggestion
>> that it should obey robots.txt, and will not introduce such a feature.
> 
> What's the IP address of your web portal, so I can block it and prevent
> the various webbots that will go through your web portal and index the
> Gemini content without my consent?
> 
> -spc

I assume Drew's smart enough to block web bots from crawling his gemini 
portal. Just saying.

Just my two cents,
Robert "khuxkm" Miles

14. Drew DeVault (sir (a) cmpwn.com)

On Sun Nov 22, 2020 at 10:31 PM EST, Sean Conner wrote:
> What's the IP address of your web portal, so I can block it and prevent
> the various webbots that will go through your web portal and index the
> Gemini content without my consent?

It's not an indexer. It's a user agent. And its IP address is
173.195.146.137.

Dick.

15. John Cowan (cowan (a) ccil.org)

On Sun, Nov 22, 2020 at 10:07 PM Drew DeVault <sir at cmpwn.com> wrote:

> A web portal is a one-to-one mapping of a user request to a gemini
> request. It's not an automated process. It's a genuine user agent, an
> agent of a user.
>

It is the agent of an arbitrarily large number of users.  That's the
difference between, say, an email user agent and an email gateway to a
non-Internet email system.  There is no reason to impose even soft
regulation on the former.  There is every reason to allow regulation of the
latter.


John Cowan          http://vrici.lojban.org/~cowan        cowan at ccil.org
The experiences of the past show that there has always been a discrepancy
between plans and performance.        --Emperor Hirohito, August 1945

16. Sean Conner (sean (a) conman.org)

It was thus said that the Great Robert khuxkm Miles once stated:
> November 22, 2020 10:31 PM, "Sean Conner" <sean at conman.org> wrote:
> 
> > It was thus said that the Great Drew DeVault once stated:
> > 
> >> A web portal is a one-to-one mapping of a user request to a gemini
> >> request. It's not an automated process. It's a genuine user agent, an
> >> agent of a user. The level of traffic you'd receive from a web portal is
> >> similar to the amount of traffic you'd receive from any other user
> >> agent, and rate controls or access blocking don't make sense.
> >> 
> >> As the maintainer of such a web portal, I officially NACK any suggestion
> >> that it should obey robots.txt, and will not introduce such a feature.
> > 
> > What's the IP address of your web portal, so I can block it and prevent
> > the various webbots that will go through your web portal and index the
> > Gemini content without my consent?
> > 
> > -spc
> 
> I assume Drew's smart enough to block web bots from crawling his gemini
> portal. Just saying.
> 
> Just my two cents,

  Drew's proxy is a webserver in its own right:

	https://git.sr.ht/~sircmpwn/kineto/tree/master/main.go

  It checks for a GET request for "/favicon.ico" but not to "/robots.txt".
Every other GET request is immediately proxied to a gemini server.  I think
it was meant to run locally, but he made an instance available on the public
Internet.
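
For comparison, a proxy that did answer "/robots.txt" itself could be sketched like this in Python. This is not kineto's code (which is written in Go); `handle_get`, `ROBOTS_BODY`, and the `proxy_to_gemini` callback are hypothetical names used only for the illustration.

```python
# Web-side policy: ask every web crawler to stay out of the proxy entirely.
ROBOTS_BODY = "User-agent: *\nDisallow: /\n"


def handle_get(path, proxy_to_gemini):
    """Answer /robots.txt locally; hand every other path to the Gemini proxy."""
    if path == "/robots.txt":
        return 200, "text/plain", ROBOTS_BODY
    return proxy_to_gemini(path)   # hypothetical callback doing the real work
```

Well-behaved web crawlers would then stop at the proxy's front door, while ordinary users are proxied as before.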

  -spc

17. Drew DeVault (sir (a) cmpwn.com)

On Sun Nov 22, 2020 at 11:51 PM EST, John Cowan wrote:
> It is the agent of an arbitrarily large number of users.

So is every other user agent. It will never make more requests than
there are users who are asking for content. It is not special.

18. Emilis (emilis (a) emilis.net)

On 11/23/20 2:30 AM, John Cowan wrote:
>
> By the same token, a public Gopher proxy (if there are any) should 
> respect "Disallow: gopherproxy".
>
> Other points:
> +1 for Allow:
> +1 for Virtual-Agent
> +1 for ignoring unknown lines
> Unsure what the difference is between Crawl-Delay: and Check:, but 
> having a retry delay is a Good Thing

A small nit-pick: if we use "Virtual-Agent" and "Crawl-Delay", we should 
at least use "gopher-proxy" instead of "gopherproxy".


--
Emilis Dambauskas
gemini://tilde.team/~emilis/

19. Drew DeVault (sir (a) cmpwn.com)

-1 to Virtual-Agent

I think that this is best formalized as an addendum to the existing
robots.txt conventions, which simply details a gemini-specific
interpretation as such.

20. marc (marcx2 (a) welz.org.za)

Hi

I suppose I am chipping in a bit too late here, but I think
the robots.txt thing was always a rather ugly mechanism - a
bit of an afterthought.

Consider the gemini://example.com/~somebody/personal.gmi -
if somebody wishes to exclude personal.gmi from being
crawled they need write access to example.com/robots.txt,
and how do we go about making sure that ~somebodyelse,
also on example.com doesn't overwrite robots.txt with
their own rules ?

Then there is the problem of transitivity - if we
have a portal, proxy or archive - how does it relay
the information to its downstream users ? See also
the exchange between Sean and Drew...

So the way I remember it, robots.txt was a quick hack
to prevent spiders getting trapped in a maze of
cgi generated data, and so hammering the server.
It wasn't designed to solve matters of privacy
and redistribution.

I have pitched this idea before: I think a footer containing
the license/rules under which a page can be distributed/cached
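is more sensible than robots.txt. This approach is:

* local to the page (no global /robots.txt)
* persistent (survives being copied, mirrored & re-exported)
* sound (one knows the conditions under which this can be redistributed)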

I speak under correction, but I believe a decent amount of the
public web was mined for faces to train the neural networks
that now make totalitarian surveillance possible. Had these
been labelled "CC ND (no derivative work)" then there
would be legal impediment - not to the regimes now, but to
the universities and research labs which pioneered this.

We now have people more aware of this problem, and some
of us wish to put up material limited to gemini-space only,
and not export it to the web. A footer line "-- GMI: A. User"
could prohibit export to the web, while one "-- CC-SA: J. Soap"
would permit it...
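
A proxy or archiver honouring such a footer might check it along these lines. This is a Python sketch under loud assumptions: the footer grammar (`-- TAG: holder` on the last non-blank line), the set of tags that permit web export, and the choice to refuse export when no footer is present are all invented for the example; the proposal above does not define them.

```python
import re

# Assumed footer shape: "-- TAG: holder", e.g. "-- CC-SA: J. Soap".
FOOTER = re.compile(r"^-- (?P<tag>[A-Za-z-]+): (?P<holder>.+)$")

# Assumption: only these tags permit export to the web; "GMI" means Gemini-only.
WEB_EXPORT_OK = {"CC-SA"}


def parse_footer(document: str):
    """Return (tag, holder) from the last non-blank line, or None."""
    lines = [line for line in document.splitlines() if line.strip()]
    if not lines:
        return None
    match = FOOTER.match(lines[-1].strip())
    return (match.group("tag"), match.group("holder")) if match else None


def may_export_to_web(document: str) -> bool:
    """Conservative assumption: no recognised footer means no export."""
    footer = parse_footer(document)
    return footer is not None and footer[0] in WEB_EXPORT_OK
```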

regards

marc

21. Johann Galle (johann (a) qwertqwefsday.eu)


On 24.11.2020, marc wrote:
> I suppose I am chipping in a bit too late here, but I think
> the robots.txt thing was always a rather ugly mechanism - a
> bit of an afterthought.

+1 that the robots.txt solution feels a lot like a hack.
  
> So the way I remember it, robots.txt was a quick hack
> to prevent spiders getting trapped in a maze of
> cgi generated data, and so hammering the server.
> It wasn't designed to solve matters of privacy
> and redistribution.

There is a more modern alternative to robots.txt which is the X-Robots-Tag
HTTP header and sounds like what you are trying to do here.

That said, there are probably people who will not want special headers to be
added [1], although I personally think that something like you suggest would not
be that "exploitable", especially because it is just part of the document's text.

[1] See the first sentence of §2.4 of the Gemini FAQ
     gemini://gemini.circumlunar.space/docs/faq.gmi
     https://gemini.circumlunar.space/docs/faq.html

22. Philip Linde (linde.philip (a) gmail.com)

On Tue, 24 Nov 2020 11:29:02 +0100
marc <marcx2 at welz.org.za> wrote:

> Consider the gemini://example.com/~somebody/personal.gmi -
> if somebody wishes to exclude personal.gmi from being
> crawled they need write access to example.com/robots.txt,
> and how do we go about making sure that ~somebodyelse,
> also on example.com doesn't overwrite robots.txt with
> their own rules ?

How the server produces responses to robots.txt requests is an
implementation detail. robots.txt can easily be implemented such that
the server responds with access information provided by files in
subdirectories. For example: a system directory corresponding to
/~somebody/ contains a file named ".disallow" containing
"personal.gmi". When the server builds a response to /robots.txt, it
considers the content of all ".disallow" files and includes Disallow
lines corresponding to their content. This way, individual users on a
multi-user system can decide for themselves the access policy for their
content without shared access to a canonical robots.txt.
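
The ".disallow" scheme described above could be sketched in Python as follows. The function name `build_robots` and the decision to emit a single `User-agent: *` block are assumptions made for the example, and entries are not sanitised here.

```python
import os


def build_robots(root: str) -> str:
    """Build a robots.txt body from ".disallow" files found under root."""
    lines = ["User-agent: *"]
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()                      # deterministic traversal order
        if ".disallow" not in filenames:
            continue
        rel = os.path.relpath(dirpath, root)
        prefix = "/" if rel == "." else "/" + rel.replace(os.sep, "/") + "/"
        with open(os.path.join(dirpath, ".disallow"), encoding="utf-8") as fh:
            for entry in fh:
                entry = entry.strip()
                if entry:                    # one path per non-blank line
                    lines.append("Disallow: " + prefix + entry)
    return "\n".join(lines) + "\n"
```

A server could call this on each request for /robots.txt, or regenerate the result only when a ".disallow" file changes.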

> I have pitched this idea before: I think a footer containing
> the license/rules under which a page can be distributed/cached
> is more sensible than robots.txt. This approach is:
> 
> * local to the page (no global /robots.txt)
> * persistent (survives being copied, mirrored & re-exported)
> * sound (one knows the conditions under which this can be redistributed)

What if my document is a binary file of some sort that I can not add a
footer to? The only ways to address this consistently for all document
types are to

a) Include the information in the response, *distinct* from its body
b) Provide the information in a sidecar file or sideband communication
   channel

-- 
Philip

23. Nick Thomas (gemini (a) ur.gs)

Hi,

On Sun, 2020-11-22 at 17:31 +0100, Solderpunk wrote:
> Hi folks,
> 
> There is now (finally!) an official reference on the use of
> robots.txt
> files in Geminispace.  Please see:
> 
> gemini://gemini.circumlunar.space/docs/companion/robots.gmi

Thanks for this. One change that I'd be interested in is adding a
statement that if there is no `robots.txt` for the site, we assume an
implicit disallow-all for all the virtual-agents except proxies.

Presumed consent, with opt-outs for the tiny minority of people who
have the time and mental space to work out how to get those opt-outs to
apply, is standard behaviour on the web, but it's not behaviour I like.
GitHub recently dumped code of mine into an arctic vault, for instance;
the archive.org snapshots of geminispace have similar dynamics. We can
do better by asking people to opt *in* to these kinds of things if they
want it, rather than to opt *out* if they don't.

I exclude Virtual-Agent: webproxy here because the likely use of such a
proxy is transient, rather than persistent. It seems odd to me that it
sits alongside indexing, archival, and research, all of which lead to
durable artifacts on success. It does complicate things a little to
treat it differently, though.

Thoughts? I appreciate this would impact on the ability of archivists
or researchers to capture geminispace, but I see that as a feature,
rather than an unfortunate side-effect :). 

/Nick

24. A. E. Spencer-Reed (easrng (a) gmail.com)

On Tue, Nov 24, 2020 at 6:42 AM Nick Thomas <gemini at ur.gs> wrote:

> Thoughts? I appreciate this would impact on the ability of archivists
> or researchers to capture geminispace, but I see that as a feature,
> rather than an unfortunate side-effect :).

I don't agree with archiving being disallowed by default. archive.org
and others have saved me so many times, I can't imagine why one would
not want an archive. If there is a reason I would much prefer an
opt-out system for it.
Why do you dislike archival?

25. James Tomasino (tomasino (a) lavabit.com)

Just an FYI on the recent discussion around implied license for search 
engines and archival: These aren't rules baked into a spec, they're 
implications of the DMCA in the US and relevant case law, such as BLAKE A. 
FIELD vs GOOGLE (2006). The existence of a mechanism to disallow indexing 
was vital to that decision establishing implied license. Search engines, 
whether they be our lovely friend GUS or some future behemoth, can gather, 
index, and cache as they see fit because there is a mechanism for you to 
say no. That mechanism is the robots.txt and they have a strong case 
saying that the rules which govern it are already well established.

As much as I'd love to wave a magic wand and say, "it's all opt-in here" 
we don't really have any legal footing to do so.

26. Jason McBrayer (jmcbray (a) carcosa.net)

"Drew DeVault" <sir at cmpwn.com> writes:

> A web portal is a one-to-one mapping of a user request to a gemini
> request. It's not an automated process. It's a genuine user agent, an
> agent of a user.

I believe the concern is not that a web portal will archive pages, or
run on its own as an automated process, but that it will be used by a
third-party web bot (i.e., one not run by the owner of the portal) to
crawl Gemini sites and index them on the web.

> As the maintainer of such a web portal, I officially NACK any
> suggestion that it should obey robots.txt, and will not introduce such
> a feature.

It seems to me that the correct thing is for people that run web portals
to have a very strong robots.txt on /their/ web site, and additionally,
to be proactive about blocking web bots that don't observe robots.txt. I
think people want to block web portals in their Gemini robots.txt
because they don't trust web portal authors to do those two things. I
understand the feeling, but they're still trusting web portal authors to
obey robots.txt, which is honestly more work.

-- 
+-----------------------------------------------------------+
| Jason F. McBrayer                    jmcbray at carcosa.net  |
| A flower falls, even though we love it; and a weed grows, |
| even though we do not love it.            -- Dogen        |

27. Drew DeVault (sir (a) cmpwn.com)

On Tue Nov 24, 2020 at 9:06 AM EST, Jason McBrayer wrote:
> I believe the concern is not that a web portal will archive pages, or
> run on its own as an automated process, but that it will be used by a
> third-party web bot (i.e., one not run by the owner of the portal) to
> crawl Gemini sites and index them on the web.

Aha, this is a much better point. One which should probably be addressed
in the robots.txt specification.

> It seems to me that the correct thing is for people that run web portals
> to have a very strong robots.txt on /their/ web site, and additionally,
> to be proactive about blocking web bots that don't observe robots.txt. I
> think people want to block web portals in their Gemini robots.txt
> because they don't trust web portal authors to do those two things. I
> understand the feeling, but they're still trusting web portal authors to
> obey robots.txt, which is honestly more work.

Web portals are users, plain and simple. Anyone who blocks a web portal
is blocking legitimate users who are engaging in legitimate activity.
This is a dick move and I won't stand up for anyone who does it.

However, the issue of web crawlers hitting geminispace through a web
portal is NOT that, and I'm glad you brought it up. I'm going to forbid
web crawlers from crawling my gemini portal.

28. James Tomasino (tomasino (a) lavabit.com)

On 11/24/20 1:15 PM, A. E. Spencer-Reed wrote:
> On Tue, Nov 24, 2020 at 6:42 AM Nick Thomas <gemini at ur.gs> wrote:
> 
>> Thoughts? I appreciate this would impact on the ability of archivists
>> or researchers to capture geminispace, but I see that as a feature,
>> rather than an unfortunate side-effect :).
> I don't agree with archiving being disallowed by default. archive.org
> and others have saved me so many times, I can't imagine why one would
> not want an archive. If there is a reason I would much prefer an
> opt-out system for it.
> Why do you dislike archival?

Denying archival is already possible with robots.txt in its present form. 
We don't need to edit the spec for that either. If you want to avoid the 
internet archive you can use:

User-agent: ia_archiver
Disallow: /

29. marc (marcx2 (a) welz.org.za)

Hi

> How the server produces responses to robots.txt requests is an
> implementation detail. robots.txt can easily be implemented such that
> the server responds with access information provided by files in
> subdirectories. For example: a system directory corresponding to
> /~somebody/ contains a file named ".disallow" containing
> "personal.gmi". When the server builds a response to /robots.txt, it
> considers the content of all ".disallow" files and includes Disallow
> lines corresponding to their content. This way, individual users on a
> multi-user system can decide for themselves the access policy for their
> content without shared access to a canonical robots.txt.

Note that the apache people worry about just doing a
stat() for .htaccess along a path. This proposal requires an
opendir() for *every* directory in the exported hierarchy.

I concede that this isn't impossible - it is potentially expensive,
messy or nonstandard (and yes, there are inotify tricks or
serving the entire site out of a database, but that isn't a
common thing).
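
To make the cost concrete, here is a rough sketch of the aggregation being discussed, assuming a plain directory tree on disk. The function name and the single "User-agent: *" group are illustrative assumptions, not something any existing server is known to do. Note that it has to visit every directory under the root:

```python
import os


def build_robots_txt(root):
    """Collect every ".disallow" file under root into one robots.txt body."""
    lines = ["User-agent: *"]
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # keep the traversal order deterministic
        if ".disallow" not in filenames:
            continue
        rel = os.path.relpath(dirpath, root)
        # Turn the on-disk directory into the URL path prefix it is served under.
        prefix = "/" if rel == "." else "/" + rel.replace(os.sep, "/") + "/"
        with open(os.path.join(dirpath, ".disallow"), encoding="utf-8") as fh:
            for entry in fh:
                entry = entry.strip()
                if entry:
                    lines.append("Disallow: " + prefix + entry)
    return "\n".join(lines) + "\n"
```

A server could call this per request, or cache the result and rebuild it only when asked.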

> > I have pitched this idea before: I think a footer containing
> > the license/rules under which a page can be distributed/cached
> > is more sensible than robots.txt. This approach is:
> > 
> > * local to the page (no global /robots.txt)
> > * persistent (survives being copied, mirrored & re-exported)
> > * sound (one knows the conditions under which this can be redistributed)
> 
> What if my document is a binary file of some sort that I can not add a
> footer to? The only ways to address this consistently for all document
> types are to
> 
> a) Include the information in the response, *distinct* from its body
> b) Provide the information in a sidecar file or sideband communication
>    channel

So I think this is the interesting bit of the discussion -
the tradeoff of keeping this information inside the file or
in a sidechannel. You are of course correct that not every
file format permits embedding such information, and that
is the one side of the tradeoff.... the other side is
the argument for persistence - having the data in another
file (or in a protocol header) means that is likely to be
lost.

And my view is that caching/archiving/aggregating/protocol
translation all involve making copies, where a careless or
inconsiderate intermediate is likely to discard information
not embedded in the file. For instance, if a web frontend
serves gemini://example.org/private.gmi as
https://example.com/gemini/example.org/private.gmi
how good are the odds that this frontend fetches
gemini://example.org/robots.txt, rewrites the urls in there
from /private.gmi to /gemini/example.org/private.gmi and
merges it into its own /robots.txt ? And does it before
any crawler request is made... 
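
The rewriting step itself is small. A sketch, assuming the /gemini/<host>/ URL layout from the example above; the function name is hypothetical, and as a deliberate simplification it ignores User-agent groups and applies every Disallow to all web crawlers:

```python
def rewrite_for_portal(gemini_robots, host, prefix="/gemini"):
    """Map Disallow paths from a capsule's robots.txt into portal URL space."""
    rewritten = []
    for line in gemini_robots.splitlines():
        field, sep, value = line.partition(":")
        path = value.strip()
        # Only non-empty Disallow rules are carried over; everything else is dropped.
        if sep and field.strip().lower() == "disallow" and path:
            rewritten.append(f"Disallow: {prefix}/{host}{path}")
    return rewritten


rules = rewrite_for_portal("User-agent: *\nDisallow: /private.gmi\n", "example.org")
print("\n".join(rules))
```

The hard part is not the rewrite but doing the fetch and merge before any web crawler arrives.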

A pragmatist's argument: The web and geminispace are a graph
of links, and all the interior nodes have to be markup, so those
are covered, and they control the reachability - without
a link you can't get to the terminal/leaf node. And even if
this is bypassed (robots.txt isn't really a defence against hotlinking
either) most other terminal nodes are images or video, which typically have
ways of adding meta information (exif, etc).

regards

marc

Link to individual message.

30. Nick Thomas (gemini (a) ur.gs)

(I could be a lot better at using mailing lists. I think this message
was sent privately in error).

On Tue, 2020-11-24 at 08:15 -0500, A. E. Spencer-Reed wrote:
> Why do you dislike archival?

Thanks for weighing in!

In short, because the purposes to which the archive can be put, and the
motives of the archiver, are not clear at time of robots.txt-mediated
archival.

For myself, I'm happy with some types of archival, and not happy with
some other types. Some people would be happy to be included in every
archive going; others, in none of them. Given this variability, we must
take a stance on what to assume if robots.txt isn't present. I also
don't think this variability is amenable to capture with more
fine-grained virtual agents.

The current internet-draft for robots.txt says, in 2.2.1:

>  If no group satisfies either condition, or no groups are present at
> all, no rules apply.

( https://tools.ietf.org/html/draft-koster-rep-00 )

This is pretty standard on the Web and, entirely coincidentally, a huge
boon to Google et al. Importing robots.txt the way we do in the
companion specification also imports this line.

However, unlike the Web, Gemini "takes user privacy very seriously".
Archives *can* be injurious to user privacy - if you need convincing on
this point, there are a range of cases and examples around GDPR "right
to be forgotten" stuff. From my perspective, Gemini is importing a line
from the internet-draft that is directly contrary to its mission.

Combining Gemini's mission with that realisation means that if no
statement has been made about whether the given user (server operator
in this specific case) is OK with their content being archived, the
presumption should be that they are not OK with it. We should value
user privacy above archiver convenience.

In effect, we add a second exception to the protocol that amends 2.2.1
to end "if no rules are specified, this robots.txt file MUST be
assumed".

On a practical level, being excluded from search engines by-default
drives the discoverability of robots.txt, and server software could
easily include flags like --permit-indexing or --permit-archival to
streamline that discoverability. I don't think that opt-in rates would
be similar to current opt-out rates on the Web.

/Nick

Link to individual message.

31. Johann Galle (johann (a) qwertqwefsday.eu)

Nick Thomas wrote:
> I don't think that opt-in rates would be similar to current opt-out rates
> on the Web.
This can probably be summed up with one question:
Why do we want a robots.txt in the first place? After all, if there were no
reasons against archival et al., we would not need a robots.txt at all. And
IMHO this also is the reason why it should rather be an opt-in system.

-- 
You can verify the digital signature on this email with the public key available
through web key discovery.


Link to individual message.

32. Nick Thomas (gemini (a) ur.gs)

On Tue, 2020-11-24 at 13:31 +0000, James Tomasino wrote:
> 
> As much as I'd love to wave a magic wand and say, "it's all opt-in
> here" we don't really have any legal footing to do so.
> 

James and I talked a bit more about this one on IRC. Key to this
argument, AIUI, is how robots.txt (or the lack of it) is treated for
FTP, which lacks any mention of it in the spec but has apparently been
given weight in DMCA-related rulings involving it.

I'm not sure I agree with the reasoning, which goes something like "the
robots.txt Internet-Draft is already de-jure part of Gemini, and we
can't change that", but IANAL ^^. In particular, I've been thinking
about this almost entirely in GDPR terms so far, and have a bunch of
DMCA-related reading to do now.

In the event that it *is* accurate, we talked about an alternative way
to implement the functionality.  Rather than having the gemini
robots.txt spec say "if the client doesn't receive a robots.txt, it
must assume this one", the *server* could be made to return a defined
robots.txt response body if it would otherwise issue a 51 response to
`/robots.txt`

(51 may be too specific, it could be 5x, but I don't *think* it would
be appropriate in response to 4x responses, which crawlers would be
expected to retry).

Of course, any server could do that already today, so the ask is to put
a recommendation about it into "server best practice", perhaps
incorporating the `--permit-indexing` and `--permit-archiving` flags I
talked about in another post.

Another advantage of this approach is that it becomes opaque to crawler
authors whether the user has explicitly selected a preference or not.
I'm also inclined to trust server implementors over crawler
implementors.

/Nick

p.s. there was also some question as to whether someone hosting gemini
content was a "gemini user", in the way we use that term on the project
homepage. To me, it seems like a reasonable extrapolation, but perhaps
it's a topic that deserves more debate or clarification.

Link to individual message.

33. James Tomasino (tomasino (a) lavabit.com)

On 11/24/20 5:12 PM, Nick Thomas wrote:
> On Tue, 2020-11-24 at 13:31 +0000, James Tomasino wrote:
>> As much as I'd love to wave a magic wand and say, "it's all opt-in
>> here" we don't really have any legal footing to do so.
>>
> James and I talked a bit more about this one on IRC. Key to this
> argument, AIUI, is how robots.txt (or the lack of it) is treated for
> FTP, which lacks any mention of it in the spec but has apparently been
> given weight in DMCA-related rulings involving it.
> 
> I'm not sure I agree with the reasoning, which goes something like "the
> robots.txt Internet-Draft is already de-jure part of Gemini, and we
> can't change that", but IANAL ^^. In particular, I've been thinking
> about this almost entirely in GDPR terms so far, and have a bunch of
> DMCA-related reading to do now.

In addition to FTP, gopher adopted the robots.txt standard almost immediately:

https://groups.google.com/g/comp.internet.net-happenings/c/Iv8ylGxvoh8?pli=1

You can read the IETF spec for the Robots Exclusion Protocol here:
https://tools.ietf.org/html/draft-rep-wg-topic-00

As you'll note in "2.3.  Access method", their documentation isn't scheme 
specific and they even list FTP as a valid option.

This is the document that will be used in court by anyone defending an 
indexer and any exclusion you want to obtain for Gemini would need to 
happen there. Having a contradictory statement in the Gemini spec will not 
stand up against the history and precedent of this one.

If you want to implement stronger protections in Gemini then I'd suggest 
adding a note in the best-practices document for server creators to (as 
Nick suggested) serve a robots.txt if no such file exists with the contents:

User-agent: *
Disallow: /

That achieves your aim of block-by-default and the opt-in would be the 
creation of a robots.txt file of your own.
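
A sketch of what that could look like inside a server, assuming the server has already worked out the (status, meta, body) it would otherwise send for /robots.txt. The function name is hypothetical, and whether the substitution should cover only 51 or all 5x codes is still an open question in this thread:

```python
DEFAULT_ROBOTS = "User-agent: *\nDisallow: /\n"


def robots_response(status, meta, body):
    """Replace a 51 reply to /robots.txt with a block-everything default."""
    if status == 51:
        return 20, "text/plain", DEFAULT_ROBOTS
    # Anything else, including 4x temporary failures that crawlers are
    # expected to retry, is passed through unchanged.
    return status, meta, body
```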

Link to individual message.

34. Solderpunk (solderpunk (a) posteo.net)

On Tue Nov 24, 2020 at 3:07 PM CET, Drew DeVault wrote:

> Web portals are users, plain and simple. Anyone who blocks a web portal
> is blocking legitimate users who are engaging in legitimate activity.
> This is a dick move and I won't stand up for anyone who does it.

This has actually long been a bit of a contentious point in the
Gopherverse, and we have inherited a bit of the controversy, if I
remember much earlier discussions accurately.  There are some people
(a vocal minority?  I'm not sure), who feel that public web proxies
exposing their Gopherhole/capsule to the entire browser-using world are
negating the agency they exercised in very deliberately putting some
content up only on Gopher/Gemini and not the web.  Web proxies force
them to be visible in (and linkable from) a space that they are actively
trying not to participate in.

While I am aware of the ultimate futility of trying to control where
publicly served online content ends up, I have some sympathy for this
perspective (perhaps even more so now that we have very nice tools like
your own Kineto by which people who *do* want their content to be
accessible from a browser can achieve this easily).  When the first web
portals for Gemini turned up, some people expressed interest in being
able to opt out, to keep their Gemini-only content truly Gemini-only,
and at least one of those early web portals (portal.mozz.us) agreed to
respect those wishes.  The webproxy user agent I put into the first
robots.txt draft is actually just codifying what portal.mozz.us has
already been doing for many months.  I did not expect its inclusion to
be so controversial.  I *did* try to word it carefully so that personal
webproxies which, e.g. run on a user's local machine and are not
publicly accessible need not abide by robots.txt, as those are really
just roundabout Gemini clients.

Cheers,
Solderpunk

Link to individual message.

35. Robert "khuxkm" Miles (khuxkm (a) tilde.team)

I am personally against this idea of forcing (or even normalizing) 
browsers giving special treatment to a request for a URL based on what the 
server would normally respond (I'm not even going to entertain the idea of 
pretending the internet draft doesn't apply to us). This is what I assume 
it would look like in spec (or best practices, or wherever you want to put it):

> When a client makes a request for a URI with a path component of
> "/robots.txt", and the server would normally respond to such a request
> with a 51 Not Found status code, it should instead respond with a 20
> status code, a MIME type of text/plain, and content of "User-Agent: 


Doesn't that just *feel* like a hack to you?

I did some research with GUS's known-hosts list. Of the 362 hosts known to 
GUS, only 36 have a robots.txt file, so any choice made as to what the 
default robots.txt should be will affect around 90% of Geminispace (not to 
mention any new hosts to come). Notably, of the 36 hosts to impose a 
robots.txt, 7 of them completely block archiving (although that number is 
skewed, as I know that at least 3 of those hosts are run by the same 
person, and 2 of those hosts are run by another person). This means that 
the share of hosts that don't want to be archived could be anywhere from 2%
(all of the hosts who don't have a robots.txt are fine with being archived)
to 20% (the sample of people who have robots.txt is representative of the
whole population), or even 91% (everybody without a 
robots.txt doesn't want to be archived). I don't feel comfortable making a 
declaration either way, but this is food for thought.
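
The arithmetic behind those three scenarios, as a quick sketch using only the counts given above (the variable names are illustrative). Computed without rounding by hand, the figures come out at about 1.9%, 19.4% and 92.0%, close to the rounded numbers quoted:

```python
total_hosts = 362      # hosts known to GUS
with_robots = 36       # hosts that serve a robots.txt
block_archiving = 7    # of those, hosts that block archiving completely

# Scenario 1: everyone without a robots.txt is fine with being archived.
lower = block_archiving / total_hosts
# Scenario 2: hosts with a robots.txt are representative of all hosts.
representative = block_archiving / with_robots
# Scenario 3: everyone without a robots.txt objects to being archived.
upper = (total_hosts - with_robots + block_archiving) / total_hosts

for label, share in [("lower", lower), ("representative", representative), ("upper", upper)]:
    print(f"{label}: {share:.1%}")
```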

Just my two cents,
Robert "khuxkm" Miles

Link to individual message.

36. Nick Thomas (gemini (a) ur.gs)

On Tue, 2020-11-24 at 19:08 +0000, Robert "khuxkm" Miles wrote:

> Doesn't that just *feel* like a hack to you?

It definitely feels hackish when worded like this :).

The precise technical form is secondary to the outcome (as I see it) of
protecting users from a privacy-hostile default in the robots.txt
specification. I appreciate that you're currently an opt-out, rather
than opt-in, advocate, but I'd still appreciate any ideas you have to
make it nicer *if* gemini ends up going for opt-in.

An alternative form that just came to mind is a server implementation
recommendation like this:

 ```
Geminispace crawlers use the /robots.txt request path to determine
whether a capsule can be accessed for archival, indexing, research, and
other purposes. This can have privacy implications for the user, so
servers should not start unless they have an explicit signal on how to
handle requests to the /robots.txt path.

For example, this signal may be the availability of any content for the
/robots.txt path, a user-added database entry indicating that the path
should receive a 5x response, or a non-default configuration parameter
specifying that it's OK to skip the check.

If no such signal is present, the server should emit an error message
and either exit immediately, or allow the user to specify how the path
should be handled.
 ```

As a new server operator with no idea about `robots.txt`, I'd run, say:

 ```
$ agate [::]:1995 mysite cert.pem key.rsa ur.gs

No robots.txt file present! Please create mysite/robots.txt, or re-run
Agate with --permit-robots to allow your content to be archived,
indexed, or otherwise used by automated crawlers of Geminispace
 ```


Some new operators would then go
off to learn about this robots.txt thing; others might shrug and just
add the --permit-robots flag. 
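
A sketch of the check itself, assuming the content directory is a plain folder on disk. The function name is hypothetical, and --permit-robots is only the flag proposed in this message, not an existing Agate option:

```python
import os


def startup_check(content_dir, permit_robots=False):
    """Return None if the server may start, otherwise an error message."""
    has_robots = os.path.isfile(os.path.join(content_dir, "robots.txt"))
    if has_robots or permit_robots:
        return None  # an explicit signal exists, so start normally
    return (
        "No robots.txt file present! Please create "
        + os.path.join(content_dir, "robots.txt")
        + ", or re-run with --permit-robots."
    )
```

A server would print the returned message and exit with a non-zero status when it is not None.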

>  Of the 362 hosts known to GUS, only 36 have a robots.txt file, so
> any choice made as to what the default robots.txt should be will
> affect around 90% of Geminispace 

Thanks for running the numbers on this. I agree with everything you
said based on them. That any change affects such a large proportion of
existing geminispace is especially worth emphasising.

/Nick

Link to individual message.

37. John Cowan (cowan (a) ccil.org)

On Tue, Nov 24, 2020 at 3:25 PM Nick Thomas <gemini at ur.gs> wrote:


> >  Of the 362 hosts known to GUS, only 36 have a robots.txt file, so
> > any choice made as to what the default robots.txt should be will
> > affect around 90% of Geminispace
>
> Thanks for running the numbers on this. I agree with everything you
> said based on them. That any change affects such a large proportion of
> existing geminispace is especially worth emphasising.
>

Why is that a Good Thing?  It's another piece of bureaucracy: 90% of hosts
were happy to be archived before, so now they have to write a robots.txt
file.  Although small for any one server operator, it is large when
multiplied by the number of servers there *will be*.  "Small Internet" does
not mean "Internet with only a few servers", AFAIK.

Two things about the Internet Archive:

1) It is a U.S. public library, which gives it special rights when it comes
to making copies.

2) Though it does not respect robots.txt, it is happy to make your content
invisible to archive users by informal request (or, of course, by a DMCA
takedown notice).



John Cowan          http://vrici.lojban.org/~cowan        cowan at ccil.org
Gules six bars argent on a canton azure 50 mullets argent
six five six five six five six five and six
   --blazoning the U.S. flag <http://web.meson.org/blazonserver>

Link to individual message.

38. James Tomasino (tomasino (a) lavabit.com)

On 11/24/20 11:44 PM, John Cowan wrote:
> 2) Though it does not respect robots.txt, it is happy to make your
> content invisible to archive users by informal request (or, of course,
> by a DMCA takedown notice).

The Internet Archive does respect robots.txt, though they're not happy
about it and have written on the subject a few times. I included a
snippet in an earlier email with their user-agent.

Link to individual message.

39. Nick Thomas (gemini (a) ur.gs)

On Tue, 2020-11-24 at 18:44 -0500, John Cowan wrote:
> On Tue, Nov 24, 2020 at 3:25 PM Nick Thomas <gemini at ur.gs> wrote:
> 
> > Thanks for running the numbers on this. I agree with everything you
> > said based on them. That any change affects such a large proportion
> > of
> > existing geminispace is especially worth emphasising.
> > 
> 
> Why is that a Good Thing? 

I very intentionally *didn't* say it was a good thing :). There are
many ways to interpret the data, but I'm still glad we have it.

> It's another piece of bureaucracy: 90% of hosts
> were happy to be archived before

You're presuming consent here. We don't actually *know* that said 90%
of hosts are happy to be archived; we only know that 90% of hosts
haven't included a robots.txt file, which could be for any one of a
multitude of reasons.


*If* a meaningful share of the people without robots.txt
files would actually prefer not to be included in archives when asked,
the current situation is not serving their privacy well, and gemini is
supposed to be protective of user privacy. *If* an overwhelming majority
of them simply don't care, then sure, the argument for it starts to
look a bit niche. Talking in IRC earlier today, I hand-waved a 5%
threshold for the first condition and 1% for the second.

A personal example: *I* didn't have a robots.txt file on my capsule
until today, but I don't want to be included in archives for various
reasons. Presuming consent from the lack of a robots.txt file would
have incorrectly guessed my preference, and harmed my privacy. Who else
in that 90% is like me? We don't know.

> so now they have to write a robots.txt
> file.  Although small for any one server operator, it is large when
> multiplied by the number of servers there *will be*.  "Small
> Internet" does
> not mean "Internet with only a few servers", AFAIK.

Yes, there is a convenience/privacy trade-off here. I interpret
gemini's mission to favour privacy over convenience when the two come
into conflict.

> Two things about the Internet Archive:
> 
> 1) It is a U.S. public library, which gives it special rights when it
> comes
> to making copies.

Certainly true, and there will be cases where, even when you do have a
wonderfully hand-crafted robots.txt file like the one I made today, an
archiver determines that they can legally scrape you anyway. Others
will scrape illegally, whether through malice or ignorance.

Meanwhile, Google, the Internet Archive, and a bunch of other people
respect robots.txt even when they might not be legally *required* to
via GDPR-like provisions. A control doesn't have to be perfect to be
desirable. This argument comes up in the context of "right to be
forgotten" quite a lot ^^.

> 2) Though it does not respect robots.txt, it is happy to make your
> content
> invisible to archive users by informal request (or, of course, by a
> DMCA
> takedown notice).

As I understand it, archive.org does respect robots.txt in general, but
has exceptions for certain sites it's identified it has a public
interest justification for. That includes the US military, but probably
doesn't include any currently-existing gemini site.

/Nick

Link to individual message.

40. Nick Thomas (gemini (a) ur.gs)

(Received off-list, but I assume it was *meant* for the list, so
replying there)

On Wed, 2020-11-25 at 00:36 -0500, John Cowan wrote:
> 
> I understand "user privacy" to mean the privacy of people using
> clients.
> What privacy do server operators expect to have, unless they are
> using
> client certs, firewalls, or other such blockers?  Barring those, they
> are
> serving content to all the world.

Yes, I've got a p.s. somewhere on the list around this potential
objection.

I don't think that server operators (perhaps better: "capsule authors")
have been explicitly ruled in when talking about user privacy in gemini
so far; but I also don't think they've been explicitly ruled out - it
just hadn't been a live issue until the first archiver showed up and
(presumably in response to that) the robots.txt spec was published.

I don't find it a stretch at all to see capsule authors as gemini
users, but if we were to end up excluding them from the category for
some reason, my proposal certainly looks a lot less interesting.

Whatever the outcome of the opt-in vs opt-out part of this discussion,
the robots.txt spec gives weight to the expressed preferences of
capsule authors. Crawler authors are being asked to respect those
preferences, and one of the possible motivations for that is a
recognition that the privacy of capsule authors is harmed by not
respecting their preferences.

Saying "I want to be in search indexes but not archives" is likely to
be motivated by privacy concerns, and an explicit robots.txt is one way
that I, as a capsule author, can expect to have privacy from archives.
If it's true for people with an explicit preference, it can also be
true for people who haven't expressed a preference yet. Since Gemini
has a higher standard for user privacy than the web, it can also have a
higher standard for these preferences - one that does not rely on
presumed consent - if we want it to.

> > As I understand it, archive.org does respect robots.txt in general,
> 
> Not since 2018.  See
> <https://help.archive.org/hc/en-us/articles/360004651732-Using-The-Wayback-Machine>,
> which was updated 5 days ago and says:

The FAQ immediately above the one you quoted reads:

> Why isn't the site I'm looking for in the archive?

> Some sites may not be included because the automated crawlers were
> unaware of their existence at the time of the crawl. It's also 
> possible that some sites were not archived because they were 
> password protected, blocked by robots.txt, or otherwise inaccessible 
> to our automated systems. Site owners might have also requested that 
> their sites be excluded from the Wayback Machine.

If archive.org didn't respect robots.txt at all, it would lend a lot of
flavour to the "archiver" virtual user-agent idea in the companion
spec, in addition to this discussion. Do you still have doubts after
reading this section?

/Nick

Link to individual message.

41. John Cowan (cowan (a) ccil.org)

On Wed, Nov 25, 2020 at 6:32 AM Nick Thomas <gemini at ur.gs> wrote:

> (Received off-list, but I assume it was *meant* for the list, so
> replying there)
>

It was, so thanks.  My private messages are labeled (Private message) at
the top because I make this mistake a lot.

> Whatever the outcome of the opt-in vs opt-out part of this discussion,
>

That's the only part that concerns me.  A robots.txt spec is good and
crawlers/archivers that respect it are fine too, though of course some
won't.

I once wrote to the author of a magazine article who had published a simple
crawler, pointing out that it would hammer whatever server it was crawling,
since it did not delay between requests or intersperse them with requests to
other servers, but simply walked the server's tree depth-first, and that it
should respect robots.txt.  He wrote back saying "That's the Internet
today; deal with it."  I could have answered (but I didn't) that hits are a
cost to the server operator, and anyone running his dumb crawler was not
only DDOSing, but spending my money for his own purposes.
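
For contrast, the missing courtesy is only a few lines. A sketch of a per-host throttle, where the class name and the default delay are arbitrary assumptions; the caller passes in the current time so the logic stays easy to test:

```python
class HostThrottle:
    """Track, per host, the earliest moment the next request may be sent."""

    def __init__(self, min_delay=1.0):
        self.min_delay = min_delay
        self._next_ok = {}

    def wait_time(self, host, now):
        """Return how long to sleep before hitting host, and reserve that slot."""
        start = max(self._next_ok.get(host, now), now)
        self._next_ok[host] = start + self.min_delay
        return start - now
```

A crawler would call time.sleep(throttle.wait_time(host, time.monotonic())) before each request, and could use the wait to fetch from other hosts instead.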

But I do think that once robots.txt support is in place, no robots.txt = no
expressed preference.

> If it's true for people with an explicit preference, it can also be
> true for people who haven't expressed a preference yet. Since Gemini
> has a higher standard for user privacy than the web, it can also have a
> higher standard for these preferences - one that does not rely on
> presumed consent - if we want it to.
>

By this logic, nobody should be able to access a Gemini server at all
unless the capsule author has expressed a preference for them to do so.
But to publish is to expose your work to the public.

> The FAQ immediately above the one you quoted reads:
>
> > Why isn't the site I'm looking for in the archive?
>
> > Some sites may not be included because the automated crawlers were
> > unaware of their existence at the time of the crawl. It's also
> > possible that some sites were not archived because they were
> > password protected, blocked by robots.txt, or otherwise inaccessible
> > to our automated systems. Site owners might have also requested that
> > their sites be excluded from the Wayback Machine.
>

I interpret that to mean that some sites were not crawled during the period
when the Archive was paying attention to robots.txt, and so their content
as of that date is unavailable.  Note the past tense:  "were [...]
protected by robots.txt" as opposed to "are protected".

> If archive.org didn't respect robots.txt at all, it would lend a lot of
> flavour to the "archiver" virtual user-agent idea in the companion
> spec, in addition to this discussion. Do you still have doubts after
> reading this section?
>

I have no doubt whatever that the crawler doesn't respect robots.txt.  I
could do a little experiment, though.

John Cowan          http://vrici.lojban.org/~cowan        cowan at ccil.org
The competent programmer is fully aware of the strictly limited size of his
own skull; therefore he approaches the programming task in full humility, and
among other things he avoids clever tricks like the plague.  --Edsger Dijkstra

Link to individual message.

42. Nick Thomas (gemini (a) ur.gs)

On Wed, 2020-11-25 at 10:10 -0500, John Cowan wrote:
> On Wed, Nov 25, 2020 at 6:32 AM Nick Thomas <gemini at ur.gs> wrote:
> > If it's true for people with an explicit preference, it can also be
> > true for people who haven't expressed a preference yet. Since
> > Gemini
> > has a higher standard for user privacy than the web, it can also
> > have a
> > higher standard for these preferences - one that does not rely on
> > presumed consent - if we want it to.
> > 
> 
> By this logic, nobody should be able to access a Gemini server at all
> unless the capsule author has expressed a preference for them to do
> so.
> But to publish is to expose your work to the public.

Browsing, indexing, research crawling, and archiving, are all distinct
things with distinct impacts on capsule author privacy. This is why my
opening email proposed that we retain presumed consent for browsing via
a proxy - it's a clear case of "one of these things is not like the
others", and the same is true for individual browsing.

This section was mostly aimed at establishing that capsule authors
should be thought of as gemini users, so it took some shortcuts on
presumed consent verbiage, which might not have been helpful.

For clarity: I think it's fine to presume consent for browsing (whether
through a proxy or not), and not fine to presume consent for archiving.
If adopted, this represents a significant enhancement to capsule author
privacy compared to web norms.

Presumed consent for indexing, I'm actually fairly marginal about. I do
think it's more appropriate to forbid than permit it, but not very
strongly.

> > > Why isn't the site I'm looking for in the archive?
> > > Some sites may not be included because the automated crawlers
> > > were
> > > unaware of their existence at the time of the crawl. It's also
> > > possible that some sites were not archived because they were
> > > password protected, blocked by robots.txt, or otherwise
> > > inaccessible
> > > to our automated systems. Site owners might have also requested
> > > that
> > > their sites be excluded from the Wayback Machine.
> 
> I interpret that to mean that some sites were not crawled during the
> period
> when the Archive was paying attention to robots.txt, and so their
> content
> as of that date is unavailable.  Note the past tense:  "were [...]
> protected by robots.txt" as opposed to "are protected".

I don't see any space at all to read it like that, not least due to the
references to "password protected" and "otherwise inaccessible" content
in exactly the same tense.

To me, it's crystal clear that the past tense is used here simply
because the crawl happened in the past.

I do have this blog post from April 2018, referencing archived blogs
from December 2017, where robots.txt being respected is a plot point:

https://blog.archive.org/2018/04/24/addressing-recent-claims-of-manipulated-blog-posts-in-the-wayback-machine/

The blog.archive.org rant about robots.txt not being suitable for
archivers was April 2017. That's the one that mentions they may not
respect robots.txt in general in the future; I'd really very strongly
expect a further blog post to appear if they start taking steps in that
direction.

It would definitely be interesting if you had an experiment or
reference demonstrating that archive.org ignores robots.txt in general,
but this page simply isn't it.

/Nick

Link to individual message.

43. Philip Linde (linde.philip (a) gmail.com)

On Tue, 24 Nov 2020 16:16:49 +0100
marc <marcx2 at welz.org.za> wrote:

> Note that the apache people worry about just doing a
> stat() for .htaccess along a path. This proposal requires an
> opendir() for *every* directory in the exported hierarchy.

Apache is designed to be able to serve large enterprises with high
request loads. The cause for their concern seems unlikely to apply to
multi-user Gemini hosts.

> I concede that this isn't impossible - it is potentially expensive,
> messy or nonstandard (and yes, there are inotify tricks or
> serving the entire site out of a database, but that isn't a
> common thing).

It's very much a matter of implementation. For example, if high
performance is a concern you can regenerate the information once per
minute rather than on a per-request basis, or on request from the users,
via a Gemini endpoint.
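
The "regenerate once per minute" idea can be sketched roughly as below. This is only an illustration in Python, not code from any existing Gemini server: the class name `CachedRobots`, the `build` callback (standing in for whatever walks the users' directories) and the 60-second default are all assumptions.

```python
import time
from typing import Callable, Optional


class CachedRobots:
    """Rebuild robots.txt at most once per `ttl` seconds, not on every request."""

    def __init__(self, build: Callable[[], str], ttl: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._build = build          # hypothetical: scans the hierarchy and returns the file body
        self._ttl = ttl
        self._clock = clock          # injectable so the behaviour can be tested without sleeping
        self._body: Optional[str] = None
        self._built_at = 0.0

    def get(self) -> str:
        """Return the cached body, rebuilding it only when it is missing or stale."""
        now = self._clock()
        if self._body is None or now - self._built_at >= self._ttl:
            self._body = self._build()
            self._built_at = now
        return self._body

    def invalidate(self) -> None:
        """Force a rebuild on the next get(), e.g. when a user asks for one."""
        self._body = None
```

A server would call `get()` when answering a request for /robots.txt, and could expose `invalidate()` through whatever endpoint users are given to request a refresh.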

That's however a good argument for an Allow directive corresponding to
Disallow, to be able to disallow by default and only allowing resources
lower down in the hierarchy explicitly, which allows for a "better safe
than sorry" approach to "prevent" a crawler from picking up resources
before the new robot rules have been picked up.
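
To make the "disallow by default, allow explicitly" idea concrete, here is a small Python sketch of how a crawler might evaluate such rules. It assumes longest-prefix-wins semantics, with Allow winning ties, as some web crawlers do; an Allow directive is only being argued for here, so treat all of this as hypothetical rather than as anything the companion spec defines.

```python
from typing import List, Tuple

# (directive, path prefix); directive is "allow" or "disallow"
Rule = Tuple[str, str]


def is_allowed(path: str, rules: List[Rule]) -> bool:
    """Decide access by the longest matching prefix; Allow wins a tie."""
    best_len = -1
    allowed = True  # no matching rule means no restriction
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue
        longer = len(prefix) > best_len
        tie_for_allow = len(prefix) == best_len and directive == "allow"
        if longer or tie_for_allow:
            best_len = len(prefix)
            allowed = directive == "allow"
    return allowed


# "Better safe than sorry": everything is off-limits unless explicitly opened up.
rules = [("disallow", "/"), ("allow", "/public/")]
```

With these rules, a new resource outside /public/ stays off-limits to a well-behaved crawler even before the robot rules have been updated for it.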

> So I think this is the interesting bit of the discussion -
> the tradeoff of keeping this information inside the file or
> in a sidechannel. You are of course correct that not every
> file format permits embedding such information, and that
> is the one side of the tradeoff.... the other side is
> the argument for persistence - having the data in another
> file (or in a protocol header) means that is likely to be
> lost.

What you're proposing is doubly effective in that data that isn't there


I appreciate your point, but "not every file format" is an
understatement. It's really only one file format that is controlled by
the Gemini spec right now: text/gemini. That's where we could add such
information and define it to be meaningful.

> And my view is that caching/archiving/aggregating/protocol
> translation all involve making copies, where a careless or
> inconsiderate intermediate is likely to discard information
> not embedded in the file.

A careless or inconsiderate intermediate is likely to discard
information, full stop. It's only respectful and considerate robots
that will recognize either approach.

> For instance, if a web frontend
> serves gemini://example.org/private.gmi as
> https://example.com/gemini/example.org/private.gmi
> how good are the odds that this frontend fetches
> gemini://example.org/robots.txt, rewrites the urls in there
> from /private.gmi to /gemini/example.org/private.gmi and
> merges it into its own /robots.txt ? And does it before
> any crawler request is made...

On the other hand, how likely is it that a web crawler will interpret
robot instructions from text/gemini-turned-html documents?
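
For reference, the rewriting step described in the quoted example is mechanically simple; the hard part is doing it for every proxied host before any crawler arrives. A minimal Python sketch follows, assuming the /gemini/<host>/<path> layout from the example above. The function name `rewrite_robots` is made up for illustration.

```python
def rewrite_robots(gemini_robots: str, host: str, prefix: str = "/gemini") -> str:
    """Map Disallow paths from a capsule's robots.txt into the proxy's URL space."""
    out = []
    for line in gemini_robots.splitlines():
        field, sep, value = line.partition(":")
        path = value.strip()
        if sep and field.strip().lower() == "disallow" and path.startswith("/"):
            # /private.gmi on example.org becomes /gemini/example.org/private.gmi
            out.append(f"Disallow: {prefix}/{host}{path}")
        else:
            # User-agent lines, comments, blank lines and empty Disallow pass through unchanged
            out.append(line)
    return "\n".join(out) + "\n"
```

Merging the result into the frontend's own /robots.txt, and keeping it fresh for each proxied capsule, is left out of this sketch.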

> A pragmatist's argument: The web and geminispace are a graph
> of links, and all the interior nodes have to be markup, so those
> are covered, and they control the reachability - without
> a link you can't get to the terminal/leaf node. And even if
> this is bypassed (robots.txt isn't really a defence against hotlinking
> either) most other terminal nodes are images or video, which typically have
> ways of adding meta information (exif, etc).

Do you propose to standardize extensions to Exif/ID3/Vorbis
comments/PDF metadata etc. as well as text/gemini? None of these
currently has a standard way to specify a robots policy; it seems
understood that it's not a concern of the file itself whether a crawler
should be able to download it if the file is ever served over a
crawlable graph.

Hotlinking is a different concern altogether. The purpose of robots.txt
is not to disallow hotlinking.

-- 
Philip

Link to individual message.

44. Sean Conner (sean (a) conman.org)

It was thus said that the Great Nick Thomas once stated:
> > 
> For clarity: I think it's fine to presume consent for browsing (whether
> through a proxy or not), and not fine to presume consent for archiving.
> If adopted, this represents a significant enhancement to capsule author
> privacy compared to web norms.

  The issue with proxying (especially via the web) is the web side.  Using a
webproxy that runs locally to browse Gemini sites via a browser is fine, but
it becomes problematic if said proxy is listening on a public IP address. 
It's not a matter of *if* but *WHEN* webbots of all types start hitting it,
and *those* are a mixture of indexer, archiver, research and other [1].  At
the very least, any web proxy should respond to "/robots.txt" and either
serve up a file, or have command line options to generate a response to
"/robots.txt" or at the very least (or as a default), send this:

	User-agent: *
	Disallow: /
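
  A minimal sketch of that default, using Python's standard http.server, might look like the following. The names `route` and `ProxyHandler` are invented for illustration, and the actual Gemini fetching is deliberately left out.

```python
from http.server import BaseHTTPRequestHandler
from typing import Optional, Tuple

# Default policy: ask every web robot to stay out of the proxy entirely.
ROBOTS_BODY = b"User-agent: *\nDisallow: /\n"


def route(path: str) -> Optional[Tuple[int, bytes]]:
    """Return (status, body) for paths the proxy answers itself, else None."""
    if path.split("?", 1)[0] == "/robots.txt":
        return 200, ROBOTS_BODY
    return None


class ProxyHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        answer = route(self.path)
        if answer is None:
            # Placeholder: a real proxy would fetch and translate the Gemini resource here.
            self.send_error(501, "Gemini proxying is not part of this sketch")
            return
        status, body = answer
        self.send_response(status)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

  Passing `ProxyHandler` to `http.server.HTTPServer` and calling `serve_forever()` would run it. A command line option could replace `ROBOTS_BODY` with the contents of a file.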

  This is the crux of the disagreement between myself and Drew---I didn't
explain my concerns very well, and he didn't pick up on the actual issue I
had (so my fault here).  A web proxy can inadvertently allow indexers,
archivers, researchers and others access to Gemini content.

  -spc

[1]	Indexers, archivers and research bots tend to respect robots.txt.
	It's the "other" class that don't.  These "other" bots are typically
	looking for exploits and there's not much you can do about these
	other than outright ban the IP they're coming from [2].  

[2]	And even then it's a game of "whack-a-mole", although if a web proxy
	sees a bunch of requests from a single IP address that result in a
	bunch of "not found" errors from Gemini (say, a threshold of 10
	such results in a row) then that IP is automatically banned for a
	period of time (say, 48 hours---enough to let it finish its job, but
	not forever since the list of IPs will grow).
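
	The rule in [2] could be sketched in Python roughly as follows. The
	class name `NotFoundBanList` and its interface are assumptions made
	for illustration; only the numbers (10 in a row, 48 hours) come from
	the text above.

```python
import time
from typing import Callable, Dict


class NotFoundBanList:
    """Ban an IP for a while after a run of consecutive "not found" results."""

    def __init__(self, threshold: int = 10, ban_seconds: float = 48 * 3600,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._threshold = threshold
        self._ban_seconds = ban_seconds
        self._clock = clock                      # injectable for testing
        self._streak: Dict[str, int] = {}        # consecutive not-found count per IP
        self._banned_until: Dict[str, float] = {}

    def is_banned(self, ip: str) -> bool:
        until = self._banned_until.get(ip)
        if until is None:
            return False
        if self._clock() >= until:
            # Expired bans are dropped so the list does not grow forever.
            del self._banned_until[ip]
            return False
        return True

    def record(self, ip: str, not_found: bool) -> None:
        """Call once per proxied request with the outcome of the Gemini fetch."""
        if not not_found:
            self._streak.pop(ip, None)           # any other result resets the run
            return
        count = self._streak.get(ip, 0) + 1
        if count >= self._threshold:
            self._banned_until[ip] = self._clock() + self._ban_seconds
            self._streak.pop(ip, None)
        else:
            self._streak[ip] = count
```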

Link to individual message.

45. John Cowan (cowan (a) ccil.org)

On Wed, Nov 25, 2020 at 12:01 PM Nick Thomas <gemini at ur.gs> wrote:

> It would definitely be interesting if you had an experiment or
> reference demonstrating that archive.org ignores robots.txt in general,
> but this page simply isn't it.


Okay, I rediscovered the page I was looking for:
<https://webmasters.stackexchange.com/questions/71377/how-to-properly-disallow-the-archive-org-bot-did-things-change-if-so-when/>.

Search on that page for "random item on eBay".  This test shows that as of
May 2017, the IA was supporting robots.txt.  I tried this myself, and it
shows three crawls, two in 2019 and one in 2020, that agree with what you
see ("Unknown item") if you follow the direct link.  This agrees with the
claim that robots.txt was turned off in 2018 for all sites; however,
apparently the IA is not announcing this.  My guess (only a guess) is that
IA thinks that people who don't want to be archived will start using less
reliable mechanisms like blocking IP addresses.

Now search farther down the page for "just did a quick test".  This shows
that as of March 2017 the IA was refusing to display pages put off-limits
by robots.txt, consistently with the above.  However, when the robots.txt
entry was removed, crawls from 2010 through 2017 suddenly appeared!  So
even in the pre-2018 regime, the IA was crawling the pages but hiding them.



John Cowan          http://vrici.lojban.org/~cowan        cowan at ccil.org
If I have seen farther than others, it is because I was looking through a
spyglass with my one good eye, with a parrot standing on my shoulder. --"Y"

Link to individual message.

46. Krixano (krixano (a) protonmail.com)

Gaah, ok, I hit reply instead of reply all... so my messages were sent 
directly to the people, lol. I'll repost them here:

I want to point out that making the assumption that a lack of robots.txt
is because servers don't mind their content being archived is a leap in logic
that doesn't actually follow/make sense. A server/user could have just forgotten
to put a robots.txt, *or* they could have just not known about it.

> A personal example: *I* didn't have a robots.txt on my capsule file
> until today, but I don't want to be included in archives for various
> reasons. Presuming consent from the lack of a robots.txt file would have
> incorrectly guessed my preference, and harmed my privacy. Who else in that
> 90% is like me? We don't know.

Exactly! When I first got my server up, I didn't have a robots.txt for the 
longest time. Some of my content was actually not supposed to be archived 
because it was dynamic stuff. And other stuff I didn't necessarily want archived.

Christian Seibold

Sent with [ProtonMail](https://protonmail.com/) Secure Email.

------- Original Message -------
On Wednesday, November 25th, 2020 at 9:10 AM, John Cowan <cowan at ccil.org> wrote:

> On Wed, Nov 25, 2020 at 6:32 AM Nick Thomas <gemini at ur.gs> wrote:
>
>> (Received off-list, but I assume it was *meant* for the list, so
>> replying there)
>
> It was, so thanks. My private messages are labeled (Private message) at
> the top because I make this mistake a lot.
>
>> Whatever the outcome of the opt-in vs opt-out part of this discussion,
>
> That's the only part that concerns me. A robots.txt spec is good and
> crawlers/archivers that respect it are fine too, though of course some won't.
>
> I once wrote to the author of a magazine article who had published a
> simple crawler that it would hammer whatever server it was crawling, since
> it did not delay between requests or intersperse them with requests to
> other servers, but simply walked the server's tree depth-first, and that
> it should respect robots.txt. He wrote back saying "That's the Internet
> today; deal with it." I could have answered (but I didn't) that hits are a
> cost to the server operator, and anyone running his dumb crawler was not
> only DDOSing, but spending my money for his own purposes.
>
> But I do think that once robots.txt support is in place, no robots.txt =
> no expressed preference.
>
>> If it's true for people with an explicit preference, it can also be
>> true for people who haven't expressed a preference yet. Since Gemini
>> has a higher standard for user privacy than the web, it can also have a
>> higher standard for these preferences - one that does not rely on
>> presumed consent - if we want it to.
>
> By this logic, nobody should be able to access a Gemini server at all
> unless the capsule author has expressed a preference for them to do so.
> But to publish is to expose your work to the public.
>
>> The FAQ immediately above the one you quoted reads:
>>
>>> Why isn't the site I'm looking for in the archive?
>>
>>> Some sites may not be included because the automated crawlers were
>>> unaware of their existence at the time of the crawl. It's also
>>> possible that some sites were not archived because they were
>>> password protected, blocked by robots.txt, or otherwise inaccessible
>>> to our automated systems. Site owners might have also requested that
>>> their sites be excluded from the Wayback Machine.
>
> I interpret that to mean that some sites were not crawled during the
> period when the Archive was paying attention to robots.txt, and so their
> content as of that date is unavailable. Note the past tense: "were [...]
> protected by robots.txt" as opposed to "are protected".
>
>> If archive.org didn't respect robots.txt at all, it would lend a lot of
>> flavour to the "archiver" virtual user-agent idea in the companion
>> spec, in addition to this discussion. Do you still have doubts after
>> reading this section?
>
> I have no doubt whatever that the crawler doesn't respect robots.txt. I
> could do a little experiment, though.
>
> John Cowan http://vrici.lojban.org/~cowan cowan at ccil.org
> The competent programmer is fully aware of the strictly limited size of his own
> skull; therefore he approaches the programming task in full humility, and among
> other things he avoids clever tricks like the plague. --Edsger Dijkstra

Link to individual message.

47. Krixano (krixano (a) protonmail.com)

I'm not sure why Internet Archive matters here. Just because they do
something doesn't mean it's the right thing to do. Seems like an appeal
to authority to me.

Christian Seibold

Sent with [ProtonMail](https://protonmail.com/) Secure Email.


Link to individual message.

48. Krixano (krixano (a) protonmail.com)

One more thing I want to point out... copyright law isn't opt-in. It's opt-out.
If you don't have a copyright statement or any other licensing information,
then "all rights reserved" is automatically assumed, afaik. You can't just copy
something just because the author didn't explicitly disallow you from doing that.

Christian Seibold

Sent with [ProtonMail](https://protonmail.com/) Secure Email.


Link to individual message.

49. Robert "khuxkm" Miles (khuxkm (a) tilde.team)

Hi Krixano,

A few thoughts. First of all, please use plaintext email in future
(https://useplaintext.email/#protonmail, as your signature clearly
indicated you use protonmail I can give you the direct link). It's easier
for a lot of us to access.

Regarding your first email:

November 26, 2020 1:09 AM, "Krixano" <krixano at protonmail.com> wrote:

> I want to point out that making the assumption that a lack of robots.txt
> is because servers don't mind their content being archived is a leap in logic
> that doesn't actually follow/make sense. A server/user could have just forgotten
> to put a robots.txt, *or* they could have just not known about it.
>
>> A personal example: *I* didn't have a robots.txt on my capsule file
>> until today, but I don't want to be included in archives for various
>> reasons. Presuming consent from the lack of a robots.txt file would have
>> incorrectly guessed my preference, and harmed my privacy. Who else in that
>> 90% is like me? We don't know.
>
> Exactly! When I first got my server up, I didn't have a robots.txt for
> the longest time. Some of my content was actually not supposed to be
> archived because it was dynamic stuff. And other stuff I didn't
> necessarily want archived.

It may be a leap of logic, sure, but it's the same leap of logic that has
been all but codified as law (see court cases like Field v. Google, where
judges have determined that not including a robots.txt or no-archive tag
grants an implied license to archive). As stated in the case summary of
Field v. Google:

> Author granted operator of Internet search engine implied license to
> display "cached" links to web pages containing his copyrighted works when
> author consciously chose not to include no-archive meta-tag on pages of
> his website, despite knowing that including meta-tag would have informed
> operator not to display "cached" links to his pages and that absence of
> meta-tag would be interpreted by operator as permission to allow access to
> his web pages via "cached" links.

Google's "cached" pages system is essentially an archive under a different
name. Is it truly a leap of logic if even a court of law comes to the same
decision?

November 26, 2020 1:12 AM, "Krixano" <krixano at protonmail.com> wrote:

> I'm not sure why Internet Archive matters here. Just because they do
> something doesn't mean it's the right thing to do. Seems like an appeal
> to authority to me.

It is an appeal to authority, but not a fallacious one. The Internet
Archive is (as far as I know) the biggest archiving group on the
Internet. If they do something, it's not entirely beyond reason to assume
other people do the same. A wide variety of people who do archiving that
I've spoken to have the same attitude: they'll still archive it, but they
won't make the archive available to the public if they aren't supposed to.

November 26, 2020 1:18 AM, "Krixano" <krixano at protonmail.com> wrote:

> One more thing I want to point out... copyright law isn't opt-in. It's opt-out.
> If you don't have a copyright statement or any other licensing information,
> then "all rights reserved" is automatically assumed, afaik. You can't just copy
> something just because the author didn't explicitly disallow you from doing that.

Again, see above; the law is on the side of the person assuming robots.txt
is a system for opting out of indexing/archiving/etc.

Just my two cents,
Robert "khuxkm" Miles

Link to individual message.

50. Krixano (krixano (a) protonmail.com)

> Google's "cached" pages system is essentially an archive under a
> different name. Is it truly a leap of logic if even a court of law comes
> to the same decision?

Yes, the court clearly made a leap in logic. Courts don't always follow logic,
because it's not efficient to do so.

Btw, the court case is only in the district of Nevada. And I'm honestly surprised
by this, considering that you do *not* have to explicitly assert your copyright
in order for copyright to apply. It seems this particular court thought caching
was an exception, unfortunately. Pretty disgusting.

Anyways, if I find any site archiving any of the stuff from my server, I'll be
looking into DMCA takedowns, because I don't tolerate utter disrespect for users'
content like that. It's disgusting.

Christian Seibold

Sent with ProtonMail Secure Email.


Link to individual message.

51. Krixano (krixano (a) protonmail.com)

My emails, afaik, should already be in plaintext. Not sure if something 
got messed up with a previous email, but anyways.... It shows Plain Text 
as being selected, so yeah.

Christian Seibold

Sent with ProtonMail Secure Email.


Link to individual message.

52. Björn Wärmedal (bjorn.warmedal (a) gmail.com)

> > Google's "cached" pages system is essentially an archive under a
> > different name. Is it truly a leap of logic if even a court of law comes
> > to the same decision?
>
> Yes, the court clearly made a leap in logic. Courts don't always follow logic,
> because it's not efficient to do so.

Courts don't always follow logic, but they often follow precedent.

> Anyways, if I find any site archiving any of the stuff from my server, I'll be
> looking into DMCA takedowns, because I don't tolerate utter disrespect for users'
> content like that. It's disgusting.

And a DMCA takedown notice is a legal measure, which needs a legal
footing. And it would get that by using robots.txt files as
established in precedent.

Link to individual message.

53. Krixano (krixano (a) protonmail.com)

The court case (Field v. Google) was only in the district of Nevada. It doesn't apply
to all of the US, and it doesn't apply to people outside of the US.

Christian Seibold

Link to individual message.

54. Robert "khuxkm" Miles (khuxkm (a) tilde.team)

November 26, 2020 2:24 AM, "Krixano" <krixano at protonmail.com> wrote:

>> Google's "cached" pages system is essentially an archive under a
>> different name. Is it truly a leap of logic if even a court of law comes
>> to the same decision?
> 
> Yes, the court clearly made a leap in logic. Courts don't always follow logic,
> because it's not efficient to do so.
> 
> Btw, the court case is only in the district of Nevada. And I'm honestly surprised
> by this, considering that you do *not* have to explicitly assert your copyright
> in order for copyright to apply. It seems this particular court thought caching
> was an exception, unfortunately. Pretty disgusting.
> 
> Anyways, if I find any site archiving any of the stuff from my server, I'll be
> looking into DMCA takedowns, because I don't tolerate utter disrespect for users'
> content like that. It's disgusting.
> 
> Christian Seibold

Obviously, Björn's counter-argument is correct. The courts follow
precedent, and this precedent already exists.

The only thing I want to add is: notice how the plaintiff Field didn't 
appeal. If he truly had a case, like you seem to believe he did, surely he 
would have appealed?

Just my two cents,
Robert "khuxkm" Miles

Link to individual message.

55. Krixano (krixano (a) protonmail.com)

He didn't have a case because courts rule on multiple things, not just one thing.
Stop trying to twist information. This is what the court ruled:

--------------------------------------

The District Court, Jones, J., held that:

    1.) Operator did not directly infringe on author's copyrighted works;
    2.) Author granted operator implied license to display "cached" links
        to web pages containing his copyrighted works;
    3.) Author was estopped from asserting copyright infringement claim against operator;
    4.) Fair use doctrine protected operator's use of author's works; and
    5.) Search engine fell within protection of safe harbor provision of
        Digital Millennium Copyright Act (DMCA).

Summary judgment for operator.

The court held that "Field decided to manufacture a claim for copyright 
infringement against Google in the hopes of making-money from Google's 
standard practice." The court then went on to rule in Google's favor on 
all of its defense theories.

--------------------------------------

What does this tell us? It tells us that even if he won the implied license,
he would have lost the case anyways because Google had Fair Use.

Anyways, you're the one who brought up this court case, not me. I don't agree with
the court, and I don't have to agree with the court, and neither does any other
gemini user. Mind you, the spec isn't for legality, it's for gemini users and what
they think. The gemini spec won't affect any legal things at all.


Christian Seibold

Link to individual message.

56. Robert "khuxkm" Miles (khuxkm (a) tilde.team)

November 26, 2020 2:39 AM, "Krixano" <krixano at protonmail.com> wrote:

> The court case (Field v. Google) was only in the district of Nevada. It doesn't apply
> to all of the US, and it doesn't apply to people outside of the US.

A precedent is a precedent is a precedent. A district court is, in fact, a
federal court, meaning that any district court in the US could see Field as
a precedent they should follow.

I'm sure there are other cases like Field v Google that hold the same thing
to be true, even in a European court; it's just that Field v Google was
brought up earlier in this thread, so it's the one I know about.

Just my two cents,
Robert "khuxkm" Miles

Link to individual message.

57. Krixano (krixano (a) protonmail.com)

I never argued it wasn't a precedent. However, it hasn't gone up to the
supreme court yet, who is the final arbiter for federal concerns.

Christian Seibold

Link to individual message.

58. Robert "khuxkm" Miles (khuxkm (a) tilde.team)

November 26, 2020 2:47 AM, "Krixano" <krixano at protonmail.com> wrote:

> He didn't have a case because courts rule on multiple things, not just one thing.
> Stop trying to twist information. This is what the court ruled:

I'm not trying to twist information. I feel like your argument hinges on
him having been able to also successfully argue the fair use angle.

> What does this tell us? It tells us that even if he won the implied license,
> he would have lost the case anyways because Google had Fair Use.

So an archive counts as fair use then. A non-commercial archive can use
Field as precedent: it's for archival purposes, the work is available for
free online, it may be a complete archive but the full work is available
for free online, and there's no market for someone's random prose that they
make available for free.

Ergo, anyone can make an archive of anything they aren't explicitly told
not to via robots.txt (at least in the US) and get away with it.

> Anyways, you're the one who brought up this court case, not me. I don't agree with
> the court, and I don't have to agree with the court, and neither does any other
> gemini user. Mind you, the spec isn't for legality, it's for gemini users and what
> they think. The gemini spec won't affect any legal things at all.

Okay, but "gemini users and what they think" won't matter. The only place
to seek relief is a court of law, and the court of law is firmly against
you here.

While I was drafting this you responded to my other email, so I'll merge 
the two replies here:

November 26, 2020 2:50 AM, "Krixano" <krixano at protonmail.com> wrote:

> I never argued it wasn't a precedent. However, it hasn't gone up to the
> supreme court yet, who is the final arbiter for federal concerns.

Well, if the case never made it to the Supreme Court, then the lower 
court's ruling stands. Ergo, it's still a precedent and most courts in the 
US would still follow it.

Just my two cents,
Robert "khuxkm" Miles

Link to individual message.

59. Sean Conner (sean (a) conman.org)

It was thus said that the Great Krixano once stated:
> 
> Exactly! When I first got my server up, I didn't have a robots.txt for the
> longest time. Some of my content was actually not supposed to be archived
> because it was dynamic stuff. And other stuff I didn't necessarily want
> archived.

  It is weird to think of autonomous agents crawling the Internet, but they
exist.  They can make requests just as humans (using a program) can make
requests.  The server has no concept of who or what is behind any given
request, and this is especially true for Gemini (as it has no concept of a
user-agent identifier being sent).  This was a problem with HTTP in the
early days as well, and in 1994 (only five years after HTTP was created) an
ad-hoc method was developed to help guide autonomous agents in avoiding
particular areas that could lead to infinite holes of requests.

  Yes, it's sad that you had to learn about this the hard way.  Yes, the
Gemini spec should make mention of the robots.txt standard, and perhaps
servers can issue a warning if a robots.txt file is missing.  Or perhaps
they can include a sample robots.txt file for the end user to modify.  I
just recently added a sample robots.txt file to my server source code [1].

  I first learned of robots.txt in the 90s.  I started seeing requests to
"/robots.txt" in the logs, and curious about it, found it was an ad-hoc
standard to control autonomous agents.  I wonder if making an autonomous
agent to *just* request /robots.txt, making it show up in logs [2], will do
any good.  This is how I also found out about /humans.txt [3] (and about a
bazillion ways a web server can be exploited, but I digress).

  -spc

[1]	https://github.com/spc476/GLV-1.12556

	It's under the share directory.  But I can see that I should clarify
	one of the comments in that file, because it will only block
	autonomous agents that follow robots.txt, as it's advisory and not
	something that can be automatically enforced.

[2]	I know logging is also pretty contentious in Geminispace.

[3]	http://humanstxt.org/
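
To illustrate the advisory nature of robots.txt noted in [1], here is a
minimal sketch of the check a well-behaved crawler might run before
fetching a page. It uses Python's standard urllib.robotparser module. The
function name may_crawl, the agent name "archiver", the example.org URLs,
and the default_allow switch (opt-out versus opt-in when a capsule has no
robots.txt, the point debated in this thread) are assumptions made for
illustration and are not taken from any particular crawler or server.

```python
from urllib.robotparser import RobotFileParser


def may_crawl(robots_txt, url, agent="archiver", default_allow=True):
    """Return True if `agent` may fetch `url` under the given policy.

    robots_txt is the text of the capsule's robots.txt, or None when the
    capsule does not serve one. default_allow is a hypothetical policy
    switch: True treats a missing file as permission (opt-out, the web
    convention); False treats it as refusal (opt-in).
    """
    if robots_txt is None:
        return default_allow
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # Rules are matched against the path only, so a gemini:// URL is
    # handled the same way as an http:// one.
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    rules = "User-agent: *\nDisallow: /private/\n"
    print(may_crawl(rules, "gemini://example.org/private/diary.gmi"))
    print(may_crawl(rules, "gemini://example.org/index.gmi"))
    print(may_crawl(None, "gemini://example.org/index.gmi", default_allow=False))
```

Nothing obliges a crawler to run a check like this, which is why
robots.txt can only guide the agents that choose to honour it.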

Link to individual message.

60. Krixano (krixano (a) protonmail.com)

First of all, let's not conflate a spec with law. The spec
doesn't have to follow law. A spec is a guideline, it doesn't
have to match law, and it doesn't have to be adhered to either.

Secondly, let's actually look at what the court ruled here, on the implied license front:
> consent to use the copyrighted work need not be manifested verbally and
> may be inferred based on silence where the copyright holder knows of the
> use and encourages it.

Notice the "where the copyright holder knows of the use and encourages it."
That's not necessarily the case in this discussion. It was the case in that
court case. That court case literally doesn't apply here. Especially since
Field explicitly added code so that search engines would index *the URL* of
the page. This is not the case in this discussion, as the absence of
robots.txt would *not* be explicitly allowing search engines to index the
URL of the page, and each server that doesn't have a robots.txt would not
"know of the use and encourage it".

Finally, precedents can be challenged by the Supreme Court. For example, 
the current Supreme Court case of Google v. Oracle dismissed everything 
the district courts and the Circuits had to say, because the Supreme Court 
looks at things freshly.

Christian Seibold

Link to individual message.

61. Robert "khuxkm" Miles (khuxkm (a) tilde.team)

This conversation is getting away from Gemini, so I'm going to wrap it up 
here and let us agree to disagree.

November 26, 2020 3:11 AM, "Krixano" <krixano at protonmail.com> wrote:

> First of all, lets not conflate a spec with law. The spec
> doesn't have to follow law. A spec is a guideline, it doesn't
> have to match law, and it doesn't have to be adhered to either.

Okay, but if you want relief for the law being broken (i.e., your
copyright being infringed), you have to go in front of a court of law.

> Secondly, let's actually look at what the court ruled here, on the
> implied license front:
>
>> consent to use the copyrighted work need not be manifested verbally and
>> may be inferred based on silence where the copyright holder knows of the
>> use and encourages it.
>
> Notice the "where the copyright holder knows of the use and encourages it."
> That's not necessarily the case in this discussion. It was the case in
> that court case. That court case literally doesn't apply here. Especially
> since Field explicitly added code so that search engines would index
> *the URL* of the page. This is not the case in this discussion as the
> absence of robots.txt would *not* be explicitly allowing search engines
> to index the URL of the page, and each server that doesn't have a
> robots.txt would not "know of the use and encourage it".

I don't know where you got the idea that Field added code to make the
engine index the URL -- that's what a search engine does -- but I don't
care at this point.

> Finally, precedents can be challenged by the Supreme Court. For example,
> the current Supreme Court case of Google v. Oracle dismissed everything
> the district courts and the Circuits had to say, because the Supreme
> Court looks at things freshly.

Google v Oracle is an *ongoing* case. No precedent was set, because the 
case never actually came to rest. See the EFF's page on it:

https://www.eff.org/cases/oracle-v-google

Just my two cents,
Robert "khuxkm" Miles

Link to individual message.

62. Krixano (krixano (a) protonmail.com)

Yes, it's an ongoing case, but I actually read the whole case, and I'm
almost 100% positive they are going to rule more in favor of Oracle,
because Google made stupid claims, one of which is that software is
patentable, btw.

If you want to learn more about this, I would suggest this video series:
https://caseorcontroversy.com/

Btw, there *was* precedent set in the lower districts of this case. To say
there was no precedent set is misinformation.

Christian Seibold

Link to individual message.

63. Krixano (krixano (a) protonmail.com)

Correction, google made the case that APIs are patentable. Same difference, but still.

Christian Seibold

Link to individual message.

64. Luke Emmet (luke (a) marmaladefoo.com)



On 25-Nov-2020 00:18, Nick Thomas wrote:
>
> You're presuming consent here. We don't actually *know* that said 90%
> of hosts are happy to be archived; we only know that 90% of hosts
> haven't included a robots.txt file, which could be for any one of a
> multitude of reasons.
>
> *If* a not-insignificant proportion of those hosts without robots.txt
> files would actually prefer not to be included in archives when asked,
> the current situation is not serving their privacy well, and gemini is
> suppose to be protective of user privacy. *If* an overwhelming majority
> of them simply don't care, then sure, the argument for it starts to
> look a bit niche. Talking in IRC earlier today, I hand-waved a 5%
> threshold for the first condition and 1% for the second.
>
> A personal example: *I* didn't have a robots.txt on my capsule file
> until today, but I don't want to be included in archives for various
> reasons. Presuming consent from the lack of a robots.txt file would
> have incorrectly guessed my preference, and harmed my privacy. Who else
> in that 90% is like me? We don't know.
>
Hello all

Personally, I'm not really that interested in the legal arguments back 
and forth about archiving and access. Yes there are some legal case 
precedents in this area in some jurisdictions, but I would say that by 
and large that ship has sailed. Sorry about that folks. The web is the 
de-facto baseline reference in this respect, whether we like it or not.

If you *publish* information on the internet, there *will* be actors who 
will re-purpose it. Gemini is no different to the web in this.

If any of us have information that is to be preserved as private, I 
cannot see how you can expect that to be achieved if you publish on the 
public internet (i.e. servers that do not require authentication). If 
you want to hide something, use authentication or a private channel.

Yes, there is robots.txt, which is an opt-out mechanism from general
robot access to a server's content. It is established practice and good
actors will respect it. But it cannot be a mechanism to preserve privacy.

My take on the whole "Gemini preserves privacy better" is really about 
clients. We don't have extended headers, cookies or agent names in 
requests. So to that extent, client privacy is maintained better than 
the web, where the expectation is of long term, cross-session tracking. 
Thankfully, we don't have that.

I don't see it as Gemini's role to attempt to set a cultural/legal 
privacy framework for servers who are choosing to publish on Gemini. We 
cannot imagine we can break new ground in this respect. We can, however,
make an effort to have this as a side effect of technical design in the
protocol itself, and within the Gemini community we can look out for 
risks in exposing such personal information via the protocol.

If Gemini ever becomes interesting enough to the outside world that some 
case goes to court (what a publicity success that would be!), surely the 
existing infrastructure of public server hypertext systems, namely the 
web, will be the established precedent.

So I support use of robots.txt, but if none exists, the presumption - 
like the web -  is that access and usage is allowed. If some actor 
doesn't follow a server's robots.txt, I'm sad about it, but we should 
ultimately expect it.

  - Luke

Link to individual message.

65. marc (marcx2 (a) welz.org.za)


Hello Christian

> One more thing I want to point out... copyright law isn't opt-in. It's opt-out.
> If you don't have a copyright statement or any other licensing information,
> then "all rights reserved" is automatically assumed, afaik. You can't just copy
> something just because the author didn't explicitly disallow you from doing that.

Yes - copyright legislation hasn't been repealed :-)


world some implied license. The convention which has evolved for
the web is that without a robots.txt forbidding it, crawlers
are free to index and cache, and some other things too. The
boundaries of this are fuzzy, because the conditions weren't
stated at the outset.

But gemini isn't the web, and gemini is new, so maybe we can
do better and *not* rely on an implied license (all humans may
visit this capsule), and then a robots.txt for just one single
bit of extra information (autonomous software can crawl it too,
if not forbidden).

So many thoughtful people are hesitant to put their data
online - they fear that this may disadvantage them in
future - maybe they worry about employer discrimination, doxxing
or biometric harvesting (from facial detail to writing style)
or things not yet invented.

Given that everybody has different tolerances, a mechanism
whereby people can state their preferences would be a good
thing.

Blindly copying the web robots.txt mechanism seems to be too
coarse/too vague, and too easily decoupled.

regards

marc

-- CC-SA

Link to individual message.

66. Krixano (krixano (a) protonmail.com)

My arguments weren't just about privacy. They were also about copyright.
Sharing on the internet is fine, but copyright still applies.

Secondly, you can share something for free online for a short period of time, and
then remove it after that time limit. This was done with a lot of books during a
portion of the Covid pandemic we are in. To say that archives should be able to
permanently cache this without explicit permission makes no logical sense.

Anyways, back to my original argument, caching should be opt-in. It makes the most sense.


Christian Seibold

Sent with ProtonMail Secure Email.

------- Original Message -------

On Thursday, November 26th, 2020 at 4:15 AM, Luke Emmet <luke at marmaladefoo.com> wrote:

> On 25-Nov-2020 00:18, Nick Thomas wrote:
>
> > You're presuming consent here. We don't actually know that said 90%
> > of hosts are happy to be archived; we only know that 90% of hosts
> > haven't included a robots.txt file, which could be for any one of a
> > multitude of reasons.
> >
> > If a not-insignificant proportion of those hosts without robots.txt
> > files would actually prefer not to be included in archives when asked,
> > the current situation is not serving their privacy well, and gemini is
> > supposed to be protective of user privacy. If an overwhelming majority
> > of them simply don't care, then sure, the argument for it starts to
> > look a bit niche. Talking in IRC earlier today, I hand-waved a 5%
> > threshold for the first condition and 1% for the second.
> >
> > A personal example: I didn't have a robots.txt on my capsule file
> > until today, but I don't want to be included in archives for various
> > reasons. Presuming consent from the lack of a robots.txt file would
> > have incorrectly guessed my preference, and harmed my privacy. Who else
> > in that 90% is like me? We don't know.
>
> Hello all
>
> Personally, I'm not really that interested in the legal arguments back
> and forth about archiving and access. Yes there are some legal case
> precedents in this area in some jurisdictions, but I would say that by
> and large that ship has sailed. Sorry about that folks. The web is the
> de-facto baseline reference in this respect, whether we like it or not.
> If you publish information on the internet, there will be actors who
> will re-purpose it. Gemini is no different to the web in this.
>
> If any of us have information that is to be preserved as private, I
> cannot see how you can expect that to be achieved if you publish on the
> public internet (i.e. servers that do not require authentication). If
> you want to hide something, use authentication or a private channel.
>
> Yes there is robots.txt which is an opt-out mechanism, from general
> robot access to a server's content. It is established practice and good
> actors will respect it. But it cannot be a mechanism to preserve privacy.
>
> My take on the whole "Gemini preserves privacy better" is really about
> clients. We don't have extended headers, cookies or agent names in
> requests. So to that extent, client privacy is maintained better than
> the web, where the expectation is of long term, cross-session tracking.
> We don't thankfully have that.
>
> I don't see it as Gemini's role to attempt to set a cultural/legal
> privacy framework for servers who are choosing to publish on Gemini. We
> cannot imagine we can break new ground in this respect. We can however
> do our efforts to have this as a side effect of technical design in the
> protocol itself, and within the Gemini community we can look out for
> risks in exposing such personal information via the protocol.
>
> If Gemini ever becomes interesting enough to the outside world that some
> case goes to court (what a publicity success that would be!), surely the
> existing infrastructure of public server hypertext systems, namely the
> web, will be the established precedent.
>
> So I support use of robots.txt, but if none exists, the presumption -
> like the web - is that access and usage is allowed. If some actor
> doesn't follow a server's robots.txt, I'm sad about it, but we should
> ultimately expect it.
>
> -   Luke

Link to individual message.

67. Krixano (krixano (a) protonmail.com)

> *But* by putting things on the web, the creator has granted the
> world some implied license.

This is not true. The only implied license is to view
the thing put online. Redistributing it is not implied by putting
something online, and neither is modifying, unless it's
under Fair Use (a transformative work).

Christian Seibold

Sent with ProtonMail Secure Email.

------- Original Message -------

On Thursday, November 26th, 2020 at 4:18 AM, marc <marcx2 at welz.org.za> wrote:

> Hello Christian
>
> > One more thing I want to point out... copyright law isn't opt-in. It's opt-out.
> > If you don't have a copyright statement or any other licensing information,
> > then "all rights reserved" is automatically assumed, afaik. You can't just copy
> > something just because the author didn't explicitly disallow you from doing that.
>
> Yes - copyright legislation hasn't been repealed :-)
>
> But by putting things on the web, the creator has granted the
> world some implied license. The convention which has evolved for
> the web is that without a robots.txt forbidding it, crawlers
> are free to index and cache, and some other things too. The
> boundaries of this are fuzzy, because the conditions weren't
> stated at the outset.
>
> But gemini isn't the web, and gemini is new, so maybe we can
> do better and not rely on an implied license (all humans may
> visit this capsule), and then a robots.txt for just one single
> bit of extra information (autonomous software can crawl it too,
> if not forbidden).
>
> So many thoughtful people are hesitant to put their data
> online - they fear that this may disadvantage them in
> future - maybe they worry about employer discrimination, doxxing
> or biometric harvesting (from facial detail to writing style)
> or things not yet invented.
>
> Given that everybody has different tolerances, a mechanism
> whereby people can state their preferences would be a good
> thing.
>
> Blindly copying the web robots.txt mechanism seems to be too
> coarse/too vague, and too easily decoupled.
>
> regards
>
> marc
>
> -- CC-SA

Link to individual message.

68. James Tomasino (tomasino (a) lavabit.com)

On 11/26/20 10:27 AM, Krixano wrote:
>> *But* by putting things on the web, the creator has granted the
>> world some implied license.
> 
> This is not true. The only implied license is to view
> the thing put online. Redistributing it is not implied by putting
> something online, and neither is modifying, unless it's
> under Fair Use (a transformative work).
> 
> Christian Seibold

Wow this thread blew up overnight. Anyway, I was the one that first posted 
about Field v. Google as one example case about litigation related to 
search engines and copyright. In an effort to avoid more "someone is wrong 
on the internet" arguments, here's the crux:

- If you as a copyright holder want to deny your content being cached and 
served by a 3rd party (for instance a search engine) you have a well known 
mechanism to do so in robots.txt.

- If your content is archived or cached against your desires your means of 
remediation are legal ones. Taking the issue to court will result in a 
court deciding if you are within your rights to protect your content or if 
the searcher/archiver/indexer is under fair use.

- The rules around copyright and media protections are established in each 
country, but are nearly universally applied worldwide via the Berne 
Convention and/or agreements like the Electronic Commerce Directive. 

- Existing legal precedent suggests you can expect a ruling in favor of 
implied consent if you do not have a robots.txt. 

All of this is to suggest we save ourselves the trouble down the road and 
just use robots.txt as-is.
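
The opt-out convention in the points above can be sketched in Python with
the standard library's robots.txt parser. This is only an illustration: the
function name may_archive, the "archiver" user-agent string and the
example.org URL are assumptions made for the example, not something defined
in this thread.

```python
from typing import Optional
from urllib.robotparser import RobotFileParser


def may_archive(robots_txt: Optional[str], url: str, agent: str = "archiver") -> bool:
    """Return True if `agent` may fetch `url` under an opt-out policy.

    `robots_txt` is the body of the capsule's robots.txt, or None when the
    capsule does not serve one. (Hypothetical helper, for illustration only.)
    """
    if robots_txt is None:
        # Opt-out convention: a missing robots.txt is read as "allowed".
        return True
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # A capsule that opts out of archiving entirely (assumed example policy).
    policy = "User-agent: archiver\nDisallow: /\n"
    print(may_archive(None, "gemini://example.org/log.gmi"))
    print(may_archive(policy, "gemini://example.org/log.gmi"))
```

Under the opt-in model argued for elsewhere in this thread, the branch for a
missing robots.txt would return False instead.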


Finally, and completely unrelated to everything: it was Oracle who tried 
to claim their APIs via patent rather than the other way around. See:
https://en.wikipedia.org/wiki/Google_LLC_v._Oracle_America,_Inc.#First_phase:_API_copyrightability_and_patents

Link to individual message.

69. marc (marcx2 (a) welz.org.za)

Hi

> I don't see it as Gemini's role to attempt to set a cultural/legal privacy
> framework for servers who are choosing to publish on Gemini. We cannot
> imagine we can break new ground in this respect.

That seems ... rather defeatist. 

Alasdair Gray provides an inspirational quote for a situation like this:

  "Work as if you live in the early days of a better nation"

(apparently later he wanted to say world, but nation had stuck...)

Gemini is still a young project, where a different culture
and nicer norms could be established...

Long ago, before the web, when the internet was young
somebody grabbed the jokes from rec.humor.funny (I think, might
have been another newsgroup) and published them in book form. Some
posters were outraged at the copyright violation, others
flattered. 

Had the individual posters just had a way of telling us
how their material could have been re-used, there would
have been no controversy, and maybe this would have
laid the groundwork for a different way of aggregating
online material, with internet editors neatly assembling
"best-ofs" or "my conversations-with-..." and people
optimising their comments for quotability or adding
footnotes and expansions to posts they were keen to
improve... instead of just feeble likes.

TLDR: I can imagine it. 

regards

marc

Link to individual message.

70. Luke Emmet (luke (a) marmaladefoo.com)



On 26-Nov-2020 16:24, marc wrote:
>> I don't see it as Gemini's role to attempt to set a cultural/legal privacy
>> framework for servers who are choosing to publish on Gemini. We cannot
>> imagine we can break new ground in this respect.
> That seems ... rather defeatist.
>
> Alasdair Gray provides an inspirational quote for a situation like this:
>
>    "Work as if you live in the early days of a better nation"

Well, I wasn't expecting to have my Utopian credentials questioned ;-)

After all, I am a proponent of Gemini like everyone else here, pushing 
against the flow.

But it's true I'm probably towards the pragmatic end of the scale, and I 
like to see people discussing subjects I find to be productive. Trying 
to establish alternative IPR legal precedents, contrary to the flow of 
what happens on the web seems like a lot of work to me and we can spin a 
lot of cycles doing so. But if it rings your bell, by all means continue.

I'm all for building a nice culture among the gemini-folk, but wider 
cultural changes happen slowly in my experience.

  - Luke

Link to individual message.
