WWW indexing concerns (was: Gemini Universal Search)

Andrew Kennedy <andrew (a) 68kmentat.com>

Very cool! I want to make two points: first, I am glad to see this. As 
geminispace gets larger, a search method that doesn't rely on personal 
bookmarks will be really useful. I forget to bookmark gopher documents all 
the time, and Veronica really helps there.

As for crawling in general, this is something that I've been thinking 
about lately, and I guess now it's timely: 

HTTP/S gemini proxies allow all of the public geminispace to be indexed by 
Google and other services. I think that we generally consider the two 
current proxies to be useful (I certainly do), so to a point this is unavoidable.

proxy.vulpes.one has a robots.txt to prevent WWW crawling, which is a fair 
stopgap. portal.mozz.us does not, but I'm not angry about it. I meant to 
send out an e-mail to mozz to ask their opinion, but just haven't gotten around to it.

My issue here is that the only two ways to opt out of Google indexing are to 
use a robots.txt, or to register yourself as the domain owner and control 
indexing via Google's console. Both of those methods apply to the proxy 
website's owner, not to the gemini server's owner, because each gemini 
server appears like a subdirectory on the proxy's domain.
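
For illustration, a proxied page shows up to web crawlers under a path on 
the proxy's own domain, something like this (the exact path layout varies 
by proxy, so treat this as a made-up example):

    gemini://example.org/page.gmi  ->  https://proxy.example/gemini/example.org/page.gmi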

So the issue here is that the only way to opt out of being indexed is to 
contact each proxy maintainer and request that they make accommodations 
for you. That's fine with only 15 or so gemini servers, but not fair to 
proxy maintainers as gemini grows. It's also not enough to ask all proxies 
to use robots.txt, because there's nothing stopping someone from ignoring 
it either out of ignorance or in bad faith.

Perhaps there isn't much that can be done here, and this e-mail is little 
more than me venting a concern. I realize that the only way to stay 
anonymous on the internet is to constantly maintain your anonymity. I'm 
doing that in some places. But my gemini server's domain name is already 
tied to my IRL identity, and I wish that it was at least harder for my 
gemini files to be on the 1st page of a Google result.

This e-mail got a little long, and it's less formal than some of the other 
discussions. Sorry if it's bothersome. I'm not really a programmer, so I 
can't offer any solutions. But I wanted to throw this conversation into 
the general dialogue in case anyone has any thoughts or ideas here. Gemini 
is *not* the Web, after all.

- m68k


 ---- On Wed, 26 Feb 2020 07:00:02 -0500  <gemini-request at 
lists.orbitalfox.eu> wrote ----

 > One technical question is the issue of how server admins can opt out
 > of having their stuff crawled.  GUS currently recognises a /robots.txt
 > resource with (I presume) identical syntax to that used for HTTP.
 > This is certainly one potential solution to the problem (and perhaps
 > the most sensible one), but we might want to consider others.
 >


Sean Conner <sean (a) conman.org>

It was thus said that the Great Andrew Kennedy once stated:
> 
> So the issue here is that the only way to opt out of being indexed is to
> contact each proxy maintainer and request that they make accommodations
> for you. That's fine with only 15 or so gemini servers, but not fair to
> proxy maintainers as gemini grows. It's also not enough to ask all proxies
> to use robots.txt, because there's nothing stopping someone from ignoring
> it either out of ignorance or in bad faith.

  There are other ways.  One way is to recognize a proxy server and block
any requests from it.  I think it would be easy to recognize one because of
all the requests from a single IP address (or block of IP addresses).  The
blocking can be at a firewall level, or the gemini server could recognize
the IP (or IP block) and close the connection or return an error.  That can
be done now.
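
  As a rough sketch of that last option in Python (the address range and
names here are made up for illustration):

    import ipaddress

    # Hypothetical block list of known web-proxy source addresses; keeping it
    # current is the hard part.  203.0.113.0/24 is just a documentation range.
    BLOCKED_NETS = [ipaddress.ip_network("203.0.113.0/24")]

    def is_blocked(remote_addr: str) -> bool:
        addr = ipaddress.ip_address(remote_addr)
        return any(addr in net for net in BLOCKED_NETS)

    # In the accept loop: if is_blocked(peer_ip), close the connection or
    # return an error response instead of serving the request.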

  A second one is to extend robots.txt to indicate proxying preference, or
some other file, but then there are multiple requests (or maybe
not---caching information could be included).  Heck, even a DNS record (like
a TXT RR with the contents "v=Gemini; proxy=no" with the TTL of the DNS
record being honored).  But that relies upon the good will of the proxy to
honor that data.
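
  A sketch of what honouring such a record might look like on the proxy
side, assuming the dnspython library and the example syntax above (nothing
here is standardised):

    import time
    import dns.exception
    import dns.resolver  # dnspython 2.x, assumed available

    _cache = {}  # hostname -> (fields, expiry), so the record's TTL is honoured

    def proxying_allowed(hostname: str) -> bool:
        # Based on the example record above: "v=Gemini; proxy=no"
        fields, expiry = _cache.get(hostname, ({}, 0.0))
        if time.time() >= expiry:
            fields = {}
            try:
                answer = dns.resolver.resolve(hostname, "TXT")
            except dns.exception.DNSException:
                return True  # no usable answer: assume proxying is allowed
            for rdata in answer:
                text = b"".join(rdata.strings).decode()
                if text.lower().startswith("v=gemini"):
                    for part in text.split(";"):
                        key, _, value = part.strip().partition("=")
                        fields[key.lower()] = value.lower()
            _cache[hostname] = (fields, time.time() + answer.rrset.ttl)
        return fields.get("proxy", "yes") != "no"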

  Or your idea of just asking could work just as well.

  -spc


Bradley D. Thornton <Bradley (a) NorthTech.US>



On 2/26/2020 11:29 AM, Sean Conner wrote:
> It was thus said that the Great Andrew Kennedy once stated:
>>
>> So the issue here is that the only way to opt out of being indexed is to
>> contact each proxy maintainer and request that they make accommodations
>> for you. That's fine with only 15 or so gemini servers, but not fair to
>> proxy maintainers as gemini grows. It's also not enough to ask all proxies
>> to use robots.txt, because there's nothing stopping someone from ignoring
>> it either out of ignorance or in bad faith.
> 
>   There are other ways.  One way is to recognize a proxy server and block
> any requests from it.

This is preferable to me, just blocking it at the firewall level, but it
does become administratively cumbersome as critical mass is achieved if a
curated list of proxies isn't available - if someone does maintain such a
list, it could just be popped into ipsets to keep the rulesets to a
minimum.
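
A sketch of that idea in Python, assuming a curated list of proxy
addresses/CIDRs in a text file and the ipset utility (the file and set
names are made up):

    import subprocess

    def load_proxy_ipset(list_path: str, set_name: str = "gemini-proxies") -> None:
        # Create the set if needed, then add each entry from the curated list
        # (one address or CIDR per line, '#' starts a comment).
        subprocess.run(["ipset", "-exist", "create", set_name, "hash:net"], check=True)
        with open(list_path) as f:
            for line in f:
                entry = line.split("#", 1)[0].strip()
                if entry:
                    subprocess.run(["ipset", "-exist", "add", set_name, entry], check=True)

    # A single firewall rule can then reference the whole set, e.g.
    # iptables -A INPUT -p tcp --dport 1965 -m set --match-set gemini-proxies src -j DROP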

I don't want ANYONE being able to access any of my Gemini servers via a
browser that doesn't support Gemini either natively or via a plug-in.
I've been quite vocal and adamant about this in the Gopher community for
well over a decade - to me, though apparently not to most folks, it defeats
the purpose of developing unique content in Gopher/Gemini space, and the
incentive to do so, if someone is simply going to access it via HTTP anyway.

The problem with this method is that, let's say, there's a GUS server
attempting to spider me on TCP 1965, but there's also some infernal HTTP
< > Gemini proxy trying to access content on my Gemini servers from the
same IP. I end up with an uncomfortable choice because I want to be
indexed by GUS, but I don't want to allow anyone to use the World Wide
Web to access my content.


>   A second one is to extend robots.txt to indicate proxying preference, or
> some other file, but then there are multiple requests (or maybe
> not---caching information could be included). 

Ah yes, in a perfect world Sean :)



> Heck, even a DNS record (like
> a TXT RR with the contents "v=Gemini; proxy=no" with the TTL of the DNS
> record being honored).  But that relies upon the good will of the proxy to
> honor that data.

Again, in a perfect world ;) Either of these solutions (a TXT RR and/or
utilizing robots.txt) would be ideal, sans the concerns about extra
traffic/requests.

Right now everyone, for the most part, is on this list, and the good
folks here are inclined to adhere to such friendly standards, but moving
forward as adoption builds like a snowball rolling down the mountain,
there will invariably be bad actors coming online.

One consideration worth mentioning is that, at least in my case, I tend
to have A and AAAA RRs point to a single host, and rely upon the
listening ports to determine which protocols are used to serve the
appropriate data. The way you suggested using the TXT RR would work fine
in this case, however :)

-- 
Bradley D. Thornton
Manager Network Services
http://NorthTech.US
TEL: +1.310.421.8268


Steve Ryan <stryan (a) saintnet.tech>

On 20/02/26 02:29PM, Sean Conner wrote:
>   A second one is to extend robots.txt to indicate proxying preference, or
> some other file, but then there are multiple requests (or maybe
> not---caching information could be included).  Heck, even a DNS record (like
> a TXT RR with the contents "v=Gemini; proxy=no" with the TTL of the DNS
> record being honored).  But that relies upon the good will of the proxy to
> honor that data.
> 
>   Or your idea of just asking could work just as well.

I'm of the opinion that either a robots.txt method or a TXT record will do
for preventing spiders/proxies. I feel that anything stronger than assuming
good faith will always lead to an arms race, and I'm not sure that, for this
protocol, the servers have any chance of winning a war against clients.

If something must be kept private from proxies or spiders, perhaps
requiring a client certificate might be for the best? I'm sure someone
cleverer than I could figure out a way to require human intervention in
creating a cert to access a page.
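
For example, a minimal sketch in Python of the server-side check (the exact
status line depends on the spec revision, and the TLS context setup that
actually requests a certificate is assumed, not shown):

    import ssl

    def handle(conn: ssl.SSLSocket) -> None:
        # The handshake only yields a peer certificate if the server's
        # SSLContext was configured to request one (verify_mode != CERT_NONE);
        # that setup is assumed to have happened elsewhere.
        if conn.getpeercert(binary_form=True) is None:
            conn.sendall(b"60 Client certificate required\r\n")
            return
        # ...serve the protected resource, perhaps only after a human has
        # approved this particular certificate out-of-band...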

-Steve


Sean Conner <sean (a) conman.org>

It was thus said that the Great Steve Ryan once stated:
> On 20/02/26 02:29PM, Sean Conner wrote:
> >   A second one is to extend robots.txt to indicate proxying preference, or
> > some other file, but then there are multiple requests (or maybe
> > not---caching information could be included).  Heck, even a DNS record (like
> > a TXT RR with the contents "v=Gemini; proxy=no" with the TTL of the DNS
> > record being honored).  But that relies upon the good will of the proxy to
> > honor that data.
> > 
> >   Or your idea of just asking could work just as well.
> 
> I'm of the opinion that either a robots.txt method or a TXT record will do
> for preventing spiders/proxies. I feel that anything stronger than assuming
> good faith will always lead to an arms race, and I'm not sure that, for this
> protocol, the servers have any chance of winning a war against clients.

  To that end, I have a TXT record for gemini.conman.org.  

	v=Gemini; proxy=no; webproxies=yes

	v=Gemini	- TXT record for Gemini

	proxy=no	- server does not support proxying requests
	proxy=yes	- server does support proxying requests

	webproxies=no	- please do not proxy this server via the web
	webproxies=yes	- web proxying is okay
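
  For what it's worth, a client or proxy could parse that into key/value
pairs with something like this (just a sketch of the syntax above, in
Python):

    def parse_gemini_txt(record: str) -> dict:
        # "v=Gemini; proxy=no; webproxies=yes" ->
        # {"v": "gemini", "proxy": "no", "webproxies": "yes"}
        fields = {}
        for part in record.split(";"):
            key, sep, value = part.strip().partition("=")
            if sep:
                fields[key.strip().lower()] = value.strip().lower()
        return fields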

  Discussion, questions, concerns, etc. welcome.

> If something must be kept private from proxies or spiders, perhaps
> requiring a client certificate might be for the best? I'm sure someone
> cleverer than I could figure out a way to require human intervention in
> creating a cert to access a page.

  It's fairly easy, and I do have two directories that require client
certificates:

	gemini://gemini.conman.org/private	- any client certificate
	gemini://gemini.conman.org/conman-labs-private - particular client certificates required

  -spc


solderpunk <solderpunk (a) SDF.ORG>

On Wed, Feb 26, 2020 at 07:54:35PM -0800, Bradley D. Thornton wrote:

> This is preferable to me, just blocking it at the firewall level, but
> does become administratively cumbersome as critical mass is acheived and
> a curated list of proxies isn't available - if someone does maintain
> such a list, it could  just be popped into ipsets to keep the rulesets
> to a minimum.

I am happy to add something to the Best Practices document regarding
HTTP proxies, which could include a polite request to inform me of
proxies and their IP addresses so I can maintain a master list
somewhere, as well as a strong admonition to serve a robots.txt which
prevents web crawlers from slurping up Gemini content.
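
On the HTTP side, such a robots.txt can be as blunt as:

    User-agent: *
    Disallow: /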
 
> I don't want ANYONE being able to access any of my Gemini servers via a
> browser that doesn't support Gemini either natively or via a plug-in.
> I've been quite vocal and adamant about this in the Gopher community for
> well over a decade - to me, though apparently not to most folks, it defeats
> the purpose of developing unique content in Gopher/Gemini space, and the
> incentive to do so, if someone is simply going to access it via HTTP anyway.

I understand this sentiment, but at the end of the day it's literally
impossible to prevent this.  It's part and parcel of serving digital
content to universal machines owned and operated by other people - you
lose all control over things like this.  As was posted previously,
attempts to regain control with things like DRM just turn into arms
races that make life harder for legitimate users.  I'm in favour of
leaving things at a straightforward "gentleman's agreement".

> The problem with this method is that, let's say, there's a GUS server
> attempting to spider me on TCP 1965, but there's also some infernal HTTP
> < > Gemini proxy trying to access content on my Gemini servers from the
> same IP. I end up with an uncomfortable choice because I want to be
> indexed by GUS, but I don't want to allow anyone to use the World Wide
> Web to access my content.
> 
> >   A second one is to extend robots.txt to indicate proxying preference, or
> > some other file, but then there are multiple requests (or maybe
> > not---caching information could be included). 
 
Extending robots.txt to do this seems fairly straightforward.  We could
introduce "pseudo user-agents" like "proxy/*", "indexer/*", etc. which
all user agents of a particular type should respect.
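
Illustratively (nothing here is standardised yet, and the names are only
the ones floated above):

    # /robots.txt served over Gemini
    User-agent: proxy/*
    Disallow: /

    User-agent: indexer/*
    Disallow:

An empty Disallow line means "nothing is disallowed", so a file like the
above would let indexers such as GUS crawl everything while asking web
proxies to stay away.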

Cheers,
Solderpunk

