WWW indexing concerns (was: Gemini Universal Search)

On Wed, Feb 26, 2020 at 07:54:35PM -0800, Bradley D. Thornton wrote:

> This is preferable to me, just blocking it at the firewall level, but
> does become administratively cumbersome as critical mass is achieved and
> a curated list of proxies isn't available - if someone does maintain
> such a list, it could just be popped into ipsets to keep the rulesets
> to a minimum.
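
For what it's worth, a rough sketch of that ipset approach, with a
made-up set name and a placeholder address, might look like:

  ipset create gemini-proxies hash:ip
  ipset add gemini-proxies 198.51.100.7   # known HTTP proxy (placeholder)
  iptables -A INPUT -p tcp --dport 1965 \
      -m set --match-set gemini-proxies src -j DROP

A single iptables rule then covers however many proxy addresses end up
in the set.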

I am happy to add something to the Best Practices document regarding
HTTP proxies, which could include a polite request to inform me of
proxies and their IP addresses so I can maintain a master list
somewhere, as well as a strong admonition to serve a robots.txt which
prevents web crawlers from slurping up Gemini content.
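
As a sketch of the robots.txt side of that: a proxy operator could
serve a blanket disallow over HTTP, something like

  User-agent: *
  Disallow: /

which asks well-behaved web crawlers to leave the proxied Gemini
content alone entirely.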
 
> I don't want ANYONE being able to access any of my Gemini servers via a
> browser that doesn't support Gemini either natively, or via a plug-in.
> I've been quite vocal and adamant about this in the Gopher community for
> well over a decade - to me, but not most folks apparently, it defeats
> the purpose of, and incentive to, develop unique content in
> Gopher/Gemini space, since someone is simply accessing it via HTTP anyway.

I understand this sentiment, but at the end of the day it's literally
impossible to prevent this.  It's part and parcel of serving digital
content to universal machines owned and operated by other people - you
lose all control over things like this.  As was posted previously,
attempts to regain control with things like DRM just turn into arms
races that make life harder for legitimate users.  I'm in favour of
leaving things at a straightforward "gentleman's agreement".

> The problem with this method is that, let's say, there's a GUS server
> attempting to spider me on TCP 1965, but there's also some infernal HTTP
> <-> Gemini proxy trying to access content on my Gemini servers from the
> same IP. I end up with an uncomfortable choice because I want to be
> indexed by GUS, but I don't want to allow anyone to use the World Wide
> Web to access my content.
> 
> >   A second one is to extend robots.txt to indicate proxying preference, or
> > some other file, but then there are multiple requests (or maybe
> > not---caching information could be included). 
 
Extending robots.txt to do this seems fairly straightforward.  We could
introduce "pseudo user-agents" like "proxy/*", "indexer/*", etc., which
every user agent of the corresponding type would be expected to respect.
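
As a very rough sketch (none of these names are settled), a Gemini
server that wants to be indexed but not proxied to the web might then
serve a robots.txt along the lines of

  User-agent: proxy/*
  Disallow: /

  User-agent: indexer/*
  Disallow:

where an empty Disallow means "everything is allowed", as in ordinary
robots.txt.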

Cheers,
Solderpunk
