
WWW indexing concerns (was: Gemini Universal Search)

solderpunk solderpunk at SDF.ORG

Thu Feb 27 20:02:57 GMT 2020

- - - - - - - - - - - - - - - - - - -

On Wed, Feb 26, 2020 at 07:54:35PM -0800, Bradley D. Thornton wrote:

> This is preferable to me, just blocking it at the firewall level, but
> does become administratively cumbersome as critical mass is achieved and
> a curated list of proxies isn't available - if someone does maintain
> such a list, it could just be popped into ipsets to keep the rulesets
> to a minimum.
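
(As a rough sketch of the ipset approach described above - the set name is made up, 203.0.113.10 is a placeholder documentation address, and the rule assumes Gemini is served on the default TCP port 1965:)

```
# create a set to hold the addresses of known HTTP-to-Gemini proxies
ipset create gemini-proxies hash:ip
# add a known proxy address to the set (placeholder address)
ipset add gemini-proxies 203.0.113.10
# drop connections to the Gemini port from any address in the set
iptables -A INPUT -p tcp --dport 1965 -m set --match-set gemini-proxies src -j DROP
```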

I am happy to add something to the Best Practices document regarding HTTP proxies, which could include a polite request to inform me of proxies and their IP addresses so I can maintain a master list somewhere, as well as a strong admonition to serve a robots.txt which prevents web crawlers from slurping up Gemini content.
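
A sketch of what such a robots.txt, served over HTTP by the proxy itself, might look like - this is just the standard catch-all asking every web crawler to stay away:

```
# /robots.txt served over HTTP by the proxy
User-agent: *
Disallow: /
```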
> I don't want ANYONE being able to access any of my Gemini servers via a
> browser that doesn't support Gemini either natively, or via a plug-in.
> I've been quite vocal and adamant about this in the Gopher community for
> well over a decade - to me, but not most folks apparently, it defeats
> the purpose of, and incentive to, develop unique content in
> Gopher/Gemini space, since someone is simply accessing it via HTTP anyway.

I understand this sentiment, but at the end of the day it's literally impossible to prevent this.  It's part and parcel of serving digital content to universal machines owned and operated by other people - you lose all control over things like this.  As was posted previously, attempts to regain control with things like DRM just turn into arms races that make life harder for legitimate users.  I'm in favour of leaving things at a straightforward "gentleman's agreement".

> The problem with this method is that, let's say, there's a GUS server
> attempting to spider me on TCP 1965, but there's also some infernal HTTP
> <-> Gemini proxy trying to access content on my Gemini servers from the
> same IP. I end up with an uncomfortable choice because I want to be
> indexed by GUS, but I don't want to allow anyone to use the World Wide
> Web to access my content.
> 
> 
>   A second one is to extend robots.txt to indicate proxying preference, or
> some other file, but then there are multiple requests (or maybe
> not---caching information could be included).

Extending robots.txt to do this seems fairly straightforward.  We could
introduce "pseudo user-agents" like "proxy/*", "indexer/*", etc. which
all user agents of a particular type should respect.
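
As a sketch of how that might look, a capsule which wants to be crawled by indexers like GUS but not served to the web could publish a robots.txt along these lines (the pseudo user-agent names follow the proposal above; the capsule and paths are hypothetical):

```
# robots.txt at the root of a capsule (hypothetical)
# refuse all HTTP-to-Gemini proxies
User-agent: proxy/*
Disallow: /

# let indexers crawl everything (an empty Disallow means allow all)
User-agent: indexer/*
Disallow:
```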

Cheers,
Solderpunk