Very cool! I want to express two points.

One: I am glad to see this. As geminispace gets larger, a search method
that doesn't rely on personal bookmarks will be really useful. I forget to
bookmark gopher documents all the time, and Veronica really helps there.

Two: as for crawling in general, this is something I've been thinking about
lately, and I guess now it's timely. HTTP/S gemini proxies allow all of the
public geminispace to be indexed by Google and other services. I think that
we generally consider the two current proxies to be useful (I certainly
do), so to a point this is unavoidable. proxy.vulpes.one has a robots.txt
to prevent WWW crawling, which is a fair stopgap. portal.mozz.us does not,
but I'm not angry about it. I meant to send an e-mail to mozz to ask their
opinion, but just haven't gotten around to it.

My issue here is that the only two ways to opt out of Google indexing are
to use a robots.txt, or to register yourself as the domain owner and
control indexing via Google's console. Both of those methods apply to the
proxy website's owner, not to the gemini server's owner, because each
gemini server appears as a subdirectory on the proxy's domain.

So the issue here is that the only way to opt out of being indexed is to
contact each proxy maintainer and request that they make accommodations
for you. That's fine with only 15 or so gemini servers, but not fair to
proxy maintainers as gemini grows. It's also not enough to ask all proxies
to use robots.txt, because there's nothing stopping someone from ignoring
it either out of ignorance or in bad faith.

Perhaps there isn't much that can be done here, and this e-mail is little
more than me venting a concern. I realize that the only way to stay
anonymous on the internet is to constantly maintain your anonymity. I'm
doing that in some places. But my gemini server's domain name is already
tied to my IRL identity, and I wish it were at least harder for my gemini
files to end up on the first page of a Google result.

This e-mail got a little long, and it's less formal than some of the other
discussions. Sorry if it's bothersome. I'm not really a programmer, so I
can't offer any solutions. But I wanted to throw this conversation into the
general dialogue in case anyone has thoughts or ideas here. Gemini is *not*
the Web, after all.

- m68k

---- On Wed, 26 Feb 2020 07:00:02 -0500 <gemini-request at lists.orbitalfox.eu> wrote ----

> One technical question is the issue of how server admins can opt out
> of having their stuff crawled. GUS currently recognises a /robots.txt
> resource with (I presume) identical syntax to that used for HTTP.
> This is certainly one potential solution to the problem (and perhaps
> the most sensible one), but we might want to consider others.
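For reference, the robots exclusion file a proxy operator can serve on the
web side (as proxy.vulpes.one does) is just the ordinary HTTP robots.txt. A
minimal sketch that asks every compliant web crawler to stay out of the
proxied content might look like the following; the hostname in the comment
is only a placeholder:

```
# Served over HTTP by the proxy, e.g. at https://proxy.example/robots.txt
# Asks all compliant web crawlers not to index any proxied Gemini content.
User-agent: *
Disallow: /
```

As noted above, this only binds well-behaved crawlers and has to be
deployed by each proxy operator, not by the gemini server's owner.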
It was thus said that the Great Andrew Kennedy once stated:
>
> So the issue here is that the only way to opt out of being indexed is to
> contact each proxy maintainer and request that they make accommodations
> for you. That's fine with only 15 or so gemini servers, but not fair to
> proxy maintainers as gemini grows. It's also not enough to ask all proxies
> to use robots.txt, because there's nothing stopping someone from ignoring
> it either out of ignorance or in bad faith.

  There are other ways. One way is to recognize a proxy server and block
any requests from it. I think it would be easy to recognize one because of
all the requests from a single IP address (or block of IP addresses). The
blocking can be at a firewall level, or the gemini server could recognize
the IP (or IP block) and close the connection or return an error. That can
be done now.

  A second one is to extend robots.txt to indicate proxying preference, or
some other file, but then there are multiple requests (or maybe
not---caching information could be included). Heck, even a DNS record (like
a TXT RR with the contents "v=Gemini; proxy=no" with the TTL of the DNS
record being honored). But that relies upon the good will of the proxy to
honor that data.

  Or your idea of just asking could work just as well.

  -spc
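To make the firewall option concrete, here is a sketch of the kind of rule
an admin could add today once a proxy's address is known. The address below
is a placeholder from the documentation range, not a real proxy:

```
# Drop all Gemini (TCP 1965) traffic from a known HTTP-to-Gemini proxy.
# 203.0.113.10 is an illustrative address only.
iptables -A INPUT -s 203.0.113.10 -p tcp --dport 1965 -j DROP
```

The same effect can be had inside the gemini server itself by closing the
connection or returning an error when the peer address matches a known
proxy, as described above.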
On 2/26/2020 11:29 AM, Sean Conner wrote:
> It was thus said that the Great Andrew Kennedy once stated:
>>
>> So the issue here is that the only way to opt out of being indexed is to
>> contact each proxy maintainer and request that they make accommodations
>> for you. That's fine with only 15 or so gemini servers, but not fair to
>> proxy maintainers as gemini grows. It's also not enough to ask all proxies
>> to use robots.txt, because there's nothing stopping someone from ignoring
>> it either out of ignorance or in bad faith.
>
> There are other ways. One way is to recognize a proxy server and block
> any requests from it.

This is preferable to me, just blocking it at the firewall level, but
does become administratively cumbersome as critical mass is achieved and
a curated list of proxies isn't available - if someone does maintain
such a list, it could just be popped into ipsets to keep the rulesets
to a minimum.

I don't want ANYONE being able to access any of my Gemini servers via a
browser that doesn't support Gemini either natively, or via a plug-in.
I've been quite vocal and adamant about this in the Gopher community for
well over a decade - to me, but not most folks apparently, it defeats
the purpose of, and incentive to, develop unique content in
Gopher/Gemini space, since someone is simply accessing it via HTTP anyway.

The problem with this method is that, let's say, there's a GUS server
attempting to spider me on TCP 1965, but there's also some infernal
HTTP <-> Gemini proxy trying to access content on my Gemini servers from
the same IP. I end up with an uncomfortable choice because I want to be
indexed by GUS, but I don't want to allow anyone to use the World Wide
Web to access my content.

> A second one is to extend robots.txt to indicate proxying preference, or
> some other file, but then there are multiple requests (or maybe
> not---caching information could be included).

Ah yes, in a perfect world, Sean :)

> Heck, even a DNS record (like
> a TXT RR with the contents "v=Gemini; proxy=no" with the TTL of the DNS
> record being honored). But that relies upon the good will of the proxy to
> honor that data.

Again, in a perfect world ;)

Either of these solutions (a TXT RR and/or utilizing robots.txt) would be
ideal, sans the concerns about extra traffic/requests. Right now everyone,
for the most part, is on this list, and the good folks here are inclined
to adhere to such friendly standards, but moving forward, as adoption
builds like a snowball rolling down the mountain, there will invariably be
bad actors coming online.

One consideration worth mentioning is that, at least in my case, I tend to
have A and AAAA RRs point to a single host, and rely upon the listening
ports to determine which protocols are used to serve the appropriate data.
The way you suggested using the TXT RR would work fine in this case,
however :)

-- 
Bradley D. Thornton
Manager Network Services
http://NorthTech.US
TEL: +1.310.421.8268
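A sketch of the ipset approach mentioned above, assuming such a curated
list of proxies existed; the addresses are placeholders from the
documentation ranges:

```
# Keep proxy addresses in one set so the ruleset stays at a single rule.
ipset create gemini-proxies hash:ip
ipset add gemini-proxies 203.0.113.10
ipset add gemini-proxies 198.51.100.22

# One rule matches the whole set on the Gemini port.
iptables -A INPUT -p tcp --dport 1965 \
  -m set --match-set gemini-proxies src -j DROP
```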
On 20/02/26 02:29PM, Sean Conner wrote:
> A second one is to extend robots.txt to indicate proxying preference, or
> some other file, but then there are multiple requests (or maybe
> not---caching information could be included). Heck, even a DNS record (like
> a TXT RR with the contents "v=Gemini; proxy=no" with the TTL of the DNS
> record being honored). But that relies upon the good will of the proxy to
> honor that data.
>
> Or your idea of just asking could work just as well.

I'm of the opinion that either a robots.txt method or a TXT record will do
for preventing spiders/proxies. I feel that anything stronger than assuming
good faith will always lead to an arms race, and I'm not sure that, for
this protocol, the servers have any chance of winning a war against
clients.

If something must be kept private from proxies or spiders, perhaps
requiring a client certificate might be for the best? I'm sure someone
cleverer than I could figure out a way to require human intervention in
creating a cert to access a page.

-Steve
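For anyone who wants to experiment with the client-certificate idea, one
way a person can mint a self-signed client certificate by hand is sketched
below. The file names and common name are arbitrary; which certificates a
server then accepts is entirely up to that server:

```
# Generate a self-signed client certificate and private key, valid one year.
openssl req -x509 -newkey rsa:2048 -nodes \
  -keyout client.key -out client.crt \
  -days 365 -subj "/CN=my-gemini-identity"
```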
It was thus said that the Great Steve Ryan once stated:
> On 20/02/26 02:29PM, Sean Conner wrote:
> > A second one is to extend robots.txt to indicate proxying preference, or
> > some other file, but then there are multiple requests (or maybe
> > not---caching information could be included). Heck, even a DNS record (like
> > a TXT RR with the contents "v=Gemini; proxy=no" with the TTL of the DNS
> > record being honored). But that relies upon the good will of the proxy to
> > honor that data.
> >
> > Or your idea of just asking could work just as well.
>
> I'm of the opinion that either a robots.txt method or a TXT record will do
> for preventing spiders/proxies. I feel that anything stronger than assuming
> good faith will always lead to an arms race, and I'm not sure that, for
> this protocol, the servers have any chance of winning a war against
> clients.

  To that end, I have a TXT record for gemini.conman.org:

	v=Gemini; proxy=no; webproxies=yes

	v=Gemini       - TXT record for Gemini
	proxy=no       - server does not support proxying requests
	proxy=yes      - server does support proxying requests
	webproxies=no  - please do not proxy this server via the web
	webproxies=yes - web proxying is okay

  Discussion, questions, concerns, etc. welcome.

> If something must be kept private from proxies or spiders, perhaps
> requiring a client certificate might be for the best? I'm sure someone
> cleverer than I could figure out a way to require human intervention in
> creating a cert to access a page.

  It's fairly easy, and I do have two directories that require client
certificates:

	gemini://gemini.conman.org/private - any client certificate
	gemini://gemini.conman.org/conman-labs-private - particular client
	  certificates required

  -spc
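A sketch of how a record like the one described above might be published in
an ordinary zone file, and how a proxy could query it; the 3600-second TTL
is only an example:

```
; Zone-file form of the record described above (TTL is illustrative).
gemini.conman.org.  3600  IN  TXT  "v=Gemini; proxy=no; webproxies=yes"

; A proxy could look up the policy (and honor the TTL) with, e.g.:
;   dig +short TXT gemini.conman.org
```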
On Wed, Feb 26, 2020 at 07:54:35PM -0800, Bradley D. Thornton wrote:
> This is preferable to me, just blocking it at the firewall level, but
> does become administratively cumbersome as critical mass is achieved and
> a curated list of proxies isn't available - if someone does maintain
> such a list, it could just be popped into ipsets to keep the rulesets
> to a minimum.

I am happy to add something to the Best Practices document regarding HTTP
proxies, which could include a polite request to inform me of proxies and
their IP addresses so I can maintain a master list somewhere, as well as a
strong admonition to serve a robots.txt which prevents web crawlers from
slurping up Gemini content.

> I don't want ANYONE being able to access any of my Gemini servers via a
> browser that doesn't support Gemini either natively, or via a plug-in.
> I've been quite vocal and adamant about this in the Gopher community for
> well over a decade - to me, but not most folks apparently, it defeats
> the purpose of, and incentive to, develop unique content in
> Gopher/Gemini space, since someone is simply accessing it via HTTP anyway.

I understand this sentiment, but at the end of the day it's literally
impossible to prevent this. It's part and parcel of serving digital content
to universal machines owned and operated by other people - you lose all
control over things like this. As was posted previously, attempts to regain
control with things like DRM just turn into arms races that make life
harder for legitimate users. I'm in favour of leaving things at a
straightforward "gentleman's agreement".

> The problem with this method is that, let's say, there's a GUS server
> attempting to spider me on TCP 1965, but there's also some infernal
> HTTP <-> Gemini proxy trying to access content on my Gemini servers from
> the same IP. I end up with an uncomfortable choice because I want to be
> indexed by GUS, but I don't want to allow anyone to use the World Wide
> Web to access my content.
>
> > A second one is to extend robots.txt to indicate proxying preference, or
> > some other file, but then there are multiple requests (or maybe
> > not---caching information could be included).

Extending robots.txt to do this seems fairly straightforward. We could
introduce "pseudo user-agents" like "proxy/*", "indexer/*", etc., which all
user agents of a particular type should respect.

Cheers,
Solderpunk
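To make the pseudo user-agent idea concrete, a gemini server's /robots.txt
might then read something like the following. The agent names follow the
proposal above; the paths are only examples:

```
# Hypothetical /robots.txt using the proposed pseudo user-agents.

# Web proxies: please do not serve this content over HTTP at all.
User-agent: proxy/*
Disallow: /

# Search-engine spiders such as GUS: index everything except a private area.
User-agent: indexer/*
Disallow: /private/
```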