Archived View for gemi.dev › gemini-mailing-list › 000531.gmi captured on 2023-12-28 at 15:48:46. Gemini links have been rewritten to link to archived content
-=-=-=-=-=-=-
I just developed a simple crawler for Gemini. Its goal is not to build another search engine but to perform some surveys of the geminispace. A typical result will be something like (real data, but limited in size):

gemini://gemini.bortzmeyer.org/software/crawler/

Currently, I did not yet let it loose on the Internet, because there are some questions I have.

Is it "good practice" to follow robots.txt? There is no mention of it in the specification but it could work for Gemini as well as for the Web and I notice that some programs query this name on my server.

Since Gemini (and rightly so) has no User-Agent, how can a bot advertise its policy and a point of contact?
On Tue, 8 Dec 2020 14:36:56 +0100 Stephane Bortzmeyer <stephane at sources.org>:

> I just developed a simple crawler for Gemini. Its goal is not to build
> another search engine but to perform some surveys of the
> geminispace. A typical result will be something like (real data, but
> limited in size):
>
> gemini://gemini.bortzmeyer.org/software/crawler/
>
> Currently, I did not yet let it loose on the Internet, because there
> are some questions I have.
>
> Is it "good practice" to follow robots.txt? There is no mention of it
> in the specification but it could work for Gemini as well as for the
> Web and I notice that some programs query this name on my server.
>
> Since Gemini (and rightly so) has no User-Agent, how can a bot
> advertise its policy and a point of contact?

depending on what you try, you may add your contact info in the query. On first contact with a new server, before you start crawling, you could request gemini://hostname/CRAWLER_FROM_SOMEONE_AT_HOST_DOT_COM

This is what I do for a gopher connectivity check. I have to admit it's a really poor solution, but I didn't find a better way.
> On Dec 8, 2020, at 14:58, Solene Rapenne <solene at perso.pw> wrote:
>
> depending on what you try, you may add your contact info
> in the query.

Like so?

gemini://example.com/robots.txt?user-agent=DetailsOfTheRobot ?

Previously on the list: gemini://gemi.dev/gemini-mailing-list/messages/000511.gmi
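To make the query-string idea above concrete, here is a minimal sketch of a crawler announcing a point of contact in its robots.txt request. It is only an illustration: the host, port and contact string are placeholders, and the relaxed certificate handling merely reflects Gemini's common trust-on-first-use practice, not a recommendation from this thread.

```python
import socket
import ssl

def fetch_robots_with_contact(host, contact="bot-operator%40example.org", port=1965):
    """Request /robots.txt and advertise a contact address in the query string."""
    url = f"gemini://{host}/robots.txt?user-agent={contact}"
    context = ssl.create_default_context()
    # Many Gemini servers use self-signed certificates (TOFU), so this sketch
    # skips CA verification entirely.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    with socket.create_connection((host, port)) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            tls.sendall((url + "\r\n").encode("utf-8"))
            response = b""
            while chunk := tls.recv(4096):
                response += chunk
    header, _, body = response.partition(b"\r\n")
    return header.decode("utf-8"), body.decode("utf-8", errors="replace")

# Example (hypothetical host): fetch_robots_with_contact("example.com")
```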
Yes, you should respect robots.txt in my opinion. It's not compulsory, but it's currently the best way we have to respect servers' wishes and bandwidth constraints. There is even a companion spec for doing so, which accompanies the main Gemini spec.

gemini://gemini.circumlunar.space/docs/companion/robots.gmi

Read the companion spec for more detail, but you're indeed correct that bots don't advertise who they are since there's no user-agent. Instead, we have some agreed-upon crawler categories, like `researcher`, `indexer`, `archiver`. It sounds like you may want to respect `researcher` and call it a day :)

Nat

On Tue, Dec 08, 2020 at 02:36:56PM +0100, Stephane Bortzmeyer wrote:
> I just developed a simple crawler for Gemini. Its goal is not to build
> another search engine but to perform some surveys of the
> geminispace. A typical result will be something like (real data, but
> limited in size):
>
> gemini://gemini.bortzmeyer.org/software/crawler/
>
> Currently, I did not yet let it loose on the Internet, because there
> are some questions I have.
>
> Is it "good practice" to follow robots.txt? There is no mention of it
> in the specification but it could work for Gemini as well as for the
> Web and I notice that some programs query this name on my server.
>
> Since Gemini (and rightly so) has no User-Agent, how can a bot
> advertise its policy and a point of contact?
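For illustration, a crawler that wants to honour the companion spec's virtual agents can feed the robots.txt text it fetched over Gemini into an off-the-shelf parser and query it with the virtual agent name. This is only a sketch: the agent names come from the companion spec, the URL is made up, and Python's stdlib parser hard-codes one particular interpretation of the matching rules, which is exactly the kind of ambiguity debated later in this thread.

```python
from urllib.robotparser import RobotFileParser

def allowed_for(robots_text: str, virtual_agent: str, url: str) -> bool:
    """Check fetched robots.txt text against a virtual user agent
    such as "researcher", "indexer" or "archiver"."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(virtual_agent, url)

robots = "User-agent: archiver\nDisallow: /\n"
print(allowed_for(robots, "researcher", "gemini://example.com/log.gmi"))  # True
print(allowed_for(robots, "archiver", "gemini://example.com/log.gmi"))    # False
```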
On Tue, Dec 08, 2020 at 09:47:57AM -0500, Natalie Pendragon <natpen at natpen.net> wrote a message of 32 lines which said:

> Yes, you should respect robots.txt in my opinion. It's not compulsory,
> but it's currently the best way we have to respect servers' wishes and
> bandwidth constraints.

It is interesting to note that some robots.txt are quite broken (see gemini://houston.coder.town/robots.txt).
On Tue, Dec 08, 2020 at 09:47:57AM -0500, Natalie Pendragon <natpen at natpen.net> wrote a message of 32 lines which said:

> Yes, you should respect robots.txt in my opinion. It's not compulsory,
> but it's currently the best way we have to respect servers' wishes and
> bandwidth constraints. There is even a companion spec for doing so,
> which accompanies the main Gemini spec.
>
> gemini://gemini.circumlunar.space/docs/companion/robots.gmi

The spec is quite vague about the *order* of directives. For instance, <gemini://gempaper.strangled.net/robots.txt> is:

User-agent: *
Disallow: /credentials.txt
User-agent: archiver
Disallow: /

The intended semantics is probably to disallow archivers, but my parser regarded the site as available because it stopped at the first match, the star. Who is right? <gemini://gemini.circumlunar.space/docs/companion/robots.gmi> and <http://www.robotstxt.org/robotstxt.html> are unclear.
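To make the ambiguity concrete, here is a small sketch (not from the thread) showing that the two readings give different answers for the file quoted above when the crawler identifies as "archiver":

```python
# (user-agent, disallowed prefix) pairs from gempaper.strangled.net/robots.txt
RULES = [("*", "/credentials.txt"), ("archiver", "/")]

def first_match_only(agent, path):
    # Reading 1: stop at the first User-agent group that applies to you.
    for ua, disallowed in RULES:
        if ua == "*" or ua == agent:
            return not path.startswith(disallowed)
    return True

def all_matching_groups(agent, path):
    # Reading 2: apply every group whose User-agent applies to you.
    for ua, disallowed in RULES:
        if (ua == "*" or ua == agent) and path.startswith(disallowed):
            return False
    return True

print(first_match_only("archiver", "/log.gmi"))     # True: the site looks open
print(all_matching_groups("archiver", "/log.gmi"))  # False: the archiver is barred
```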
> On Dec 10, 2020, at 14:43, Stephane Bortzmeyer <stephane at sources.org> wrote:
>
> The spec is quite vague about the *order* of directives.

Perhaps of interest:

"While by standard implementation the first matching robots.txt pattern always wins, Google's implementation differs in that Allow patterns with equal or more characters in the directive path win over a matching Disallow pattern. Bing uses the Allow or Disallow directive which is the most specific. In order to be compatible to all robots, if one wants to allow single files inside an otherwise disallowed directory, it is necessary to place the Allow directive(s) first, followed by the Disallow."

http://en.wikipedia.org/wiki/Robots_exclusion_standard

Also: https://developers.google.com/search/reference/robots_txt
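The longest-match rule described in that excerpt is easy to state in code. A sketch with illustrative rules of my own, mirroring the Google/Bing behaviour quoted above rather than the older first-match reading:

```python
# ("allow" | "disallow", path prefix); the most specific (longest) matching
# prefix wins, and Allow beats Disallow when the lengths tie.
RULES = [("allow", "/docs/public/"), ("disallow", "/docs/")]

def longest_match_allows(path: str) -> bool:
    matches = [(len(prefix), kind) for kind, prefix in RULES if path.startswith(prefix)]
    if not matches:
        return True  # nothing matches: allowed by default
    _, kind = max(matches, key=lambda m: (m[0], m[1] == "allow"))
    return kind == "allow"

print(longest_match_allows("/docs/public/spec.gmi"))  # True
print(longest_match_allows("/docs/private.gmi"))      # False
```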
I don't see anything in the spec saying to stop at the first match. I think you should read the whole response and apply all lines that match your virtual user agent. So in this case, for an archiver, all lines.
On Thu, Dec 10, 2020 at 03:00:28PM +0100, Petite Abeille <petite.abeille at gmail.com> wrote a message of 24 lines which said:

> Perhaps of interest:

Not exactly the same thing since my email was about order of User-Agent (when there is both "*" and "archiver") but, yes, Robot Exclusion Standard is a mess.

> In order to be compatible to all robots, if one wants to allow
> single files inside an otherwise disallowed directory, it is
> necessary to place the Allow directive(s) first, followed by the
> Disallow.

Note that Allow is not even standard. <http://www.robotstxt.org/robotstxt.html>: there is no "Allow" field.
On Thu, Dec 10, 2020 at 03:07:50PM +0100, Côme Chilliet <come at chilliet.eu> wrote a message of 4 lines which said:

> I don't see anything in the spec saying to stop at the first match. I think you should read the whole response and apply all lines that match your virtual user agent.
> So in this case, for an archiver, all lines.

Then this example in <http://www.robotstxt.org/robotstxt.html> would not work:

User-agent: Google
Disallow:

User-agent: *
Disallow: /

Because with your algorithm, Google would be disallowed (while the comment in the page says "To allow a single robot [Google]").
> On Dec 10, 2020, at 15:09, Stephane Bortzmeyer <stephane at sources.org> wrote:
>
> Not exactly the same thing since my email was about order of
> User-Agent (when there is both "*" and "archiver") but, yes, Robot
> Exclusion Standard is a mess.

Right. Perhaps best to look at how things are actually implemented in the wild :)

Given your example -and a user agent of archiver- the robots.txt Validator and Testing Tool at https://technicalseo.com/tools/robots-txt/ says Disallow.
> On Dec 10, 2020, at 15:15, Stephane Bortzmeyer <stephane at sources.org> wrote:
>
> Because with your algorithm, Google would be disallowed

An empty Disallow means allow all, no?
On Thu, 10 Dec 2020 14:43:11 +0100 Stephane Bortzmeyer <stephane at sources.org> wrote:

> The spec is quite vague about the *order* of directives. For instance,
> <gemini://gempaper.strangled.net/robots.txt> is:
>
> User-agent: *
> Disallow: /credentials.txt
> User-agent: archiver
> Disallow: /
>
> The intended semantics is probably to disallow archivers but my parser
> regarded the site as available because it stopped at the first match,
> the star. Who is right?

According to the spec, lines beginning with "User-agent:" indicate a user agent to which subsequent lines apply.

My interpretation is that your example expresses:

For any user agent: disallow access to /credentials.txt
For archiver user agents: disallow access to /

It is unclear from that specification alone whether the user agent applies to all Disallow lines after it, or only until the next User-agent line.

The spec refers to the web standard for robot exclusion. In the web standard, you can think of 1+ User-agent lines followed by 1+ Allow/Disallow lines as a single record which specifies that all the user agents should follow the Allow/Disallow rules that follow them. For example:

User-agent: archiver
User-agent: search engine
Disallow: /articles
Disallow: /uploads

User-agent: something
User-agent: someother
Disallow: /whatever

This expresses:

For user agent archiver or search engine: disallow access to /articles, disallow access to /uploads
For user agent something or someother: disallow access to /whatever

Refer to: http://www.robotstxt.org/norobots-rfc.txt

The way "robots.txt for Gemini" specifies it is rather confusing. It's not indicated exactly how it differs from the web robot exclusion standard, and taken alone it is ambiguous.

--
Philip
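As an illustration of the record model Philip describes, a minimal parser can group consecutive User-agent lines with the Disallow lines that follow them. This is only a sketch under simple assumptions (plain prefix rules, no Allow, no wildcards):

```python
def parse_records(robots_text: str):
    """Group consecutive User-agent lines with the Disallow lines that follow them."""
    records, agents, rules, expecting_agents = [], [], [], True
    for raw in robots_text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        if field.lower() == "user-agent":
            if not expecting_agents and agents:   # a new record starts here
                records.append((agents, rules))
                agents, rules = [], []
            agents.append(value.lower())
            expecting_agents = True
        elif field.lower() == "disallow":
            rules.append(value)
            expecting_agents = False
    if agents:
        records.append((agents, rules))
    return records

example = """User-agent: archiver
User-agent: search engine
Disallow: /articles
Disallow: /uploads

User-agent: something
Disallow: /whatever
"""
print(parse_records(example))
# [(['archiver', 'search engine'], ['/articles', '/uploads']), (['something'], ['/whatever'])]
```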
On Thu, Dec 10, 2020 at 03:20:18PM +0100, Petite Abeille <petite.abeille at gmail.com> wrote a message of 8 lines which said:

> > On Dec 10, 2020, at 15:15, Stephane Bortzmeyer <stephane at sources.org> wrote:
> >
> > Because with your algorithm, Google would be disallowed
>
> An empty Disallow means allow all, no?

Yes, but the next Disallow adds a restriction to everything.
On Thu, Dec 10, 2020 at 02:43:11PM +0100, Stephane Bortzmeyer <stephane at sources.org> wrote a message of 26 lines which said:

> The spec is quite vague about the *order* of directives.

Another example of the fact that you cannot rely on robots.txt: regexps. The official site <http://www.robotstxt.org/robotstxt.html> is crystal-clear: "Note also that globbing and regular expression are not supported in either the User-agent or Disallow lines". But in the wild you find things like <gemini://drewdevault.com/robots.txt>:

User-Agent: gus
Disallow: /cgi-bin/web.sh?*

Opinion: maybe we should specify a syntax for Gemini's robots.txt, not relying on the broken Web one?
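A crawler can still be defensive about such non-standard patterns without much code. A sketch, with the caveat that treating `*` and `?` as shell-style wildcards is my own assumption, since the "standard" defines neither:

```python
import fnmatch

def path_blocked(path: str, pattern: str) -> bool:
    # Standard robots.txt semantics: a Disallow value is a plain path prefix.
    if not any(ch in pattern for ch in "*?"):
        return path.startswith(pattern)
    # Non-standard wildcard patterns seen in the wild: fall back to shell-style
    # matching, also treating the pattern as a prefix of longer paths.
    return fnmatch.fnmatchcase(path, pattern) or fnmatch.fnmatchcase(path, pattern + "*")

print(path_blocked("/cgi-bin/web.sh?page=1", "/cgi-bin/web.sh?*"))  # True
print(path_blocked("/gemlog/2020.gmi", "/cgi-bin/web.sh?*"))        # False
```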
> On Dec 10, 2020, at 17:24, Stephane Bortzmeyer <stephane at sources.org> wrote:
>
> Yes, but the next Disallow adds a restriction to everything.

User-agent: Google
Disallow:

User-agent: *
Disallow: /

mwahahaha. a work of beauty indeed.

this would seem to read as disallow everything, for everyone, but google. go figure.
On Thu, Dec 10, 2020 at 03:18:31PM +0100, Petite Abeille <petite.abeille at gmail.com> wrote a message of 16 lines which said:

> Perhaps best to look at how things are actually implemented in the
> wild :)

I have a big disagreement with this approach. As a matter of principle (this approach allows big actors to set de facto standards, forcing the others into a race, which is precisely what made the Web the bloated horror it is), and also because you cannot check every possible implementation, and, anyway, they disagree among themselves. So, no, I want a clear specification of what a crawler is supposed to do.

> Given your example -and a user agent of archiver- robots.txt
> Validator and Testing Tool at
> https://technicalseo.com/tools/robots-txt/ says Disallow.

I was not able to make it work. It keeps telling me that http://t.example/foo is an "Invalid URL", and I find no way to enter an arbitrary User-Agent. And, anyway, it would not be an official test, just one implementation with some proprietary extensions.
> On Dec 10, 2020, at 17:42, Stephane Bortzmeyer <stephane at sources.org> wrote:
>
> So, no, I want a clear specification of what a crawler is supposed to do.

Admirable :D

This is perhaps why academics should not write software. Practically speaking :P

Let's face it... there is none... it's all a gigantic sausage machine... haphazardly scotch-taped together... barely holding up, if at all...

That said, circa 2004, this is what Mark Pilgrim had to say about the matter:

Why Specs Matter
https://pwn.ersh.io/notes/why_specs_matter/

Here is a copy for your enjoyment. It never gets old:

Most developers are morons, and the rest are assholes. I have at various times counted myself in both groups, so I can say this with the utmost confidence.

Assholes

Assholes read specs with a fine-toothed comb, looking for loopholes, oversights, or simple typos. Then they write code that is meticulously spec-compliant, but useless. If someone yells at them for writing useless software, they smugly point to the sentence in the spec that clearly spells out how their horribly broken software is technically correct, and then they crow about it on their blogs.

There is a faction of assholes that write test cases. These people are good to have around while writing a spec, because they can occasionally be managed into channeling their infinite time and energy into finding loopholes before the spec is final. Unfortunately, managing assholes is even harder and more time-consuming than it sounds. This is why writing good specs takes so long: most of the time is frittered away on asshole management.

Morons

Morons, on the other hand, don't read specs until someone yells at them. Instead, they take a few examples that they find "in the wild" and write code that seems to work based on their limited sample. Soon after they ship, they inevitably get yelled at because their product is nowhere near conforming to the part of the spec that someone else happens to be using. Someone points them to the sentence in the spec that clearly spells out how horribly broken their software is, and they fix it.

Besides the run-of-the-mill morons, there are two factions of morons that are worth special mention. The first work from examples, and ship code, and get yelled at, just like all the other morons. But then when they finally bother to read the spec, they magically turn into assholes and argue that the spec is ambiguous, or misleading in some way, or ignorable because nobody else implements it, or simply wrong. These people are called sociopaths. They will never write conformant code regardless of how good the spec is, so they can safely be ignored.

The second faction of morons work from examples, ship code, and get yelled at. But when they get around to reading the spec, they magically turn into advocates and write up tutorials on what they learned from their mistakes. These people are called experts. Virtually every useful tutorial in the world was written by a moron-turned-expert.

Angels

Some people would argue that not all developers are morons or assholes, but they are mistaken. For example, some people posit the existence of what I will call the "angel" developer. "Angels" read specs closely, write code, and then thoroughly test it against the accompanying test suite before shipping their product. Angels do not actually exist, but they are a useful fiction to make spec writers feel better about themselves.

Why specs matter

If your spec isn't good enough, morons have no chance of ever getting things right.
For everyone who complains that their software is broken, there will be two assholes who claim that it's not. The spec, whose primary purpose is to arbitrate disputes between morons and assholes, will fail to resolve anything, and the arguments will smolder for years.

If your spec is good enough, morons have a fighting chance of getting things right the second time around, without being besieged by assholes. Meanwhile, the assholes who have nothing better to do than look for loopholes won't find any, and they'll eventually get bored and wander off in search of someone else to harass.
On 12/10/20, Stephane Bortzmeyer <stephane at sources.org> wrote:
> Opinion: maybe we should specify a syntax for Gemini's robots.txt,
> not relying on the broken Web one?

Here it is: 'bots.txt' for gemini bots and crawlers.

- know who you are: archiver, indexer, feed-reader, researcher etc.
- ask for /bots.txt
- if 20 text/plain then
-- allowed = set()
-- denied = set()
-- split response by newlines, for each line
--- split by spaces and tabs into fields
---- paths = fields[0] split by ','
---- if fields[2] is "allowed" and you in fields[1] split by ',' then allowed = allowed union paths
----- if fields[3] is "but" and fields[5] is "denied" and you in fields[4] split by ',' then denied = denied union paths
---- if fields[2] is "denied" and you in fields[1] split by ',' then denied = denied union paths

you always match all, never match none

union of paths is special: { "/a/b" } union { "/a/b/c" } ==> { "/a/b" }

when you request a path, find the longest match from allowed and denied; if it is in allowed you're allowed, otherwise not; when there is a tie: undefined behaviour, do what you want.

examples:

default, effectively:

/ all allowed

or

/ none denied

complex example:

/priv1,/priv2,/login all denied
/cgi-bin indexer allowed but archiver denied
/priv1/pub researcher allowed but blabla,meh,heh,duh denied

what do you think?
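Here is a rough Python sketch of that proposal as I read it, purely for illustration: it is one possible reading of an informal idea, not an agreed spec. In particular, "me" is taken to be the set of descriptors a bot answers to, the "but ... denied" tail is read as applying to the bots it names even when they are not in the allowed list (the nesting above is ambiguous on this point), the proposal's special union is skipped, and ties in the longest match are resolved as allowed.

```python
def parse_bots_txt(text, me):
    """me: the set of descriptors the bot answers to, e.g. {"researcher", "myname"}."""
    allowed, denied = set(), set()
    def matches(field):
        bots = set(field.split(","))
        # "all" matches every bot, "none" matches no bot (per the proposal)
        return "none" not in bots and ("all" in bots or bots & me)
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        paths = set(fields[0].split(","))
        if fields[2] == "allowed":
            if matches(fields[1]):
                allowed |= paths
            # optional "... but <bots> denied" tail on the same line
            tail = len(fields) >= 6 and fields[3] == "but" and fields[5] == "denied"
            if tail and matches(fields[4]):
                denied |= paths
        elif fields[2] == "denied" and matches(fields[1]):
            denied |= paths
    return allowed, denied

def may_fetch(path, allowed, denied):
    longest = lambda rules: max((len(p) for p in rules if path.startswith(p)), default=-1)
    return longest(allowed) >= longest(denied)  # a tie is "undefined"; resolved as allowed here

allowed, denied = parse_bots_txt(
    "/priv1,/priv2,/login all denied\n"
    "/cgi-bin indexer allowed but archiver denied\n"
    "/priv1/pub researcher allowed but blabla,meh,heh,duh denied\n",
    me={"researcher"})
print(may_fetch("/priv1/pub/report.gmi", allowed, denied))  # True
print(may_fetch("/priv2/secret.gmi", allowed, denied))      # False
```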
December 10, 2020 8:43 AM, "Stephane Bortzmeyer" <stephane at sources.org> wrote:

> - snip -
>
> The spec is quite vague about the *order* of directives. For instance,
> <gemini://gempaper.strangled.net/robots.txt> is:
>
> User-agent: *
> Disallow: /credentials.txt
> User-agent: archiver
> Disallow: /
>
> The intended semantics is probably to disallow archivers but my parser
> regarded the site as available because it stopped at the first match,
> the star. Who is right?

Not you. The idea is that you start with the most direct User-Agent that applies to you (in this case, `archiver`), and then if that doesn't say you can't access the file, go up a level (in this case, `*`), and if
On Tue, Dec 08, 2020 at 03:05:42PM +0100, Petite Abeille <petite.abeille at gmail.com> wrote a message of 14 lines which said:

> Like so?
>
> gemini://example.com/robots.txt?user-agent=DetailsOfTheRobot ?

Good idea, this is what I do now.
On Thu, Dec 10, 2020 at 11:37:34PM +0530, Sudipto Mallick <smallick.dev at gmail.com> wrote a message of 40 lines which said:

> 'bots.txt' for gemini bots and crawlers.

Interesting. The good thing is that it moves away from robots.txt (underspecified, full of variants, impossible to know what a good bot should do).

> - know who you are: archiver, indexer, feed-reader, researcher etc.
> - ask for /bots.txt
> - if 20 text/plain then
> -- allowed = set()
> -- denied = set()
> -- split response by newlines, for each line
> --- split by spaces and tabs into fields
> ---- paths = fields[0] split by ','
> ---- if fields[2] is "allowed" and you in fields[1] split by ',' then allowed = allowed union paths
> ----- if fields[3] is "but" and fields[5] is "denied" and you in fields[4] split by ',' then denied = denied union paths
> ---- if fields[2] is "denied" and you in fields[1] split by ',' then denied = denied union paths
>
> you always match all, never match none
>
> union of paths is special: { "/a/b" } union { "/a/b/c" } ==> { "/a/b" }
>
> when you request a path, find the longest match from allowed and denied; if it is in allowed you're allowed, otherwise not; when there is a tie: undefined behaviour, do what you want.

It seems perfect.
On Thu, Dec 10, 2020 at 09:44:50PM +0000, Robert "khuxkm" Miles <khuxkm at tilde.team> wrote a message of 24 lines which said:

> Not you. The idea is that you start with the most direct User-Agent
> that applies to you (in this case, `archiver`), and then if that
> doesn't say you can't access the file, go up a level (in this case,
> `*`),

Reasonable interpretation (more-specific to less-specific). Too bad the "standard" is so vague.
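One way to make "more-specific to less-specific" concrete, consistent with the robotstxt.org "allow a single robot" example quoted earlier: use the group(s) naming your own agent if any exist, and fall back to the "*" group only when there are none. The sketch below reuses the (agents, rules) record shape from the parser sketch above; it is an interpretation, not something the thread settled on.

```python
def allowed(records, agent, path):
    # Consult the group(s) naming your own agent first; fall back to "*" only
    # if there are none. An empty Disallow value means "allow everything".
    for candidate in (agent, "*"):
        matching = [rules for agents, rules in records if candidate in agents]
        if matching:
            rules = [r for group in matching for r in group]
            return not any(r and path.startswith(r) for r in rules)
    return True  # no group applies at all

# gempaper.strangled.net example from earlier in the thread:
records = [(["*"], ["/credentials.txt"]), (["archiver"], ["/"])]
print(allowed(records, "archiver", "/log.gmi"))  # False: the archiver group bars everything
print(allowed(records, "indexer", "/log.gmi"))   # True: only the "*" group applies

# robotstxt.org "allow a single robot" example:
records = [(["google"], [""]), (["*"], ["/"])]
print(allowed(records, "google", "/anything"))   # True
print(allowed(records, "other", "/anything"))    # False
```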
On Thu, Dec 10, 2020 at 11:37:34PM +0530, Sudipto Mallick <smallick.dev at gmail.com> wrote a message of 40 lines which said:

> - ask for /bots.txt

Speaking of this, I suggest it could be better to have a /.well-known (or equivalent) to put all these "meta" files. The Web does it (RFC 5785) and it's cool since it avoids colliding with "real" resources. (Also, crawling the geminispace shows strange robots.txt which are probably "wildcards" or "catchalls", created by a program which replies for every possible path. Having a /.well-known would allow defining an exception.) It requires no change in clients (except bots) or servers, it is just a convention.

=> gemini://gemini.bortzmeyer.org/rfc-mirror/rfc5785.txt RFC 5785 "Defining Well-Known URIs"

Meta-remark: is there a place with all the "Gemini good practices" or "Gemini conventions", which do not change the protocol or the format but are useful?
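If such a convention were adopted, a crawler could probe the well-known location first and fall back to the legacy path. A sketch only: the path "/.well-known/bots.txt" is hypothetical (nothing has been agreed), and fetch() stands in for whatever Gemini request function the crawler already has.

```python
# Hypothetical path: no /.well-known convention has been agreed for Gemini.
CANDIDATE_PATHS = ["/.well-known/bots.txt", "/bots.txt", "/robots.txt"]

def find_policy(fetch, host):
    """fetch(host, path) is assumed to return (status, body) for a Gemini request."""
    for path in CANDIDATE_PATHS:
        status, body = fetch(host, path)
        if status == 20:
            return path, body
    return None, None
```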
On Friday, 11 December 2020 at 09:26:54 CET, Stephane Bortzmeyer wrote:

> > - know who you are: archiver, indexer, feed-reader, researcher etc.
> > - ask for /bots.txt
> > - if 20 text/plain then
> > -- allowed = set()
> > -- denied = set()
> > -- split response by newlines, for each line
> > --- split by spaces and tabs into fields
> > ---- paths = fields[0] split by ','
> > ---- if fields[2] is "allowed" and you in fields[1] split by ',' then allowed = allowed union paths
> > ----- if fields[3] is "but" and fields[5] is "denied" and you in fields[4] split by ',' then denied = denied union paths
> > ---- if fields[2] is "denied" and you in fields[1] split by ',' then denied = denied union paths
> >
> > you always match all, never match none
> >
> > union of paths is special: { "/a/b" } union { "/a/b/c" } ==> { "/a/b" }
> >
> > when you request a path, find the longest match from allowed and denied; if it is in allowed you're allowed, otherwise not; when there is a tie: undefined behaviour, do what you want.
>
> It seems perfect.

I guess I'm not the only one needing some examples to fully understand how this would work? If I get it, it's something like so:

path1,path2 archiver,crawler allowed but path3 denied
path4 * denied
> On Dec 11, 2020, at 11:16, Stephane Bortzmeyer <stephane at sources.org> wrote:
>
> I suggest it could be better to have a /.well-known

+1 for the /.well-known convention.

This was mentioned several times previously, but inertia is strong with that one. Go figure.
On Fri, Dec 11, 2020 at 11:18:00AM +0100, Côme Chilliet <come at chilliet.eu> wrote a message of 33 lines which said:

> I guess I'm not the only one needing some examples to fully understand how this would work?

There were examples at the end of the original message.
On Fri, Dec 11, 2020 at 11:25:31AM +0100, Petite Abeille <petite.abeille at gmail.com> wrote a message of 11 lines which said:

> This was mentioned several times previously, but inertia is strong
> with that one. Go figure.

This is one of the interesting things with Gemini: it is a social experience, more than a technical one. I love observing how Gemini governance works (or fails).
> On Dec 11, 2020, at 11:31, Stephane Bortzmeyer <stephane at sources.org> wrote:
>
> This is one of the interesting things with Gemini: it is a social
> experience, more than a technical one. I love observing how Gemini
> governance works (or fails).

Huis clos meets Lord of the Flies? :D
On Friday, 11 December 2020 at 11:16:05 CET, Stephane Bortzmeyer wrote:

> Meta-remark: is there a place with all the "Gemini good practices" or
> "Gemini conventions", which do not change the protocol or the format
> but are useful?

Some are listed in gemini://gemini.circumlunar.space/docs/

So I'd expect this page to regroup all validated specifications, and best/common practices/conventions.

I also started gemini://gemlog.lanterne.chilliet.eu/specifications.en.gmi to list actual specifications and proposals without mixing them with best practice documents.

Côme
What I wrote was a rough algorithm; now here is a human-readable description for bots.txt.

Every line has the following format:

<paths> <bots> ("allowed" | "denied")
OR
<paths> <bots> "allowed" "but" <bots> "denied"

<paths> is a comma-separated list of paths to be allowed or denied.
<bots> is a comma-separated list of bot ''descriptors'' (think of a better word for this) matching [A-Za-z][A-Za-z0-9_-]*
> for example:
>
> /a,/p all denied
> /a/b,/p/q indexer,researcher allowed
> /a/b/c researcher denied
> /a/b/c heh allowed
>
> now the researcher 'heh' may access /p/q/* and /a/b/*
> and it may not access /a/b/{X} when {X} != 'c'

err. sorry. that should be: may not access /a/{X} when {X} != 'b' and /p/{Y} when {Y} != 'q' (for all indexers and researchers, hmm.)

everyone other than researchers and indexers may not access /a/* and /p/*

> other researchers may only access /p/q and /a/b/{Z} when {Z} != 'c' so
> they may not access /a/b/c

indexers may access /a/b/* and /p/q/*

ah.
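For what it is worth, running the parse_bots_txt / may_fetch sketch from earlier in this thread over this example (same caveats: one reading of the proposal, identities given as a set such as {"researcher", "heh"}, ties in the longest match resolved as allowed) gives:

```python
example = (
    "/a,/p all denied\n"
    "/a/b,/p/q indexer,researcher allowed\n"
    "/a/b/c researcher denied\n"
    "/a/b/c heh allowed\n"
)

heh = parse_bots_txt(example, me={"researcher", "heh"})   # the researcher named 'heh'
other = parse_bots_txt(example, me={"researcher"})        # any other researcher

print(may_fetch("/a/b/c/data.gmi", *heh))    # True: allow and deny tie on /a/b/c, resolved as allowed here
print(may_fetch("/a/b/c/data.gmi", *other))  # False: other researchers lose /a/b/c
print(may_fetch("/a/b/notes.gmi", *heh))     # True: /a/b is allowed for researchers
print(may_fetch("/a/x.gmi", *heh))           # False: only /a/b is carved out of the denied /a
```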
(Sorry if this is the wrong place to reply.)

Why are we defining new standards and filenames? bots.txt, .well-known, etc. We don't need this. Gemini is based around the idea of radical familiarity. Creating a new robots standard breaks that, and makes things more complicated. There are existing complete robots.txt standards, are there not? I admit I'm not well-versed in this, but let's just pick a standard that works and make it official.

After doing some quick research, I found that Google has submitted a draft spec for robots.txt to the IETF. The original draft was submitted on July 07, 2019, and the most recent draft was submitted ~3 days ago, on the 8th.

https://developers.google.com/search/reference/robots_txt
https://tools.ietf.org/html/draft-koster-rep-04

I am no big fan of Google, but they are the kings of crawling and it makes sense to go with them here. The spec makes many example references to HTTP, but note that it is fully protocol-agnostic, so it works fine for Gemini.

makeworld
> On Dec 11, 2020, at 21:38, colecmac at protonmail.com wrote:
>
> We don't need this.

hmmm? who are you again? rhetorical.
> Why are we defining new standards and filenames? bots.txt, .well-known,
> etc.

Just want to point out that the .well-known path for machine-readable data about an origin is a proposed IETF standard that has relatively widespread use today. It is filed under RFC 8615, and is definitely not a standard that was invented in this thread. The first paragraph of the introduction even references the robots file.

While I don't necessarily agree with the naming of bots.txt, I see no problem with putting these files under a .well-known directory.

> We don't need this.

Thanks for making this mailing list a lot more efficient and talking about what the Gemini community needs in a 4-word sentence.

Even if the original path of /robots.txt is kept, I think it makes sense to clarify an algorithm in non-ambiguous steps in order to get rid of the disagreements in edge cases.

> let's just pick a standard that works and make it official.

The point is that the standard works for simple cases, but leaves a lot to be desired when it comes to clarifying more complex cases. This results in a lot of robots.txt implementations disagreeing about what is allowed and not allowed.

Additionally, by crawling the Web, you can see that people tend to extend robots.txt in non-standard ways, and this only gets incorporated into Google's crawlers if the website is important enough.

> I am no big fan of Google, but they are the kings of crawling and it
> makes sense to go with them here.

The kings of crawling deemed HTTP and AMP the most suitable protocol and markup format for content to be crawled; why don't we stop inventing standards like Gemini and Gemtext and go with them here.

> The spec makes many example references to HTTP, but note that it is
> fully protocol-agnostic, so it works fine for Gemini.

The Gemtext spec makes references to Gemini, but it is fully protocol-agnostic, so it works fine with HTTP. Similarly, Gemini makes many references to Gemtext, but it is content-type agnostic so it works fine with HTML. But we thought we could be better off shedding historical baggage and reinvented not one, but two main concepts of the traditional Web.

--
Leo
On Fri, Dec 11, 2020 at 09:20:25AM +0100, Stephane Bortzmeyer <stephane at sources.org> wrote a message of 10 lines which said:

> > Like so?
> >
> > gemini://example.com/robots.txt?user-agent=DetailsOfTheRobot ?
>
> Good idea, this is what I do now.

Note that it is not guaranteed to work because of broken (IMHO) servers. For instance, <gemini://alexschroeder.ch/robots.txt?robot=true> redirects to <gemini://alexschroeder.ch/page/robots.txt?robot=true>, which returns a code 50 :-(
On Fri, Dec 11, 2020 at 08:38:12PM +0000, colecmac at protonmail.com <colecmac at protonmail.com> wrote a message of 29 lines which said:

> Gemini is based around the idea of radical familiarity. Creating a
> new robots standard breaks that, and makes things more
> complicated. There are existing complete robots.txt standards

The problem is precisely the final S. I can add that most of the "standards" are poorly written and very incomplete.

> but let's just pick a standard that works and make it official.

OK, that's fine with me.

> https://tools.ietf.org/html/draft-koster-rep-04

(Better to indicate the URL without the version number, to get the latest version.) It seems well-specified, although quite rich, so more difficult to implement.
> On Dec 12, 2020, at 15:28, Stephane Bortzmeyer <stephane at sources.org> wrote:
>
> Note that it is not guaranteed to work because of broken (IMHO)
> servers. For instance,
> <gemini://alexschroeder.ch/robots.txt?robot=true> redirects to
> <gemini://alexschroeder.ch/page/robots.txt?robot=true>, which returns a
> code 50 :-(

Bummer. But to be expected. Especially from Alex, who likes to, hmmm, experiment wildly :)

Still seems to be a reasonable approach. Sean mentioned using #fragment to the same effect.
On Sat, Dec 12, 2020 at 05:52:39PM +0100, Petite Abeille <petite.abeille at gmail.com> wrote a message of 17 lines which said:

> But to be expected.

I had to add blacklist support in my crawler to explicitly exclude some... creative capsules.

> Sean mentioned using #fragment to the same effect.

That would be wrong, since URI fragments are not sent to the server, unlike queries.
> On Dec 11, 2020, at 21:38, colecmac at protonmail.com wrote:
>
> I am no big fan of Google, but they are the kings of crawling and it makes sense
> to go with them here.

Interesting read:

Google's Got A Secret
https://knuckleheads.club

TL;DR: There Should Be A Public Cache Of The Web
On Saturday, December 12, 2020 4:46 AM, Leo <list at gkbrk.com> wrote:

> > Why are we defining new standards and filenames? bots.txt, .well-known,
> > etc.
>
> Just want to point out that the .well-known path for machine-readable
> data about an origin is a proposed IETF standard that has relatively
> widespread use today. It is filed under RFC 8615, and is definitely not a
> standard that was invented in this thread.

Yes, I'm aware of .well-known. I meant that using it to hold a robots.txt-type file would be a new filepath.

> The first paragraph of the introduction even references the robots file.
>
> While I don't necessarily agree with the naming of bots.txt, I see no
> problem with putting these files under a .well-known directory.

My only problem was that it would be reinventing something that already exists, but I likewise have no issue with the idea of .well-known in general.

> > We don't need this.
>
> Thanks for making this mailing list a lot more efficient and talking
> about what the Gemini community needs in a 4-word sentence.

Sorry, perhaps that was too curt. I don't intend to speak for everyone; it's only my opinion. However, it wasn't just a 4-word sentence: I backed up my opinion with the rest of my email.

> Even if the original path of /robots.txt is kept, I think it makes sense
> to clarify an algorithm in non-ambiguous steps in order to get rid of
> the disagreements in edge cases.
>
> > let's just pick a standard that works and make it official.
>
> The point is that the standard works for simple cases, but leaves a lot
> to be desired when it comes to clarifying more complex cases. This
> results in a lot of robots.txt implementations disagreeing about what is
> allowed and not allowed.

I agree; that's why I was trying to pick a standard instead of developing our own. Picking a well-written standard will cover all these cases.

> Additionally, by crawling the Web, you can see that people tend to extend
> robots.txt in non-standard ways, and this only gets incorporated into
> Google's crawlers if the website is important enough.

Ok? I don't see how that's relevant. By picking a standard here and sticking to it, we can avoid that.

> > I am no big fan of Google, but they are the kings of crawling and it
> > makes sense to go with them here.
>
> The kings of crawling deemed HTTP and AMP the most suitable protocol and
> markup format for content to be crawled; why don't we stop inventing
> standards like Gemini and Gemtext and go with them here.

Again, I don't see how that's relevant. Yes, "we" don't like AMP here, yes, we don't like HTTP, etc. This doesn't make the robots.txt standard I sent a bad one.

> > The spec makes many example references to HTTP, but note that it is
> > fully protocol-agnostic, so it works fine for Gemini.
>
> The Gemtext spec makes references to Gemini, but it is fully
> protocol-agnostic, so it works fine with HTTP. Similarly, Gemini makes
> many references to Gemtext, but it is content-type agnostic so it works
> fine with HTML. But we thought we could be better off shedding
> historical baggage and reinvented not one, but two main concepts of the
> traditional Web.

This quote, along with this line:

> why don't we stop inventing standards like Gemini and Gemtext and go with
> them [Google] here.

makes me think you misunderstand some of Gemini's ideas. Solderpunk has talked about the idea of "radical familiarity" on here, and how Gemini uses well-known (ha) protocols as much as possible, like TLS, media types, URLs, etc. Gemini tries to *avoid* re-inventing!
Obviously the protocol itself and gemtext are the exceptions to this, but otherwise, making niche protocols only for Gemini is not the way to go (in my opinion). It makes things harder to implement and understand for developers.

Cheers,
makeworld
---