Hi gemilist (listini?),

I've got a minor quibble with the spec, section 2, paragraph ... 3(?),
which I'll quote here.

> <URL> is a UTF-8 encoded absolute URL, of maximum length 1024 bytes. If
> the scheme of the URL is not specified, a scheme of gemini:// is
> implied.

Specifically, the "scheme of gemini:// is implied" clause is confusing.
According to the URL spec (https://tools.ietf.org/html/rfc3986),

> The authority component is preceded by a double slash ("//") and is
> terminated by the next slash ("/"), question mark ("?"), or number
> sign ("#") character, or by the end of the URI.

Meaning that the scheme does not, in fact, include a "//" at the end,
but rather that "//" is a separator between the scheme and the
authority. In fact, to actually encode a scheme-agnostic URL in a
link, an author needs to write "//example.com/path". For an example,
see the links in flounder.online.

I bring this issue up because there have been instances of geminauts
linking like this:

=> example.com/path An example link

Which resolves, not to gemini://example.com/path, but
to ./example.com/path on the current server.

To resolve this confusion, I propose to either

(a) strip the "//" (and probably ":", though I found no particular
reference to it in the spec) from the "scheme of gemini:// is implied"
portion of the above paragraph, or

(b) remove the scheme bit altogether. I personally prefer this because
it's maximally precise.

I'd love to hear your thoughts on the matter.

--
~ acdw
acdw.net | breadpunk.club/~breadw
I think you're confusing what that section is talking about. I believe it is referring to sending a URL for a request only. It's saying, "when you make a request, you can leave the gemini:// part out". I don't think it speaks to links in documents at all, which are governed by the URL RFC. I agree it's confusing however, because of the use of the word "scheme", while also including the colon and slashes. I think it's totally fine that those characters can be left off in the request, but this line should be more clear. How about saying: > If the URL does not begin with `gemini://`, then that prefix is implied. > Leaving off just the `gemini:` portion and starting with `//` also implies > the gemini scheme, in accordance with the URL spec. That might be too wordy, and perhaps requiring that all request URLs have a // would be better. But I don't want to break backwards compatibility. makeworld
On Mon, 16 Nov 2020 23:39:19 +0000 acdw <acdw at acdw.net> wrote: > > The authority component is preceded by a double slash ("//") and is > > terminated by the next slash ("/"), question mark ("?"), or number > > sign ("#") character, or by the end of the URI. > > Meaning that the scheme does not, in fact, include a "//" at the end, > but rather that "//" is a separator between the scheme and the > authority. In fact, to actually encode a scheme-agnostic URL in a > link, an author needs to write "//example.com/path". For an example, > see the links in flounder.online. > > I bring this issue up because there have been instances of geminauts > linking like this: > > => example.com/path An example link > > Which resolves, not to gemini://example.com/path, but > to ./example.com/path on the current server. This is wrong, even by web standards, when referencing to a different host, one must explicitly write a valid URL, you DON'T see: > <a href="example.tld/index.html"></a> > To resolve this confusion, I propose is to either > > (a) strip the "//" (and probably ":", though I found no particular > reference to it in the spec) from the "scheme of gemini:// is > implied" portion of the above paragraph, or In my humble opinion, I think that "//example.tld/" is an implementation specific hack and has no place in the protocol, a URI like that is invalid and should not be respected by servers, what should actually work is providing authority and path like so: "example.tld/path/", this is discouraged by RFC 3986 (section 4.5) but it actually makes sense if context is defined, in this case, context is gemini so a scheme of gemini is implied. Also, this is the default behavior for web browsers implying scheme of http(s), which I think is acceptable and convenient behavior, so I agree with you on that, assuming that's what you meant. > (b) remove the scheme bit altogether. I personally prefer this > because it's maximally precise. 
The scheme bit in requests allows for proxies to work, for example, when I host a proxy instance at "gemini://raiz.proxy/" someone sends a request of "https://example.tld/", my proxy can fetch the page and send it back to the client through gemini, I think that's why it's there. Perhaps there are many many other use cases for this that I haven't thought of.
It was thus said that the Great acdw once stated: > I've got a minor quibble with the spec, section 2, paragraph ... 3(?), > which I'll quote here. > > > <URL> is a UTF-8 encoded absolute URL, of maximum length 1024 bytes. If > > the scheme of the URL is not specified, a scheme of gemini:// is > > implied. [ snip ] > To resolve this confusion, I propose is to either > > (a) strip the "//" (and probably ":", though I found no particular > reference to it in the spec) from the "scheme of gemini:// is implied" > portion of the above paragraph, or > > (b) remove the scheme bit altogether. I personally prefer this because > it's maximally precise. > > I'd love to hear your thoughts on the matter. This has come up before [1][2], and as I have stated [3][4], the '//' is considered part of the host (or at least, a marker for the host portion of a URL) and thus, I think the wording of section 2 should be changed to read <URL> is a UTF-8 encoded absolute URL, of maximum length 1024 bytes. If the scheme of the URL is not specified, a scheme of gemini: is implied. -spc [1] https://lists.orbitalfox.eu/archives/gemini/2020/001006.html [2] https://lists.orbitalfox.eu/archives/gemini/2020/002954.html [3] https://lists.orbitalfox.eu/archives/gemini/2020/001009.html [4] https://lists.orbitalfox.eu/archives/gemini/2020/002964.html
It was thus said that the Great Ali Fardan once stated:
> On Mon, 16 Nov 2020 23:39:19 +0000
> acdw <acdw at acdw.net> wrote:
> > > The authority component is preceded by a double slash ("//") and is
> > > terminated by the next slash ("/"), question mark ("?"), or number
> > > sign ("#") character, or by the end of the URI.
> >
> > Meaning that the scheme does not, in fact, include a "//" at the end,
> > but rather that "//" is a separator between the scheme and the
> > authority. In fact, to actually encode a scheme-agnostic URL in a
> > link, an author needs to write "//example.com/path". For an example,
> > see the links in flounder.online.
> >
> > I bring this issue up because there have been instances of geminauts
> > linking like this:
> >
> > => example.com/path An example link
> >
> > Which resolves, not to gemini://example.com/path, but
> > to ./example.com/path on the current server.
>
> This is wrong, even by web standards, when referencing to a different
> host, one must explicitly write a valid URL, you DON'T see:
>
> > <a href="example.tld/index.html"></a>
>
> > To resolve this confusion, I propose to either
> >
> > (a) strip the "//" (and probably ":", though I found no particular
> > reference to it in the spec) from the "scheme of gemini:// is
> > implied" portion of the above paragraph, or
>
> In my humble opinion, I think that "//example.tld/" is an
> implementation specific hack and has no place in the protocol, a URI
> like that is invalid and should not be respected by servers,

  It *is* allowed though---it's a schemeless URI and in a given context, it
can be inferred. Check out RFC-3986 section 5.2.2 (Transform References,
aka, resolving a URL with a base URL) and section 5.3 (Component
Recomposition) where ':' is appended to the scheme, and '//' is prefixed to
the authority (host) section.

  So, given a URL like this:

	//example.net/path/to/resource

in a resource, if the resource was served up via HTTP, then the scheme is
'http:'; if HTTPS, then 'https:' and if gemini, 'gemini:'.

  A URL like this:

	example.net/path/to/resource

is, again, per RFC-3986 parsing rules, to be interpreted as a path, not an
authority section then path. Need I create an example to show this? I can.

  -spc
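A small sketch of the resolution described above, using Python's standard urllib.parse. The base URLs are made up for illustration. One assumption to flag loudly: urljoin only resolves references for schemes listed in two module-level registries, and "gemini" is not in them by default, so the sketch adds it first. That registration is a workaround for this library, not something the RFC or the Gemini spec asks for.

```python
import urllib.parse
from urllib.parse import urljoin

# Workaround (assumption about this library, not about Gemini): urljoin
# returns the reference untouched for schemes missing from these lists.
for registry in (urllib.parse.uses_relative, urllib.parse.uses_netloc):
    if "gemini" not in registry:
        registry.append("gemini")

NETWORK_PATH = "//example.net/path/to/resource"
BARE = "example.net/path/to/resource"

# Hypothetical documents that contain the link, one per scheme.
BASES = [
    "http://example.org/dir/index.html",
    "https://example.org/dir/index.html",
    "gemini://example.org/dir/index.gmi",
]


def resolve_all(reference):
    """Resolve one reference against every base document in BASES."""
    return [urljoin(base, reference) for base in BASES]


# The "//" form keeps its own host and inherits only the scheme.
network_results = resolve_all(NETWORK_PATH)
# The bare form is a relative path, so it stays on the base's host.
bare_results = resolve_all(BARE)

for base, result in zip(BASES, network_results):
    print(f"{base} + {NETWORK_PATH} -> {result}")
for base, result in zip(BASES, bare_results):
    print(f"{base} + {BARE} -> {result}")
```

In each case the "//" reference picks up the scheme of the document it appears in, while the bare form lands under /dir/ on example.org.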
On Mon, 16 Nov 2020 21:19:16 -0500
Sean Conner <sean at conman.org> wrote:
> It *is* allowed though---it's a schemeless URI and in a given
> context, it can be inferred. Check out RFC-3986 section 5.2.2
> (Transform References, aka, resolving a URL with a base URL) and
> section 5.3 (Component Recomposition) where ':' is appended to the
> scheme, and '//' is prefixed to the authority (host) section.
>
> So, given a URL like this:
>
> //example.net/path/to/resource
>
> in a resource, if the resource was served up via HTTP, then the
> scheme is 'http:'; if HTTPS, then 'https:' and if gemini, 'gemini:'.
>
> A URL like this:
>
> example.net/path/to/resource
>
> is, again, per RFC-3986 parsing rules, to be interpreted as a path,
> not an authority section then path. Need I create an example to show
> this? I can.

You are correct.
Heya!

> It *is* allowed though---it's a schemeless URI and in a given context, it
> can be inferred. Check out RFC-3986 section 5.2.2 (Transform
> References, aka, resolving a URL with a base URL) and section 5.3
> (Component Recomposition) where ':' is appended to the scheme, and '//' is
> prefixed to the authority (host) section.
>
> So, given a URL like this:
>
> //example.net/path/to/resource
>
> in a resource, if the resource was served up via HTTP, then the scheme is
> 'http:'; if HTTPS, then 'https:' and if gemini, 'gemini:'.

I'm using this on gemini sites that are also hosted in web space. This
allows cross-server linking without changing protocol, which is very
convenient.

> A URL like this:
>
> example.net/path/to/resource
>
> is, again, per RFC-3986 parsing rules, to be interpreted as a path, not an
> authority section then path. Need I create an example to show this? I can.

Exactly.
On Tue, 17 Nov 2020 04:47:45 +0300 Ali Fardan <raiz at stellarbound.space> wrote: > In my humble opinion, I think that "//example.tld/" is an > implementation specific hack and has no place in the protocol, a URI > like that is invalid and should not be respected by servers, what > should actually work is providing authority and path like so: > "example.tld/path/", this is discouraged by RFC 3986 (section 4.5) but > it actually makes sense if context is defined, in this case, context is > gemini so a scheme of gemini is implied. With respect to RFC3986, it's not a matter of opinion. It's very much not an implementation specific hack. It's defined in RFC 3986 as "relative-ref", a "network-path reference" specifically. Non-URIs of the "example.com/hello" style on the other hand are an implementation specific hack, as you've noted, discouraged by RFC 3986 and not specified in any of the syntaxes it defines. It's obviously unsuitable for links because it's ambiguous with relative-ref. Gemini however explicitly only allows "absolute URL" in requests. It also says that "If the scheme of the URL is not specified, a scheme of gemini:// is implied." In terms of RFC 3986, this is nonsense. "gemini://" isn't the scheme. "gemini" is the scheme, "//" is the beginning of hier-part or relative-part, and ":" separates the scheme from hier-part. I've previously called for clarification on this point. One might read that last sentence as requests by suffix references are allowed, (which is what you get when you omit "gemini://") or that some relative-ref are allowed (which is what you get if you literally omit the scheme and scheme separator). I'd prefer if the spec could just refer to an expected syntax as defined in RFC 3986. This would reduce confusion significantly. Skip all the hacks and allow only e.g. the URI syntax (which does not include relative-ref) for requests and URI-reference syntax (which includes URI and relative-ref) for links. 
Adopt the language of RFC 3986 to describe them. Last I checked, if you connect to gemini://gemini.circumlunar.space and request "gemini.circumlunar.space/" you get an error. You may however request "//gemini.circumlunar.space/" and get the appropriate 20 response. Should gemini.circumlunar.space be considered to be running a canonical implementation of Gemini? -- Philip
On Mon, 16 Nov 2020 21:07:54 -0500 Sean Conner <sean at conman.org> wrote: > This has come up before [1][2], and as I have stated [3][4], the '//' is > considered part of the host (or at least, a marker for the host portion of a > URL) and thus, I think the wording of section 2 should be changed to read > > <URL> is a UTF-8 encoded absolute URL, of maximum length 1024 bytes. > If the scheme of the URL is not specified, a scheme of gemini: is > implied. > "gemini:" is not a valid scheme. ":" is part of the URI and absolute-URI syntaxes defined in RFC 3986, not the scheme. The spec should be able to express any sensible acceptable URI syntax in terms of the syntaxes and terminology defined in RFC 3986. There's no need to add weird exceptions outside RFC 3986 that aren't already covered in it. For example, the spec can read: "<URL> is an UTF-8 encoded URI or network-path reference as defined in RFC 3986" (requests) and "<URL> is an UTF-8 encoded URI-reference as defined in RFC 3986" (links). If we want requests like "gemini.circumlunar.space/" to be valid, it can additionally read that <URL> allows suffix references. To call it an "absolute URL" is especially concerning since Gemini apparently allows fragments, but RFC 3986 defines an "absolute-URI" syntax which does not. -- Philip
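To make the RFC 3986 vocabulary used above concrete, here is a rough sketch that names the broad syntax a string falls under. The function name and its labels are made up for illustration, and it only looks at overall shape (scheme present, leading "//", fragment present); it does not validate every character against the ABNF. It leans on Python's urllib.parse.urlsplit to spot the scheme.

```python
from urllib.parse import urlsplit


def classify(text):
    """Return the most specific RFC 3986 syntax name that fits `text`.

    Shape check only: a string with a scheme is a URI, and a URI with no
    "#" also fits the stricter absolute-URI rule. Anything without a
    scheme is a relative-ref; one starting with "//" is a network-path
    reference.
    """
    if urlsplit(text).scheme:
        return "URI" if "#" in text else "absolute-URI"
    if text.startswith("//"):
        return "network-path reference"
    return "other relative-ref"


for sample in (
    "gemini://gemini.circumlunar.space/",
    "gemini://gemini.circumlunar.space/docs/#top",
    "//gemini.circumlunar.space/",
    "gemini.circumlunar.space/",
):
    print(f"{sample!r}: {classify(sample)}")
```

Note that the last sample, the suffix-reference style, cannot be told apart from an ordinary relative path by syntax alone, which is the ambiguity being discussed.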
On Tue, 17 Nov 2020 10:19:52 +0100 Philip Linde <linde.philip at gmail.com> wrote: > With respect to RFC3986, it's not a matter of opinion. > > It's very much not an implementation specific hack. It's defined in > RFC 3986 as "relative-ref", a "network-path reference" specifically. > Non-URIs of the "example.com/hello" style on the other hand are an > implementation specific hack, as you've noted, discouraged by RFC 3986 > and not specified in any of the syntaxes it defines. It's obviously > unsuitable for links because it's ambiguous with relative-ref. I don't know about that, section 3.2 states that authority should be preceded by a "//", not that it is a part of the authority component, also, the ABNF representation has no "//" in it. Suffix references (section 4.5) are only discouraged because of possible misinterpretation, however in the case of Gemini requests, people can write their code to handle them just like they write their code to handle "//example.tld", it's not that hard and looks much much cleaner, the argument that it could be interpreted as path should also apply for "//example.tld" too, because it could be interpreted as a path too, however if the author decided to handle such case, it'll be handled just fine, you can have your parser treat the text before the first occurrence of '/' as host subcomponent of authority component if scheme is not specified just like you have your parser treat the first occurrence of '/' after the "//" prefix as host subcomponent in the current way of handling schemeless requests in Gemini, the Gemini protocol requires passing full URL in requests, therefore, such should not be interpreted as path because Gemini requests don't allow path without stating host. 
So yeah, I'm not changing my mind, "//example.tld" is a hack because that is not a valid URI and "//" is supposed to be only present when scheme is specified, however, "example.tld" is while discouraged, acceptable for this use case and the RFC even acknowledged it. Let me quote to you why it is that RFC 3986 discourages its use: > Although this practice of using suffix references is common, it > should be avoided whenever possible and should never be used in > situations where long-term references are expected. In the case of Gemini requests, they are not a 'long-term' reference, they're one-time requests, I don't see any downside to not doing it. > Last I checked, if you connect to gemini://gemini.circumlunar.space > and request "gemini.circumlunar.space/" you get an error. You may > however request "//gemini.circumlunar.space/" and get the appropriate > 20 response. Should gemini.circumlunar.space be considered to be > running a canonical implementation of Gemini? You shouldn't look at any particular implementation as a reference for the spec, I'm assuming gemini.circumlunar.space is running molly-brown, do you know that molly-brown treats single '\n' as valid request terminators instead of explicit '\r\n'? (see: https://tildegit.org/solderpunk/molly-brown/src/branch/master/handler.go#L138), do you know that if a transaction is finished, molly-brown waits for the client to close the connection instead of closing it from the server side, is that spec compliant? The reason I think molly-brown accepted "//example.tld" in the first place is because the Go standard library URL parser implementation accepted this, I don't know if this was a bug or it is intended design, but that's what it is, other URI parsers that are more strict with compliance to the RFC will refuse to parse a URI without scheme present, here is an excerpt from the library's documentation that might give an idea of how they treat URLs: > A URL represents a parsed URL (technically, a URI reference). 
> > The general form represented is: > > [scheme:][//[userinfo@]host][/]path[?query][#fragment] > > URLs that do not start with a slash after the scheme are > interpreted as: > > scheme:opaque[?query][#fragment] Notice that [scheme:] is enclosed in brackets implying that it is optional, while [//host] is optional too, the "//" is considered a part of the authority component by the Go URL parser implementation, this is why "//example.tld" is accepted while "example.tld" is not, try passing both strings to url.Parse() and see what you get.
It was thus said that the Great Ali Fardan once stated:
> On Tue, 17 Nov 2020 10:19:52 +0100
> Philip Linde <linde.philip at gmail.com> wrote:
> > With respect to RFC3986, it's not a matter of opinion.
> >
> > It's very much not an implementation specific hack. It's defined in
> > RFC 3986 as "relative-ref", a "network-path reference" specifically.
> > Non-URIs of the "example.com/hello" style on the other hand are an
> > implementation specific hack, as you've noted, discouraged by RFC 3986
> > and not specified in any of the syntaxes it defines. It's obviously
> > unsuitable for links because it's ambiguous with relative-ref.
>
> I don't know about that, section 3.2 states that authority should be
> preceded by a "//", not that it is a part of the authority component,
> also, the ABNF representation has no "//" in it.
>
> Suffix references (section 4.5) are only discouraged because of
> possible misinterpretation, however in the case of Gemini requests,
> people can write their code to handle them just like they write their
> code to handle "//example.tld", it's not that hard and looks much much
> cleaner, the argument that it could be interpreted as path should also
> apply for "//example.tld" too, because it could be interpreted as a
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> path too, however if the author decided to handle such case, it'll be
  ^^^^^^^^^

  Citation needed. I'm sorry, this just isn't the case. From the full ABNF
in Appendix A:

   URI           = scheme ":" hier-part [ "?" query ] [ "#" fragment ]

   hier-part     = "//" authority path-abempty
                 / path-absolute
                 / path-rootless
                 / path-empty

   URI-reference = URI / relative-ref

   absolute-URI  = scheme ":" hier-part [ "?" query ]

   relative-ref  = relative-part [ "?" query ] [ "#" fragment ]

   relative-part = "//" authority path-abempty
                 / path-absolute
                 / path-noscheme
                 / path-empty

	[ NON-PATH RELATED RULES OMITTED FOR SPACE
	  I REPEAT NON-PATH RELATED RULES OMITTED FOR SPACE ]

   path          = path-abempty    ; begins with "/" or is empty
                 / path-absolute   ; begins with "/" but not "//"
                 / path-noscheme   ; begins with a non-colon segment
                 / path-rootless   ; begins with a segment
                 / path-empty      ; zero characters

   path-abempty  = *( "/" segment )
   path-absolute = "/" [ segment-nz *( "/" segment ) ]
   path-noscheme = segment-nz-nc *( "/" segment )
   path-rootless = segment-nz *( "/" segment )
   path-empty    = 0<pchar>

  The path parsing rules state a single slash. Not '/'+, nor '/'*, but a
single '/'. The only place where more than a single slash is allowed PER
THE @#%@#$@$ ABNF is just prior to the authority, which contains the
hostname. THE ONLY PLACE!

  I will also draw your attention to the URI-reference rule, which is there
for some reason, which allows both a full URI, or a RELATIVE URI, which
means that

	//example.com/path/to/resource

IS A VALID URI! IT IS NOT A HACK! What part of the ABNF do you not
understand?

> handled just fine, you can have your parser treat the text before the
> first occurrence of '/' as host subcomponent of authority component if
> scheme is not specified just like you have your parser treat the first
> occurrence of '/' after the "//" prefix as host subcomponent in the
> current way of handling schemeless requests in Gemini, the Gemini
> protocol requires passing full URL in requests, therefore, such should
> not be interpreted as path because Gemini requests don't allow path
> without stating host.

  No, the spec allows both the full URI, and a relative URI as long as it
starts with '//' (it has the authority section). The wording in the spec is
bad and should be changed to clarify it, but that's the current
specification. Again,

	//example.com/path/to/resource

IS NOT A HACK!
> So yeah, I'm not changing my mind, "//example.tld" is a hack because > that is not a valid URI and "//" is supposed to be only present when > scheme is specified, however, "example.tld" is while discouraged, > acceptable for this use case and the RFC even acknowledged it. > > Let me quote to you why it is that RFC 3986 discourages its use: > > > Although this practice of using suffix references is common, it > > should be avoided whenever possible and should never be used in > > situations where long-term references are expected. > > In the case of Gemini requests, they are not a 'long-term' reference, > they're one-time requests, I don't see any downside to not doing it. > > > Last I checked, if you connect to gemini://gemini.circumlunar.space > > and request "gemini.circumlunar.space/" you get an error. You may > > however request "//gemini.circumlunar.space/" and get the appropriate > > 20 response. Should gemini.circumlunar.space be considered to be > > running a canonical implementation of Gemini? > > You shouldn't look at any particular implementation as a reference for > the spec, I believe Philip used gemini.circumlunar.space because that's the server written by solderpunk, author of the specification. > I'm assuming gemini.circumlunar.space is running molly-brown, Also written by solderpunk. The bastard! Writing a Gemini server that doesn't follow his specification! > do you know that molly-brown treats single '\n' as valid request > terminators instead of explicit '\r\n'? (see: > https://tildegit.org/solderpunk/molly-brown/src/branch/master/handler.go#L138), > do you know that if a transaction is finished, molly-brown waits for > the client to close the connection instead of closing it from the > server side, is that spec compliant? 
> > The reason I think molly-brown accepted "//example.tld" in the first > place is because the Go standard library URL parser implementation > accepted this, I don't know if this was a bug or it is intended design, It's by design---see the ABNF above. > but that's what it is, other URI parsers that are more strict with > compliance to the RFC will refuse to parse a URI without scheme > present, If it does, it's broken by design. Again, see the ABNF above. > here is an excerpt from the library's documentation that might > give an idea of how they treat URLs: > > > A URL represents a parsed URL (technically, a URI reference). > > > > The general form represented is: > > > > [scheme:][//[userinfo@]host][/]path[?query][#fragment] > > > > URLs that do not start with a slash after the scheme are > > interpreted as: > > > > scheme:opaque[?query][#fragment] > > Notice that [scheme:] is enclosed in brackets implying that it is > optional, while [//host] is optional too, the "//" is considered a part > of the authority component by the Go URL parser implementation, this is > why "//example.tld" is accepted while "example.tld" is not, try passing > both strings to url.Parse() and see what you get. Yes, exactly. Again, that's per the ABNF above. Why do you not get this? 
  Here, have one more excerpt from RFC-3986, this time from section 3:

   The following are two example URIs and their component parts:

         foo://example.com:8042/over/there?name=ferret#nose
         \_/   \______________/\_________/ \_________/ \__/
          |           |            |            |        |
       scheme     authority       path        query   fragment
          |   _____________________|__
         / \ /                        \
         urn:example:animal:ferret:nose

and the URL parsing library I have parses those as:

  ['foo://example.com:8042/over/there?name=ferret#nose'] =
  {
    fragment = "nose",
    query = "name=ferret",
    path = "/over/there",
    scheme = "foo",
    port = 8042.000000,
    host = "example.com",
  }

  ['urn:example:animal:ferret:nose'] =
  {
    path = "example:animal:ferret:nose",
    scheme = "urn",
  }

and because I like belaboring the inanimate equus pleonastically:

  ["//example.com/path/to/resource"] =
  {
    host = "example.com",
    path = "/path/to/resource",
  }

  ["/example.com/path/to/resource"] =
  {
    path = "/example.com/path/to/resource",
  }

  ["example.com/path/to/resource"] =
  {
    path = "example.com/path/to/resource",
  }

  You should try those with the Go URL parser you use and see what YOU get.

  -spc
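For readers who want to repeat the experiment without a Go toolchain, here is the same set of strings run through a different parser, Python's urllib.parse.urlsplit. This is a stand-in for the Go library named above, not a claim about how Go behaves, and the `components` helper is made up for illustration; it simply drops empty fields so the result reads like the tables above.

```python
from urllib.parse import urlsplit

SAMPLES = [
    "foo://example.com:8042/over/there?name=ferret#nose",
    "urn:example:animal:ferret:nose",
    "//example.com/path/to/resource",
    "/example.com/path/to/resource",
    "example.com/path/to/resource",
]


def components(text):
    """Split `text` and keep only the components that are present."""
    parts = urlsplit(text)
    found = {
        "scheme": parts.scheme,
        "host": parts.hostname,
        "port": parts.port,
        "path": parts.path,
        "query": parts.query,
        "fragment": parts.fragment,
    }
    # urlsplit reports absent components as "" or None; drop those.
    return {name: value for name, value in found.items()
            if value not in ("", None)}


for sample in SAMPLES:
    print(sample)
    for name, value in components(sample).items():
        print(f"    {name} = {value!r}")
```

Only the string beginning with "//" yields a host; the last two come back as a path and nothing else.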
On 2020-11-17 (Tuesday) at 22:10, Sean Conner <sean at conman.org> wrote: > > //example.com/path/to/resource > > IS A VALID URI! IT IS NOT A HACK! What part of the ABNF do you not > understand? [snip] > Again, > > //example.com/path/to/resource > > IS NOT A HACK! [snip] > -spc > Hear, hear! I was only going to list the Regex implementation[1] at the end of the RFC as proof that this wasn't a hack, but I appreciate your thoroughness in explanation. This is, in fact, why I brought it up (apparently, again, sorry about that) at all -- the current gemini spec is incompatible in this way with the URI spec. Since a goal of gemini is stated as not reinventing the wheel (okay, citation needed, but I think it's pretty much the ~feeling~ around here), we should stick to the pre-existing spec as much as possible. I liked the suggested solution from spc (the multiple ones, they're all fine, in fact!) for the update in the spec. I sincerely hope that 99% of geminauts are using URLs as we've discussed here, and I just want the spec to reflect their correct usage. [1]: https://tools.ietf.org/html/rfc3986#appendix-B -- ~ acdw acdw.net | breadpunk.club/~breadw
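The regular expression in RFC 3986 Appendix B is short enough to try directly. A minimal sketch in Python follows; the pattern is the one printed in the RFC, while the `split_reference` helper and its dictionary keys are names chosen here for illustration. Components that are absent come back as None.

```python
import re

# Pattern from RFC 3986, Appendix B ("Parsing a URI Reference with a
# Regular Expression"). Groups 2, 4, 5, 7 and 9 hold the components.
URI_REFERENCE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_reference(text):
    """Break a URI reference into its five generic components."""
    match = URI_REFERENCE.match(text)  # every group is optional, so this always matches
    return {
        "scheme": match.group(2),
        "authority": match.group(4),
        "path": match.group(5),
        "query": match.group(7),
        "fragment": match.group(9),
    }


for sample in (
    "gemini://example.com/path",
    "//example.com/path/to/resource",
    "example.com/path/to/resource",
):
    print(sample, "->", split_reference(sample))
```

The second sample has an authority and no scheme; the third has neither, so the whole string is the path.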
On Tue, Nov 17, 2020 at 5:10 PM Sean Conner <sean at conman.org> wrote:

> The path parsing rules state a single slash. Not '/'+, nor '/'*, but a
> single '/'. The only place where more than a single slash is allowed PER
> THE @#%@#$@$ ABNF is just prior to the authority, which contains the
> hostname. THE ONLY PLACE!

Correct.

> I will also draw your attention to the URI-reference rule, which is there
> for some reason, which allows both a full URI, or a RELATIVE URI, which
> means that
>
> //example.com/path/to/resource
>
> IS A VALID URI! IT IS NOT A HACK! What part of the ABNF do you not
> understand?

Nope. It is a valid URI reference, because it is a valid relative
reference. It is *not* a valid URI.

In what follows, I am going to assume that "URL" and "URI" are synonymous,
which they have been for 15 years since RFC 3986 was published.

> No, the spec allows both the full URI, and a relative URI as long as it
> starts with '//' (it has the authority section). The wording in the spec
> is bad and should be changed to clarify it, but that's the current
> specification.

There are two cases:

1) In a Gemini-protocol request line (section 2), the second sentence says
that an absolute URL (that is, a URI without a fragment identifier) is
required. The third sentence says that if the "scheme://" portion is
missing (in which case it is not a URI, much less an absolute URI), it
should be prefixed with "gemini://" and presumably reparsed. That's
straightforward.

2) In a link line (section 5.4.2), we are told that there may be an
absolute or a relative URL. There are no relative URIs, so we can only
interpret this as meaning a relative reference. We are also told that if
the URL lacks a scheme (which is impossible: a URI always has a scheme)
then the scheme is "gemini".

Now suppose a link line in a resource that is available from
"gemini://example.com/public/this.gmi" has the form "foo/bar/baz.gmi". We
can interpret this in one of two incompatible ways:

2a) a truncated version of "gemini://foo/bar/baz.gmi". Note that "foo" is
a perfectly valid host name.

2b) a relative reference, in which case it resolves to
"gemini://example.com/public/foo/bar/baz.gmi".

So the spec is self-contradictory. In my view interpretation 2a is bogus
and the sentence "If the URL does not include a scheme, a scheme of
gemini:// is implied" in section 5.4.2 should be removed. What is more, I
would like to see the equivalent sentence "If the scheme of the URL is not
specified, a scheme of gemini:// is implied" in section 2 removed as well.

> > but that's what it is, other URI parsers that are more strict with
> > compliance to the RFC will refuse to parse a URI without scheme
> > present,
>
> If it does, it's broken by design. Again, see the ABNF above.

It is precisely the ABNF line in RFC 3986 section 3 that says a URI (as
opposed to a URI reference) has to begin with a scheme.

John Cowan          http://vrici.lojban.org/~cowan        cowan at ccil.org
It's the old, old story. Droid meets droid. Droid becomes chameleon.
Droid loses chameleon, chameleon becomes blob, droid gets blob back
again. It's a classic tale. --Kryten, Red Dwarf
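The two readings, 2a and 2b, can be put side by side in a few lines. This is a sketch under stated assumptions: the function names are invented here, and Python's urljoin will only resolve references for schemes in its module-level registries, so "gemini" is added to them first as a library workaround.

```python
import urllib.parse
from urllib.parse import urljoin, urlsplit

# Workaround for this library only: without it urljoin leaves references
# against a gemini:// base unresolved.
for registry in (urllib.parse.uses_relative, urllib.parse.uses_netloc):
    if "gemini" not in registry:
        registry.append("gemini")

BASE = "gemini://example.com/public/this.gmi"
LINK = "foo/bar/baz.gmi"


def as_truncated_url(link):
    """Reading 2a: treat the link as a URL with "gemini://" chopped off."""
    return "gemini://" + link


def as_relative_reference(base, link):
    """Reading 2b: resolve the link against the URL of its document."""
    return urljoin(base, link)


reading_2a = as_truncated_url(LINK)
reading_2b = as_relative_reference(BASE, LINK)

print("2a:", reading_2a, "(host:", urlsplit(reading_2a).hostname + ")")
print("2b:", reading_2b, "(host:", urlsplit(reading_2b).hostname + ")")
```

The same link text ends up on host "foo" under 2a and on "example.com" under 2b, so a client and an author who pick different readings will disagree about where the link goes.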
> On Nov 17, 2020, at 23:10, Sean Conner <sean at conman.org> wrote: > > belaboring the inanimate equus Ohhh... pig latin, my favorite! Oggingflay away eadday orsehay! Quidquid latine dictum sit, altum sonatur!
I'm going to use a real-world example here because people seem to not get why this may be a problem. Let's say I want to start hosting the git repo for my utility gemlog.sh[a] on gemini. I make a directory on my site, so the full url would be `gemini://nytpu.com/gemlog.sh/`. Now, say I put a link in my root index.gmi (`/`) linking to `gemlog.sh`[b]. This is a perfectly valid link to a directory on my server, but this would instead be interpreted as the url `gemini://gemlog.sh/` if you use the faulty method of parsing. (`.sh` is a valid TLD[c] so it wouldn't work even if you have a whitelist of tlds). Now, there's a few options to prevent this from happening: 1) Ban periods in all file & directory names. You'd also have to ban it in filenames, because what if I make the relative link to a file called `command.com`? Requires large, breaking spec changes. 2) Instead of documents being served as-is and having clients parse urls, instead force servers to rewrite all urls, checking if it is a valid directory or not before serving. All clients only expect well- formed, full urls, and all existing server implementations are in violation. Requires large, breaking spec changes. 3) Require that links to directories must not be relative if they could be confused as a uri host. This is an inconsistent, quick fix that is very ambiguous, because one client may think it's a valid host while others may not. It also puts the burden on the authors of documents, because now they have to remember when relative links are allowed and when they aren't, and test their documents on a variety of clients to ensure that it is compatible with all their parsing methods. Requires large, breaking spec changes. 4) Follow the carefully and clearly defined specification[d] that is over 15 years old and is well-adopted by existing uri parsing libraries. Requires minimal, non-breaking spec changes, purely for clarity. I know which one I'd choose. 
Obviously option 1 is the only real option here, the outlandish ones like option 4 just make no sense. [a]: https://tildegit.org/nytpu/gemlog.sh [b]: so the full line would read: `=> gemlog.sh a utility for managing gemlogs from the command line` [c]: https://en.wikipedia.org/wiki/.sh [d]: https://tools.ietf.org/html/rfc3986 -- Alex // nytpu alex at nytpu.com GPG Key: https://www.nytpu.com/files/pubkey.asc Key fingerprint: 43A5 890C EE85 EA1F 8C88 9492 ECCD C07B 337B 8F5B https://useplaintext.email/
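To make the `gemlog.sh` example concrete, here is a minimal Python sketch of resolving a link target the RFC 3986 way. The helper name `resolve_link` is hypothetical, and the "http" stand-in is a workaround chosen here because Python's `urljoin` only resolves relative references for schemes it already knows about; it is not something the Gemini spec prescribes.

```python
from urllib.parse import urljoin, urlsplit


def resolve_link(base: str, target: str) -> str:
    """Resolve a text/gemini link target against the URL of the page it is on."""
    if urlsplit(target).scheme:
        # Already absolute (gemini://..., https://..., mailto:...): leave as-is.
        return target
    # Assumption: urljoin does not know "gemini", so borrow "http" while
    # joining and put the original scheme back afterwards.
    base_parts = urlsplit(base)
    stand_in = base_parts._replace(scheme="http").geturl()
    joined = urlsplit(urljoin(stand_in, target))
    return joined._replace(scheme=base_parts.scheme).geturl()


page = "gemini://nytpu.com/"
print(resolve_link(page, "gemlog.sh"))         # a directory on the same server
print(resolve_link(page, "//gemlog.sh/"))      # a different host, same scheme
print(resolve_link(page, "example.com/path"))  # still a relative path
```

Under these rules only a target that starts with `//` (or carries its own scheme) can name another host; a bare `gemlog.sh` always stays on the current server.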
On Tue, 17 Nov 2020 17:45:50 -0500 John Cowan <cowan at ccil.org> wrote: > In what follows, I am going to assume that "URL" and "URI" are synonymous, > which they have been for 15 years since RFC 3986 was published. That may not be an entirely uncontroversial assumption. URLs were AFAIK last defined by the IETF in RFC 1808, where relative URLs were first specified and the distinction became necessary. In RFC 1808, an URL is either an absolute URL or a relative URL (analogous to relative-ref). In that sense, an URL is rather analogous with URI-reference of RFC 3986. I completely agree on all other points, and the point above is only further reason for clarification. What is and isn't an URL is a bit loosey-goosey throughout, which is why RFC 3986 is welcome. -- Philip
On Tue, Nov 17, 2020 at 6:07 PM Alex // nytpu <alex at nytpu.com> wrote:

> Let's say I want to start hosting the git repo for my utility
> gemlog.sh[a] on gemini. I make a directory on my site, so the full url
> would be `gemini://nytpu.com/gemlog.sh/`. Now, say I put a link in my
> root index.gmi (`/`) linking to `gemlog.sh`[b]. This is a perfectly
> valid link to a directory on my server, but this would instead be
> interpreted as the url `gemini://gemlog.sh/` if you use the faulty
> method of parsing. (`.sh` is a valid TLD[c] so it wouldn't work even if
> you have a whitelist of tlds).

In any case, nothing says a hostname has to be absolute. If your hostname is "client.example.com" then you can refer to "server.example.com" as simply "server". The only way to tell if "server" is a meaningful host is to ask the DNS, and the answer can change.

> 4) Follow the carefully and clearly defined specification[d] that is
> over 15 years old and is well-adopted by existing uri parsing libraries.
> Requires minimal, non-breaking spec changes, purely for clarity.

Requires a small breaking spec change to remove the sentence about defaulting to "gemini://" in 5.4.2 and preferably in 2 as well. But 5.4.2 is self-contradictory and has to be fixed.

My proposal is to rewrite section 2 to say this:

    <URL> is an absolute URL according to RFC 3986, of maximum length 1024 bytes.

And to rewrite section 5.4.2 to say this:

    <URL> is a URI reference according to RFC 3986.
On Tue, 17 Nov 2020 16:07:08 -0700 Alex // nytpu <alex at nytpu.com> wrote: > I'm going to use a real-world example here because people seem to not > get why this may be a problem. > > Let's say I want to start hosting the git repo for my utility > gemlog.sh[a] on gemini. I make a directory on my site, so the full url > would be `gemini://nytpu.com/gemlog.sh/`. Now, say I put a link in my > root index.gmi (`/`) linking to `gemlog.sh`[b]. This is a perfectly > valid link to a directory on my server, but this would instead be > interpreted as the url `gemini://gemlog.sh/` if you use the faulty > method of parsing. (`.sh` is a valid TLD[c] so it wouldn't work even if > you have a whitelist of tlds). I think that we all actually agree that this can't possibly work for links. What Ali Fardan is suggesting is to allow suffix references only in requests, where the ambiguity could be avoided for the simple reason that the request must contain an authority. I completely disagree that suffix references should be used anywhere, but the suggestion is not quite so outlandish as to require any of options 1-3. It should be avoided for the simple reason that it precludes option 4. -- Philip
It was thus said that the Great John Cowan once stated:
>
> Requires a small breaking spec change to remove the sentence about
> defaulting to "gemini://" in 5.4.2 and preferably in 2 as well. But 5.4.2
> is self-contradictory and has to be fixed.
>
> My proposal is to rewrite section 2 to say this:
>
> <URL> is an absolute URL according to RFC 3986, of maximum length 1024
> bytes.
>
> And to rewrite section 5.4.2 to say this:
>
> <URL> is a URI reference according to RFC 3986.

I've gone over the past month of logs [1] on my Gemini server and pulled some stats.

Total number of requests:            103,422
Total number of schemeless requests:     275

And of the schemeless requests:

client #1     2 requests
client #2     3 requests
client #3   270 requests

Given the relative rarity of such requests (roughly 0.27% of all requests) and the share of clients making schemeless requests (between 0.3% and 8% [2]), I would agree with this proposal. A Gemini request is an absolute URL (per RFC-3986).

-spc

[1] It's all I keep

[2] Okay, on the Gemini software page [3], I count 37 known clients. There are some others not listed, like CAPCOM, Spacewalk and GUS, but even excluding those, 3 out of 37 is 8%. And assuming that all 1,187 unique IP addresses were using a unique client, then the percentage falls to 0.3%. The truth is somewhere in between. Also, my server probably gets hit by *every* client, as it serves up the Gemini Client Torture test.

[3] https://portal.mozz.us/gemini/gemini.circumlunar.space/software/
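For anyone who wants to check the percentages, the arithmetic can be reproduced with a few lines of Python. The variable names are made up for this sketch; the counts are the ones quoted in the message above.

```python
# Counts quoted in the message above.
total_requests = 103_422
schemeless_requests = 275
schemeless_clients = 3
known_clients = 37       # clients counted on the Gemini software page
unique_addresses = 1_187  # assumption for the lower bound: one client per IP

request_share = schemeless_requests / total_requests * 100
client_share_upper = schemeless_clients / known_clients * 100
client_share_lower = schemeless_clients / unique_addresses * 100

print(f"schemeless requests: {request_share:.2f}%")
print(f"schemeless clients:  {client_share_lower:.2f}% to {client_share_upper:.1f}%")
```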
Sean Conner wrote:

> The path parsing rules state a single slash. Not '/'+, nor '/'*, but a
> single '/'. The only place where more than a single slash is allowed PER
> THE @#%@#$@$ ABNF is just prior to the authority, which contains the
> hostname. THE ONLY PLACE!

I am currently working on a bug in Lagrange concerning this question. It appeared to me that multiple consecutive slashes might also be allowed in the query, according to the ABNF, but I may be very wrong there.
It was thus said that the Great Waweic once stated:
> Sean Conner wrote:
>
> > The path parsing rules state a single slash. Not '/'+, nor '/'*, but a
> > single '/'. The only place where more than a single slash is allowed PER
> > THE @#%@#$@$ ABNF is just prior to the authority, which contains the
> > hostname. THE ONLY PLACE!
>
> I am currently working on a bug in lagrange concerning this question.
> It appeared to me, that multiple consecutive slashes might also be
> allowed in the query, according to the ABNF, but I may be very wrong
> there.

In the query section, yes, it should be. In the path section, it should be disallowed. Unfortunately, I checked the ABNF in RFC-3986 and it does appear to allow double slashes in the path section. The rules in question:

    path-abempty  = *( "/" segment )
    path-absolute = "/" [ segment-nz *( "/" segment ) ]
    path-noscheme = segment-nz-nc *( "/" segment )
    path-rootless = segment-nz *( "/" segment )
    segment       = *pchar

A segment can be 0 or more characters, so per the spec, you could end up with multiple slashes, and the URL parsing library I use, written against the ABNF of RFC-3986, does in fact accept it:

    ["path//to//resource"] = { path = "path//to//resource", }

There's nothing in the errata [1] about this, but it seems like it should be fixed.

-spc

[1] https://www.rfc-editor.org/errata_search.php?rfc=3986
On Tue, 17 Nov 2020 21:02:09 -0500 Sean Conner <sean at conman.org> wrote:

> There's nothing in the errata [1] about this, but it seems like it should
> be fixed.

Nothing needs to be fixed. Zero length path segments are allowed in some circumstances, but they are never allowed in a circumstance where they could cause ambiguities. For this purpose, there are multiple definitions of path segments, with -nz (non-empty) and -nz-nc (non-empty, no colon) suffixes:

    path-abempty  = *( "/" segment )
    path-absolute = "/" [ segment-nz *( "/" segment ) ]
    path-noscheme = segment-nz-nc *( "/" segment )
    path-rootless = segment-nz *( "/" segment )
    path-empty    = 0<pchar>

You can see that relative-ref is designed in such a way as to disallow any ambiguity, by only allowing path-absolute (which starts with a single slash and a non-empty segment), path-noscheme (which starts with a non-empty segment not containing a colon) or path-empty (which is zero characters):

    relative-ref  = relative-part [ "?" query ] [ "#" fragment ]
    relative-part = "//" authority path-abempty
                  / path-absolute
                  / path-noscheme
                  / path-empty

The "path" definition itself can not be distinguished from a relative-ref or relative-part, but the path definition is never used by any other definition in the document. If parsing a relative-ref or URI-reference, this is never a problem.

-- Philip
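The point about ambiguity can be seen with any generic splitter. The sketch below uses Python's `urllib.parse.urlsplit`, which is a lenient splitter rather than a strict RFC 3986 validator, so treat it as an illustration only; the helper name `describe` is made up for this example.

```python
from urllib.parse import urlsplit


def describe(reference: str) -> tuple[str, str]:
    """Return (authority, path) as Python's generic splitter sees them."""
    parts = urlsplit(reference)
    return parts.netloc, parts.path


# Empty segments in the middle of a path are kept as part of the path.
print(describe("path//to//resource"))
print(describe("/path//to//resource"))

# A leading "//" is never read as an empty first segment: it introduces
# the authority, and whatever follows the authority is the path.
print(describe("//to//resource"))
```

Only the last reference yields a non-empty authority (`to`), which is the one position where the grammar gives a double slash a special meaning.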
While you are discussing about the specs, please have a look at how the servers are currently responding to the edge cases. http://ix.io/2EyQ Request -> Response (first line only) The list of known servers from gemini://gus.guru/known-hosts : removed all non existent servers and *.flounder.online Test yourself: http://ix.io/2Etk And if you can, forgive my madness.
It was thus said that the Great Sudipto Mallick once stated:
> While you are discussing about the specs, please have a look at how
> the servers are currently responding to the edge cases.
>
> http://ix.io/2EyQ
>
> Request -> Response (first line only)
> The list of known servers from gemini://gus.guru/known-hosts : removed
> all non existent servers and *.flounder.online
> Test yourself: http://ix.io/2Etk
>
> And if you can, forgive my madness.

Thank you for running this and reporting the results. I can describe why you got the results for my server: gemini.conman.org

    gemini.conman.org   -> 59 Bad Request
    gemini.conman.org/  -> 59 Bad Request
    gemini.conman.org// -> 59 Bad Request

These are bad because there's neither a scheme nor an authority (missing a '//') and thus, these are marked as a bad request.

    //gemini.conman.org   -> 20 text/gemini
    //gemini.conman.org/  -> 20 text/gemini
    //gemini.conman.org// -> 59 Bad Request

These are missing the scheme, but have an authority section [1]. The URL parser I use adds a '/' for the path if the path does not exist. That's why my server does not do a 31-redirect with a missing '/' at the end. The double slash at the end is being checked by a modified path-abempty rule. The ABNF from the RFC is:

    path-abempty = *( "/" segment )

while the URL parser I'm using is doing:

    path_abempty <- {~ ( '/' segment)+ ~} / '' -> '/'

The parsing code is in LPEG [2] and is equivalent to

    path-abempty = +( "/" segment) / 0<pchar> # and return a '/'

and was written that way to fix an issue inherent with the ABNF of "0<pchar>" and how parsing works with LPEG. I can go into details of LPEG if anyone is interested, but suffice to say, the path_abempty of LPEG is different from the ABNF of the RFC for a good reason, and this is why the trailing '//' after the authority section does not parse.
    gemini://gemini.conman.org   -> 20 text/gemini
    gemini://gemini.conman.org/  -> 20 text/gemini
    gemini://gemini.conman.org// -> 59 Bad Request

A more normal request, and the same explanation from above. No surprises for my server (at least, to me). A more interesting response is from blekksprut.net and cadence.moe:

    blekksprut.net            -> 20 text/gemini
    blekksprut.net/           -> 20 text/gemini
    blekksprut.net//          -> 20 text/gemini
    //blekksprut.net          -> 51 not found
    //blekksprut.net/         -> 51 not found
    //blekksprut.net//        -> 51 not found
    gemini://blekksprut.net   -> 20 text/gemini
    gemini://blekksprut.net/  -> 20 text/gemini
    gemini://blekksprut.net// -> 20 text/gemini

    cadence.moe            -> 20 text/gemini; charset=utf-8; lang=en
    cadence.moe/           -> 20 text/gemini; charset=utf-8; lang=en
    cadence.moe//          -> 20 text/gemini; charset=utf-8; lang=en
    //cadence.moe          -> 50 Bliz server: Not found: //cadence.moe
    //cadence.moe/         -> 50 Bliz server: Not found: //cadence.moe/
    //cadence.moe//        -> 50 Bliz server: Not found: //cadence.moe//
    gemini://cadence.moe   -> 20 text/gemini; charset=utf-8; lang=en
    gemini://cadence.moe/  -> 20 text/gemini; charset=utf-8; lang=en
    gemini://cadence.moe// -> 20 text/gemini; charset=utf-8; lang=en

These results probably stem from the same issue, but possibly different servers. Just going quickly through the results, if there was no problem with the first grouping (just the domain name), it seems the servers *have* an issue with the second grouping (leading '//'). Odd.

Again, thanks for this.

-spc

[1] I've been debating if I should mark a missing scheme as a "bad request" as I've come around to support that a Gemini server should ONLY accept an absolute URL. I haven't ... yet.

[2] Lua Parsing Expression Grammar
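As a rough illustration of the stricter policy discussed in this thread (a request must be an absolute URL), here is a small Python sketch. It is not the code of any server mentioned here: `request_status`, `MAX_URL_BYTES` and the `reject_empty_segments` switch are names invented for this example, and rejecting empty path segments is one server's choice rather than something RFC 3986 demands.

```python
from urllib.parse import urlsplit

MAX_URL_BYTES = 1024  # length limit quoted from section 2 of the spec


def request_status(request_url: str, reject_empty_segments: bool = True) -> int:
    """Return 20 if the request is an acceptable absolute URL, otherwise 59."""
    if len(request_url.encode("utf-8")) > MAX_URL_BYTES:
        return 59
    try:
        parts = urlsplit(request_url)
    except ValueError:
        # urlsplit raises on malformed bracketed hosts.
        return 59
    # Strict policy: both a scheme and an authority must be present.
    # Simplification: other schemes also get 59 here; a real server
    # might prefer 53 for a refused proxy request.
    if parts.scheme != "gemini" or not parts.netloc:
        return 59
    # Optional, server-specific rule: no empty segments such as "//".
    if reject_empty_segments and "//" in parts.path:
        return 59
    return 20


for url in ("gemini.conman.org", "//gemini.conman.org/",
            "gemini://gemini.conman.org", "gemini://gemini.conman.org//"):
    print(url, "->", request_status(url))
```

Passing `reject_empty_segments=False` gives the more permissive reading of the ABNF, where `gemini://gemini.conman.org//` is accepted.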
On Wed, Nov 18, 2020 at 03:42:57AM -0500, Sean Conner wrote: > A more normal request, and the same explanation from above. No surprises > for my server (at least, to me). A more interesting response is from > blekksprut.net and cadence.moe: > > [...] > > These results probably stem from a same issue, but possibly different > servers. They're definitely different servers - blekksprut.net is running on my own code... The results were a little surprising, so I'm going to be doing some bugfixing again tonight ;) > Again, thanks for this. Seconding the thanks, this is great stuff! bie
Statistics from the data I collected:

request
        response code -> percentage
        :
        :

"$host"
        59 -> 55%
        53 -> 22%
        20 -> 4.8%

"$host/"
        59 -> 55%
        53 -> 22%
        51 -> 7.7%
        20 -> 6.8%

"$host//"
        59 -> 55%
        53 -> 22%
        51 -> 7.7%
        20 -> 6.4%

"//$host"
        31 -> 55% (!)
        20 -> 29%
        59 -> 12%
        51 -> 7.7%

"//$host/"
        20 -> 67%
        59 -> 12%
        51 -> 7%
        53 -> 2%
        50 -> 2%

"//$host/"
        20 -> 61%
        59 -> 15%
        51 -> 10%

"gemini://$host"
        31 -> 57.6% (!!)
        20 -> 34%
        30 -> 1.6%

"gemini://$host/"
        20 -> 93%

"gemini://$host//"
        20 -> 84%
        51 -> 6%

out of http://ix.io/2EzQ
Clicked send button too fast... the second "//$host/" should be "//$host//"
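On Wed, 18 Nov 2020 20:59:53 +0530 Sudipto Mallick <smallick.dev at gmail.com> wrote:

Very interesting and good summary, Sudipto.

> "//$host"
> 31 -> 55% (!)
> 20 -> 29%
> 59 -> 12%
> 51 -> 7.7%

There is probably some overlap here with hosts that generally serve redirects for empty paths.

> "gemini://$host"
> 31 -> 57.6% (!!)
> 20 -> 34%
> 30 -> 1.6%
>
> "gemini://$host/"
> 20 -> 93%

This is alarming IMO. I have expressed it before in the mailing list, but because of the normalization rules of RFC 3986, an empty path is *equivalent* to the path "/". Serving a 3x redirect on one and a page on the other is wrong.

In this case it's likely rather benign that they serve different content, because I assume that a client will arrive at the same resource after following a redirect, but it has to be understood that a client might make these generalizations as well, in which case that client can't access the resource that's served when requesting an empty path.

It would be interesting to figure out which server software is the culprit.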
2020/11/18 16:46, Philip Linde: >> "gemini://$host" >> 31 -> 57.6% (!!) >> 20 -> 34% >> 30 -> 1.6% >> >> "gemini://$host/" >> 20 -> 93% > > This is alarming IMO. I have expressed it before in the mailing list, > but because of the normalization rules of RFC 3986, an empty path is > *equivalent* to the path "/". Serving a 3x redirect on one and a page on > the other is wrong. > > In this case it's likely rather benign that they serve different > content, because I assume that a client will arrive at the same resource > after following a redirect, but it has to be understood that a client > might make these generalizations as well, in which case that client > can't access the resource that's served when requesting an empty path. > > It would be interesting to figure out which server software is the > culprit. If you want to point fingers: https://github.com/michael-lazar/gemini-diagnostics/blob/master/gemini-diagnostics#L440 That's what I based my implementation on and I suspect many others did so too. R.
It was thus said that the Great Remco once stated:
> 2020/11/18 16:46, Philip Linde:
>
> >> "gemini://$host"
> >> 31 -> 57.6% (!!)
> >> 20 -> 34%
> >> 30 -> 1.6%
> >>
> >> "gemini://$host/"
> >> 20 -> 93%
> >
> > This is alarming IMO. I have expressed it before in the mailing list,
> > but because of the normalization rules of RFC 3986, an empty path is
> > *equivalent* to the path "/". Serving a 3x redirect on one and a page on
> > the other is wrong.
> >
> > In this case it's likely rather benign that they serve different
> > content, because I assume that a client will arrive at the same resource
> > after following a redirect, but it has to be understood that a client
> > might make these generalizations as well, in which case that client
> > can't access the resource that's served when requesting an empty path.
> >
> > It would be interesting to figure out which server software is the
> > culprit.
>
> If you want to point fingers:
>
> https://github.com/michael-lazar/gemini-diagnostics/blob/master/gemini-diagnostics#L440
>
> That's what I based my implementation on and I suspect many others did
> so too.

The test isn't *wrong* per se, it's just testing at the wrong level. My server will return:

    gemini://gemini.conman.org  -> 20
    gemini://gemini.conman.org/ -> 20

but

    gemini://gemini.conman.org/test -> 31 gemini://gemini.conman.org/test/

which is what that test is testing.

-spc
On Wed, Nov 18, 2020 at 4:10 PM Sean Conner <sean at conman.org> wrote:
>
> It was thus said that the Great Remco once stated:
> >
> > If you want to point fingers:
> >
> > https://github.com/michael-lazar/gemini-diagnostics/blob/master/gemini-diagnostics#L440
> >
> > That's what I based my implementation on and I suspect many others did
> > so too.
>
> The test isn't *wrong* per se, it's just testing at the wrong level. My
> server will return:
>
> gemini://gemini.conman.org -> 20
> gemini://gemini.conman.org/ -> 20
>
> but
>
> gemini://gemini.conman.org/test -> 31 gemini://gemini.conman.org/test/
>
> which is what that test is testing.
>
> -spc

Yes that's probably what I meant to do. It was difficult to write many of the tests because they can't assume that any particular directory exists on the server. I didn't realize that the root URL was special in this regard.

I think this is an interesting problem for the gemini protocol. In HTTP you typically only have one way to write out this request so it's never a problem:

    GET / HTTP/1.1

Even though "gemini://example.com" and "gemini://example.com/" are supposed to be identical per the URL definition, good luck getting gemini developers to read through 100+ pages of RFCs and implement this correctly.

- Michael
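For what it's worth, the root-path equivalence discussed above is small enough to handle in a few lines. The following is only a sketch in Python; `normalize_root` is a hypothetical helper name, and it covers just the one rule in question (an empty path after an authority is normalized to "/"), not the rest of RFC 3986 normalization.

```python
from urllib.parse import urlsplit


def normalize_root(url: str) -> str:
    """Rewrite an empty path after an authority to "/", leaving the rest alone."""
    parts = urlsplit(url)
    if parts.netloc and parts.path == "":
        parts = parts._replace(path="/")
    return parts.geturl()


# Both spellings of the root now compare equal...
print(normalize_root("gemini://example.com") == normalize_root("gemini://example.com/"))
# ...while /test and /test/ stay distinct, so a 31 redirect between them is fine.
print(normalize_root("gemini://example.com/test"))
```

A server or client that applies this before comparing or routing requests would treat `gemini://example.com` and `gemini://example.com/` as the same resource.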
---