Hello guys, I am not normally one to get involved in the Gemini spec. I got into a discussion with Martin about how UTF-8 characters are supposed to be handled in links in Gemini documents.

On my site I have a link in a document like this:

=> logarion/taĝikio--lando-montara.gmi Taĝikio: Lando Montara

This refers locally to the actual file name on the disk. The main question is, is this allowed in Gemini documents? I thought this should work because I believe that Gemini is UTF-8 native or by default, and my Unix file system (in this case FreeBSD UFS) appears to be in agreement.

The question is whether clients must support this as well. In my experience so far all of them do, but Martin's client seems to reject this link because it contains non-ASCII characters, and doesn't handle it. He said that RFC 3986 does not allow links to have non-ASCII characters, but perhaps this isn't relevant to Gemini's internal encoding and document format, but rather to exported URIs (i.e. made universal).

It does seem a proper URI should best contain %C4%9D in place of ĝ, but the question is whether I should change it in the document? Does the internal linking (in my case the link is local/relative) even count as a URI?

Ben
--
gemini://kwiecien.us/

On Fri, 2020-07-17 at 13:51 +0430, Ben wrote:
> It does seem a proper URI should best contain %C4%9D in place of ĝ, but
> the question is whether I should change it in the document? Does the
> internal linking (in my case the link is local/relative) even count as a
> URI?

I think you should change it to %C4%9D; otherwise you're relying on clients to do it for you (which might work).

But in the example you give, the stuff after => is a URL, just not an absolute one. It's a relative URL, and the client is supposed to know how to combine it with the current URL and to request this document if the user wants to follow the link. So if the current URL is gemini://example.org/foo/bar then following the link below will take the user to gemini://example.org/foo/logarion/ta%C4%9Dikio.

=> logarion/taĝikio--lando-montara.gmi Taĝikio: Lando Montara

A good way to think about this would be spaces in file names. Assume the filename is "taĝikio: lando montara.gmi". What would you write? This won't work:

=> logarion/taĝikio: lando montara.gmi Taĝikio: Lando Montara

If you escape the spaces, why not escape the rest that needs escaping?

=> logarion/ta%C4%9Dikio:%20lando%20montara.gmi Taĝikio: Lando Montara

That's how I reason about it.

Or if you want to go all-in, RFC 3986 has you covered. The only characters that unambiguously never have to be escaped, no matter where they appear, are the unreserved ones:

    Characters that are allowed in a URI but do not have a reserved
    purpose are called unreserved. These include uppercase and lowercase
    letters, decimal digits, hyphen, period, underscore, and tilde.

    unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"

    https://tools.ietf.org/html/rfc3986#section-2.3

And to be clear, ALPHA means a-z and A-Z, nothing else.

Cheers
Alex
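
The escaping and the relative-link resolution described in the message above can be sketched in Python. This is only an illustration, assuming the non-ASCII letter in the file name is ĝ (U+011D), whose UTF-8 bytes are C4 9D. The helpers `encode_link_path` and `resolve_simple` are hypothetical names invented here, not part of any Gemini client or library, and `resolve_simple` deliberately handles only plain relative paths (no "..", query, or fragment). It avoids `urllib.parse.urljoin` because that function only resolves references for schemes it already knows about, and gemini is not among them.

```python
import string
from urllib.parse import quote, urlsplit

# RFC 3986 section 2.3: ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def encode_link_path(path: str, safe: str = "/") -> str:
    """Hypothetical helper: percent-encode a link path.

    Characters listed in `safe` and the unreserved set are kept as-is;
    everything else is UTF-8 encoded and written as %XX escapes.
    """
    return quote(path, safe=safe, encoding="utf-8")


def resolve_simple(base: str, ref: str) -> str:
    """Hypothetical helper: resolve a plain relative path against `base`.

    Simplified on purpose: drops the last segment of the base path and
    appends `ref`. Dot segments, queries and fragments are not handled.
    """
    parts = urlsplit(base)
    base_dir = parts.path.rsplit("/", 1)[0]
    return f"{parts.scheme}://{parts.netloc}{base_dir}/{ref}"


raw = "logarion/taĝikio--lando-montara.gmi"
encoded = encode_link_path(raw)
resolved = resolve_simple("gemini://example.org/foo/bar", encoded)

print("ĝ" in UNRESERVED)  # False: ĝ is not ALPHA in the RFC 3986 sense
print(encoded)            # logarion/ta%C4%9Dikio--lando-montara.gmi
print(resolved)           # gemini://example.org/foo/logarion/ta%C4%9Dikio--lando-montara.gmi

# The file name with spaces, keeping ":" literal as in the message above.
print(encode_link_path("logarion/taĝikio: lando montara.gmi", safe="/:"))
# logarion/ta%C4%9Dikio:%20lando%20montara.gmi
```

Writing the encoded form in the document means the link no longer depends on each client doing this step itself.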

On Fri Jul 17, 2020 at 11:21 AM CEST, Ben wrote:
> Hello guys, I am not normally one to get involved in the Gemini spec. I
> got into a discussion with Martin about how UTF-8 characters are
> supposed to be handled in links in Gemini documents.
>
> On my site I have a link in a document like this:
>
> => logarion/taĝikio--lando-montara.gmi Taĝikio: Lando Montara
>
> This refers locally to the actual file name on the disk. The main
> question is, is this allowed in Gemini documents? I thought this should
> work because I believe that Gemini is UTF-8 native or by default, and my
> Unix file system (in this case FreeBSD UFS) appears to be in agreement.

Aaah, I figured we were going to have to deal with this sooner or later. This has been one of those few remaining unpleasant details in the back of my mind that I know needs to get sorted out. It's because of the existence of things like this that I'm so averse to adding anything new to the spec: it runs the risk of introducing more things like this, which aren't obvious at first but then come up only after a few months of use.

The spec currently uses language like "UTF-8 encoded absolute URL", which I have to admit has been there since the very earliest version and which I wrote without any kind of deeper awareness of how this intersected with existing RFCs. I've since come to realise that it's very possible that this language is ambiguous at best, and contradictory at worst.

I suspect this is going to need a bit of reading and thinking to come up with a clear stance on and to make appropriate changes to the spec...

> It does seem a proper URI should best contain %C4%9D in place of ĝ, but
> the question is whether I should change it in the document? Does the
> internal linking (in my case the link is local/relative) even count as a
> URI?

It definitely counts as a relative URI.

Cheers,
Solderpunk
---