(Starting a separate thread for this) I think there are three possible places where IRIs could possibly appear in Gemini: 1) In client inputs (the address bar or CLI analogue) and outputs (revealing a link) 2) In the Gemini protocol 3) In text/gemini link lines I think it's important to disentangle these three cases. Case 1 just affects individual clients and can be left up to them, except that there is some best-practice advice about when *n?t* to display an IRI, specifically when there are cross-script confusables involved. For example, "gemini://gemini.circumlunar.xn--spce-63d/" should not be displayed as "gemini://gemini.circumlunar.sp?ce", because that would be deceptive, even in Gemini: you might be pointed to the Evil Version of the Gemini spec and not realize it. I think everyone agrees that Case 2 is a mistake: the protocol elements should continue to be URIs. Case 3 is the difficult one. Should authors be allowed to write text/gemini links with IRI references? It's not that hard for a client to convert them to URI references. No normalization is needed except as part of punycoding. However, everyone has to agree on whether this should work or not; we don't want a user trying to follow a link and sending the Wrong Thing to the server. Gemini isn't just supposed to be easy to program for, it's supposed to be easy to author, too. Unfortunately these objectives are in conflict here. John Cowan http://vrici.lojban.org/~cowan cowan at ccil.org Lope de Vega: "It wonders me I can speak at all. Some caitiff rogue did rudely yerk me on the knob, wherefrom my wits yet wander." An Englishman: "Ay, belike a filchman to the nab'll leave you crank for a spell." --Harry Turtledove, Ruled Britannia -------------- next part -------------- An HTML attachment was scrubbed... URL: <https://lists.orbitalfox.eu/archives/gemini/attachments/20201207/ed25 588c/attachment.htm>
> (Starting a separate thread for this)

Thanks, that's helpful. Hopefully all IRI discussion can move here.

> Case 3 is the difficult one. Should authors be allowed to write text/gemini
> links with IRI references? It's not that hard for a client to convert them
> to URI references. No normalization is needed except as part of punycoding.
> However, everyone has to agree on whether this should work or not; we don't
> want a user trying to follow a link and sending the Wrong Thing to the server.

"It's not that hard for a client to convert them to URI references. No normalization is needed except as part of punycoding."

I don't think that's true. To convert them to a URI reference, the domain needs to be extracted and punycoded, then the path and query string need to be extracted and percent-encoded in the blessed Gemini way that doesn't allow plus signs. Doing all this requires parsing, and as I explained a couple of times in the other thread, IRI parsing is not feasible across multiple programming languages at this time; the libraries just don't exist.

And what if the IRI is a relative reference? As I explained in the other thread, this will definitely require IRI parsing.

Furthermore, it's a breaking change to Gemini. I don't think that's a good idea in any case, with the possible exception of TLS security. Gemini must be reliable, and it's too late for a breaking change.

> Gemini isn't just supposed to be easy to program for, it's supposed to be easy to
> author, too. Unfortunately these objectives are in conflict here.

Yes, and that's unfortunate. But I think it makes sense for the stability of Gemini and the ease of programming to come first.

Cheers,
makeworld
On Mon, Dec 7, 2020 at 9:47 PM <colecmac at protonmail.com> wrote:

> I don't think that's true. To convert them to a URI reference, the domain
> needs to be extracted and punycoded,

Agreed. But if you have a Punycode encoder, then the following steps will convert an IRI reference to a URI reference, without regard to whether it is an IRI or a relative reference:

1) Look in the IRI reference for a "//" and a following "/"; if they exist, pass the characters in between through your encoder and substitute the result into the IRI reference.

2) Start over from the beginning. If a character is ASCII, leave it unchanged. Otherwise, take the character, convert it to UTF-8 bytes (easy) and each byte to hex digits (trivial), decorate it with leading % (trivial), and move on. When you come to the end, stop.

> Furthermore, it's breaking change to Gemini. I don't think that's a good
> idea in any case with the possible exception of TLS security. Gemini must be
> reliable, and it's too late for a breaking change.

Probably true. ~~sigh~~

John Cowan http://vrici.lojban.org/~cowan cowan at ccil.org
Normally I can handle panic attacks on my own; but panic is, at the moment, a way of life. --Joseph Zitt
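For illustration, the two steps described above might be sketched in Python as follows. This is only a sketch under assumptions that are not part of the thread: the function name `iri_ref_to_uri_ref` is made up, Python's built-in "idna" codec (which follows the older IDNA 2003 rules) stands in for "your Punycode encoder", and the text between "//" and the next "/" is assumed to be a bare hostname with no port. It does not attempt any Unicode normalization.

```python
def iri_ref_to_uri_ref(iri_ref: str) -> str:
    """Hypothetical helper: convert an IRI reference to a URI reference."""
    # Step 1: look for "//" and a following "/"; punycode what lies between.
    start = iri_ref.find("//")
    if start != -1:
        host_start = start + 2
        host_end = iri_ref.find("/", host_start)
        if host_end != -1:
            host = iri_ref[host_start:host_end]
            # Assumption: the built-in "idna" codec plays the role of the
            # Punycode encoder and the host has no port attached.
            ascii_host = host.encode("idna").decode("ascii")
            iri_ref = iri_ref[:host_start] + ascii_host + iri_ref[host_end:]

    # Step 2: start over; keep ASCII as-is, and turn every other character
    # into its UTF-8 bytes, each written as "%" plus two hex digits.
    out = []
    for ch in iri_ref:
        if ord(ch) < 128:
            out.append(ch)
        else:
            out.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
    return "".join(out)
```

Because step 1 is skipped when no "//" is present, the same function handles relative references such as "café/menü.gmi" by percent-encoding alone.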
On Mon, Dec 7, 2020, at 18:47, colecmac at protonmail.com wrote:

> Yes, and that's unfortunate. But I think it makes sense for the
> stability of Gemini and the ease of programming to come first.

I'm perplexed that "ease of programming" is considered more important than "ease of adoption."

You mention that not every language supports the libraries needed for internationalized URLs.

What does that lose the project vs. accessibility and broader adoption by non-English-speaking users for whom Gemini would be a boon with limited bandwidth and hardware?

I feel like I'm missing something with the emphasis on ease of client implementation over adoption.

Emma Humphries
gemini://gemini.djinn.party/
> > Yes, and that's unfortunate. But I think it makes sense for the
> > stability of Gemini and the ease of programming to come first.
>
> I'm perplexed that "ease of programming" is considered more important than "ease of adoption."
>
> You mention that not every language supports the libraries needed for internationalized URLs.
>
> What does that lose the project vs. accessibility and broader adoption by non-English-speaking users for who Gemini would be a boon with limited bandwidth and hardware?
>
> I feel like I'm missing something with the emphasis on ease of client implementation over adoption.

I was unsure when I wrote that, and I was worried it would be controversial. But I still think it makes sense. For Gemini to be accessible, have "broader adoption", and be "a boon" as you mention, clients need to be easy to write and maintain. Otherwise, what will these non-English-speaking users browse and serve their content with? A few clients and servers, likely not written in their native language?

Gemini is a non-commercial hobby project for all the developers I am aware of, and there are advantages to that. But it also means that if the protocol is hard to implement, the whole community suffers, because there will be fewer clients and servers.

The fact that writing URLs for non-English languages is difficult sucks. But due to the complexity, and most of all the fact that this would be a breaking change, I don't see IRIs as an option.

makeworld

P.S. I'll admit I'm biased. I write more code for Gemini than I do content, and primarily use my native language English.
It was thus said that the Great colecmac at protonmail.com once stated:
>
> The fact that writing URLs for non-English languages is difficult sucks. But
> due the complexity, and most of all the fact that this would be a breaking
> change, I don't see IRIs as an option.

I thought I might see what's involved with handling IRIs. The actual differences between RFC-3986 (URI) and RFC-3987 (IRI), besides one being a standard (URI) and one being a proposed standard (IRI), come down to the characters that are accepted---the unreserved set

    unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"

becomes

    iunreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" / ucschar
    ucschar     = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
                / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
                / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
                / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
                / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
                / %xD0000-DFFFD / %xE1000-EFFFD

and the query portion changes from

    query       = *( pchar / "/" / "?" )
    pchar       = unreserved / pct-encoded / sub-delims / ":" / "@"

to

    iquery      = *( ipchar / iprivate / "/" / "?" )
    ipchar      = iunreserved / pct-encoded / sub-delims / ":" / "@"
    iprivate    = %xE000-F8FF / %xF0000-FFFFD / %x100000-10FFFD

and that's it as far as the RFCs go (aside from the rule name changes).

As a quick proof-of-concept, I just accepted all non-control UTF-8 characters as unreserved (including the private space) as that was the easiest thing to do, and yes, it works (but does allow potentially bad IRIs through). But the code to *build* a URL from the parsed representation [2] assumes US-ASCII. Again, it would take just a few small changes to allow UTF-8 characters on input and escape them properly for a URL. That's something I'll try working on tomorrow.

That still leaves the question of punycode [3] and Unicode normalization (ugh).

> P.S. I'll admit I'm biased. I write more code for Gemini than I do content, and
> primarily use my native language English.

I am biased too, as a monolingual US mutt, but I do want to try this stuff out.

-spc

[1] https://github.com/spc476/LPeg-Parsers/blob/master/url.lua

[2] https://github.com/spc476/GLV-1.12556/blob/master/Lua/GLV-1/url-util.lua

[3] RFC-3492, which includes C code to encode and decode punycode text, which is valgrind clean (I checked).
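As an illustration of how small the grammar difference is, the extra character classes from the message above can be written as plain range checks. This Python sketch is not the LPeg code referenced in the footnotes; the names `is_ucschar` and `is_iprivate` are made up for the example, and the ranges are copied from the `ucschar` and `iprivate` rules quoted in the message.

```python
# Code point ranges for ucschar: three BMP ranges, then planes 1 to 13
# (each ending at xFFFD), then the partial range in plane 14.
UCSCHAR_RANGES = (
    [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
    + [(plane << 16, (plane << 16) | 0xFFFD) for plane in range(1, 14)]
    + [(0xE1000, 0xEFFFD)]
)

# Code point ranges for iprivate (only allowed in the query part).
IPRIVATE_RANGES = [(0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD)]


def _in_ranges(ch: str, ranges) -> bool:
    code_point = ord(ch)
    return any(low <= code_point <= high for low, high in ranges)


def is_ucschar(ch: str) -> bool:
    """True if ch is one of the extra characters IRIs allow as iunreserved."""
    return _in_ranges(ch, UCSCHAR_RANGES)


def is_iprivate(ch: str) -> bool:
    """True if ch is a private-use character allowed only in iquery."""
    return _in_ranges(ch, IPRIVATE_RANGES)
```

A stricter parser than the "accept everything non-control" proof of concept could call `is_ucschar` for path characters and additionally `is_iprivate` for query characters.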
On Mon, 7 Dec 2020 23:00:01 -0500
John Cowan <cowan at ccil.org> wrote:

> Agreed. But if you have a Punycode encoder, then the following steps will
> convert an IRI reference to a URI reference, without regard to whether it
> is an IRI or a relative reference:
>
> 1) Look in the IRI reference for a "//" and a following "/"; if they exist,
> pass the characters in between through your encoder and substitute the
> result into the IRI reference.
>
> 2) Start over from the beginning. If a character is ASCII, leave it
> unchanged. Otherwise, take the character, convert it to UTF-8 bytes (easy)
> and each byte to hex digits (trivial), decorate it with leading %
> (trivial), and move on. When you come to the end, stop.

There's a "draw the owl" step somewhere here regarding Unicode normalization. Does the server like your ?:s fully composed or decomposed, or should the server itself be responsible for normalization?

--
Philip
> On Dec 8, 2020, at 05:06, Emma Humphries <ech at emmah.net> wrote:
>
> I'm perplexed that "ease of programming" is considered more important than "ease of adoption."

Or basic use for that matter.

Gemini's narrative is built upon the fallacy that everything is "just-oh-so-trivial": any Dick and Jane can fire up the most bare-bones telnet over their trusty dial-up modem and be done. Some sort of citizen-programmer-publisher nirvana, without any barriers to entry whatsoever.

Admirable. But not quite practical.

The internet stack is deep, old, and brittle. Tooling matters.

Still, all very admirable :)

Long live Gemini.
On Mon, 07 Dec 2020 20:06:27 -0800
"Emma Humphries" <ech at emmah.net> wrote:

> I'm perplexed that "ease of programming" is considered more important than "ease of adoption."

Consider "ease of programming" and in particular stability a subset of "ease of adoption". There are numerous client and server implementations because it is easy to implement, and easy to maintain because the protocol is relatively stable even in these early stages. The different software allows people with different goals to adopt the protocol, and helps in weeding out shortcomings of clarity in the specification by analysis of the subtle differences between implementations.

> You mention that not every language supports the libraries needed for internationalized URLs.
>
> What does that lose the project vs. accessibility and broader adoption by non-English-speaking users for who Gemini would be a boon with limited bandwidth and hardware?

It seems more likely that a change to this end would hurt adoption. Numerous pieces of existing Gemini software would immediately be invalidated. Not all of them will be updated to accommodate the change. I could perhaps see a more pressing need for the change if internet users worldwide weren't already used to transliteration. It's such a small part as well. UTF-8 is acceptable (and default) in text/gemini documents, and the text content of a capsule can indeed be written in any of the scripts supported by Unicode.

--
Philip
On Mon, Dec 07, 2020 at 09:37:28PM -0500, John Cowan <cowan at ccil.org> wrote a message of 117 lines which said: > I think there are three possible places where IRIs could possibly appear in > Gemini: A fourth one is server configuration. (When you declare the virtual host, for instance.) > some best-practice advice about when *n?t* to display an IRI, specifically > when there are cross-script confusables involved. For example, > "gemini://gemini.circumlunar.xn--spce-63d/" should not be displayed as > "gemini://gemini.circumlunar.sp?ce", because that would be deceptive, even > in Gemini: you might be pointed to the Evil Version of the Gemini spec and > not realize it. As I said, I regard "homograph attacks" as mostly a tale to discourage people to use Unicode. They are not a real-world problem.
On Tue, Dec 08, 2020 at 02:47:22AM +0000, colecmac at protonmail.com <colecmac at protonmail.com> wrote a message of 50 lines which said: > Furthermore, it's breaking change to Gemini. I don't think that's a > good idea in any case with the possible exception of TLS > security. Gemini must be reliable, and it's too late for a breaking > change. Hold on. I'm a newbie in Gemini and I was under the impression that Gemini is still experimental and the specification still in flux. If it is not true, if Gemini is frozen and "take it or leave it", that's a different matter, and we could save some time by rejecting many discussions.
It was thus said that the Great Stephane Bortzmeyer once stated: > On Tue, Dec 08, 2020 at 02:47:22AM +0000, > colecmac at protonmail.com <colecmac at protonmail.com> wrote > a message of 50 lines which said: > > > Furthermore, it's breaking change to Gemini. I don't think that's a > > good idea in any case with the possible exception of TLS > > security. Gemini must be reliable, and it's too late for a breaking > > change. > > Hold on. I'm a newbie in Gemini and I was under the impression that > Gemini is still experimental and the specification still in flux. If > it is not true, if Gemini is frozen and "take it or leave it", that's > a different matter, and we could save some time by rejecting many > discussions. I know Solderpunk wants to do a series of freezes then thaws as things are worked on, but I think things progress a bit faster than he can deal with, or wants to deal with, given his long absences on the list. For me personally, I think this should be worked out, and I'm working towards that with my own server [1]. I've had to make changes to GLV-1.12556 in the past when the protocol changed, I can change it again. -spc [1] https://github.com/spc476/GLV-1.12556
Philip Linde writes:

> On Mon, 07 Dec 2020 20:06:27 -0800
> "Emma Humphries" <ech at emmah.net> wrote:
>
>> I'm perplexed that "ease of programming" is considered more important than "ease of adoption."
>
> Consider "ease of programming" and in particular stability a subset of
> "ease of adoption". There are numerous client and server
> implementations because it is easy to implement, and easy to maintain
> because the protocol is relatively stable even in these early stages.
> The different software allows people with different goals to adopt the
> protocol, and helps in weeding out shortcomings of clarity in the
> specification by analysis of the subtle differences between
> implementations.
>
>> You mention that not every language supports the libraries needed for internationalized URLs.
>>
>> What does that lose the project vs. accessibility and broader adoption by non-English-speaking users for who Gemini would be a boon with limited bandwidth and hardware?
>
> It seems more likely that a change to this end would hurt adoption.
> Numerous pieces of existing Gemini software would immediately be
> invalidated. Not all of them will be updated to accommodate the change.
> I could perhaps see a more pressing need for the change if internet
> users worldwide weren't already used to transliteration. It's such a
> small part as well. UTF-8 is acceptable (and default) in text/gemini
> documents, and the text content of a capsule can indeed be written in
> any of the scripts supported by Unicode.

Hey,

I'm new to this list, and a new Gemini user, but this topic is fairly important to me. It's discouraging to see a lot of fear-mongering around this topic already.

Some points that have come up a few times already in this thread as well as the IDN thread that I think are worth addressing:

1. Homograph attacks

Stephane has already mentioned in a different response that homograph attacks are fairly rare. I don't have the knowledge to say whether or not that's accurate, but I can speak to how they're mitigated.

In general, browsers will render the domain in the URI bar if all of the characters in each section belong to the same script. As an example, https://?pple.com will not render correctly in Firefox in the URI bar, but https://?????.com/ will render correctly (neither domain exists, if you want to check). The other half of this comes down to domain registrars not allowing registrations of domains with homographs (depends on the TLD, of course).

What this comes down to is that Gemini clients, if they wish to mitigate this type of attack, should apply the same algorithm as web browsers. Again, given the preference for client certs for authenticating sessions, it doesn't seem like this attack would have dire consequences anyway.

I also think I saw someone mention that they're worried about it from the IRI side as well? That attack doesn't seem like much of a realistic case, since if they direct you to a different page on the same server, you're, well, still on the same server. This only becomes problematic in the case of shared hosting of untrusted tenants.

2. Normalization

There's been a bit of fear-mongering about normalization, which I can totally understand, since the Unicode technical reports and the 4 normalization forms look intimidating at first glance.

However, as pointed out in a few RFCs, NFC is more or less the only normalization form that you need to worry about in *most* circumstances. Typed URIs should be normalized in NFC, both on server-side and client-side. When resolving files to the filesystem, the filename should be normalized to NFC. (This all assumes that your fs supports Unicode paths.)

NFKC becomes more relevant in the case that you want to implement something like search, or find in page, or something. You may want a user to be able to type in something like 'e' and have their find include everything whose NFKC form is basically an 'e' (see the full set here: https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%3ANFKC_Casefold%3De%3A%5D&g=&i=).

3. Language support

Normalization is generally supported across different languages pretty easily.

Python has it in its stdlib: https://docs.python.org/3/library/unicodedata.html#unicodedata.normalize

Golang has support: https://pkg.go.dev/golang.org/x/text/unicode/norm

Rust: https://unicode-rs.github.io/unicode-normalization/unicode_normalization/index.html

C gets its support through the venerable libicu library (you're already using libs for TLS): https://unicode-org.github.io/icu/userguide/transforms/normalization/

I will say that I don't know of any explicit IRI-handling libraries, nor do I know what the state of support is in different URI-handling libraries, but it will be something I play with as I work on gemini projects. I'm happy to share my experiences when I have more of them. :)

-

To address some non-technical points, I don't think that starting a new protocol and then deciding to ignore internationalization is necessarily the right way to go. In a lot of cases, internationalization sucks because of legacy support, and gemini doesn't *have* legacy to preserve compatibility. As I understand it, that's why TLS is mandatory, even though it arguably locks out some retro systems from being able to use it.

Personally, I'd like to see the spec say something about how this is handled before any type of freeze takes place.

--
worr
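To make the normalization point concrete, here is a small Python sketch using the stdlib function linked above. The helper names `nfc` and `find_file` are made up for the example, and the idea of comparing NFC forms when matching a request against filenames is just one possible reading of the advice in this message, not something the Gemini spec prescribes.

```python
import unicodedata

# The same visible word, typed two different ways.
composed = "caf\u00e9"      # "é" as a single code point (already NFC)
decomposed = "cafe\u0301"   # "e" followed by a combining acute accent (NFD)


def nfc(text: str) -> str:
    """Return text in Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text)


def find_file(requested: str, filenames: list[str]):
    """Hypothetical lookup: match a requested name against the names on
    disk by comparing NFC forms, so either typing style finds the file."""
    wanted = nfc(requested)
    for name in filenames:
        if nfc(name) == wanted:
            return name
    return None


# NFKC additionally folds compatibility characters, which is the property
# a "find in page" feature might want: a fullwidth "e" becomes a plain "e".
fullwidth_e = "\uff45"
folded = unicodedata.normalize("NFKC", fullwidth_e)
```

Without the `nfc` calls, `composed` and `decomposed` compare as different strings even though they render identically, which is the mismatch between what a user types and what a sysadmin named a file that the thread worries about.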
> I know Solderpunk wants to do a series of freezes then thaws as things are > worked on, but I think things progress a bit faster than he can deal with, > or wants to deal with, given his long absences on the list. I'd love to see a spec freeze, too. There are already a lot of gemini servers, clients and other tools out there and breaking changes should be avoided unless absolutely necessary. > For me personally, I think this should be worked out, and I'm working > towards that with my own server [1]. I've had to make changes to > GLV-1.12556 in the past when the protocol changed, I can change it again. How about waiting for a consensus to develop, *at the very least*? If the protocol were to change to allow IRIs, that's a *major breaking* change that to me, as someone actually serving non-English content, is not only completely unnecessary but harmful. 1. I would still have to percent-encode my links to stay compatible with existing clients. 2. With clients now potentially sending IRIs and not encoded URIs as requests I would have to change the request handling in my server code to allow for this, possibly having to add third-party dependencies. 3. I'm still not convinced this would help anyone - IRIs still have reserved characters that have to be properly encoded - so completely non-technical text/gemini authors will still have to rely on proper tooling. bie
On Tue, Dec 8, 2020 at 5:10 AM Petite Abeille <petite.abeille at gmail.com> wrote: > Gemini's narrative is build upon the fallacy that everything is "just-oh-so-trivial": any Dick and Jane can fire up the most bare bone of telnet over their trusty dial-up modem and be done. Some sort of citizen-programmer-publisher nirvana, without any barriers to entry whatsoever. Well, not telnet, because of TLS, but openssl-s_client at least.
On Tue, Dec 08, 2020 at 12:46:47PM +0100, William Orr wrote:

> In general, browsers will render the domain in the URI bar if all of
> the characters in the each section belong to the same script. As an
> example, https://?pple.com will not render correctly in Firefox in the
> URI bar, but https://?????.com/ will render correctly (both domains do
> not exist if you want to check).

A lot of extra complexity for very little value.

FWIW, the first URL you showed looks absolutely the same as the legit "https://apple.com" I typed manually in my vim in TERM=linux.

I came to Gemini because on the web I, an inhabitant of /dev/tty1, am a third-class citizen. Please, don't bring this to Gemini. If "curl gemini://foo.example/" is not good enough, then your feature is too complicated.

My native language is Russian (which is not even Latin-based), and the government website has the URL "https://gosuslugi.ru", and everything works fine. If you ask me, IRI is a huge mistake.
> On Dec 8, 2020, at 12:49, bie <bie at 202x.moe> wrote: [2020-12-08T11:31:16.041Z] <bie> fff [2020-12-08T11:52:29.189Z] <bie> fuck this [2020-12-08T11:52:32.918Z] <bie> lol [2020-12-08T11:52:45.090Z] <bie> time to unsubscribe ... [2020-12-08T12:51:08.517Z] <khuxkm> my favorite was whoever's response to you saying that was "oh we're a frozen spec now?" [2020-12-08T12:51:16.729Z] <khuxkm> like YES we've been a frozen spec since, like, June ... [2020-12-08T12:51:45.154Z] <makeworld> khuxkm: I pinged Solderpunk on Masto and he got back to me very quickly saying he had read the IDN thread and was going to come to a decision soon [2020-12-08T12:51:52.567Z] <makeworld> Yeah lol [2020-12-08T12:52:08.982Z] <makeworld> I think spc is getting nerd sniped ... etc, etc, etc... Certainly you must be aware that the logs from #gemini on tilde.chat are fully accessible to everyone who can be bothered, snarky comments & all. For posterity. https://portal.mozz.us/gemini/makeworld.gq/cgi-bin/gemini-irc
> On Dec 8, 2020, at 14:17, A. E. Spencer-Reed <easrng at gmail.com> wrote: > > Well, not telnet, because of TLS, but openssl-s_client at least. TLS-based Telnet Security https://tools.ietf.org/html/draft-ietf-tn3270e-telnet-tls-00 But yes, some sort of TLS layer of one kind or another :)
On Tue, Dec 8, 2020 at 6:49 AM bie <bie at 202x.moe> wrote:

> If the protocol were to change to allow IRIs, that's a *major breaking*
> change that to me, as someone actually serving non-English content, is
> not only completely unnecessary but harmful.

As I said at the beginning of this thread, I don't think anyone is actually arguing for a change to the protocol. What does warrant discussion is allowing IRI references as links in the text/gemini format.
[2020-12-08T15:01:30.527Z] <makeworld> Hello, Petite Abeille. Apparently you're watching the logs, courtesy of my server, so that you can send them back on to the mailing list. [2020-12-08T15:01:50.519Z] <makeworld> I don't see the point, other than trying to stir the pot. Please stop. [2020-12-08T15:02:34.637Z] <bie> ? [2020-12-08T15:02:39.645Z] <makeworld> The fact that this channel is logged is in the topic, it is known. None of the comments you sent were rude, as well.
On Tuesday, December 8, 2020 10:57 AM, John Cowan <cowan at ccil.org> wrote:

> On Tue, Dec 8, 2020 at 6:49 AM bie <bie at 202x.moe> wrote:
>
> > If the protocol were to change to allow IRIs, that's a *major breaking*
> > change that to me, as someone actually serving non-English content, is
> > not only completely unnecessary but harmful.
>
> As I said at the beginning of this thread, I don't think anyone is actually
> arguing for a change to the protocol. What does warrant discussion is
> allowing IRI references as links in the text/gemini format.

I think some people really were calling for a breaking change to the protocol. But I'm glad you're not, and I hope we can move on and stop talking about it.

What you propose here is allowing IRIs in link lines only? Or do you mean allowing only IRIs for relative references?

I'm unsure whether that would require an IRI parser or not, but I'd feel more confident with one. However, there is already a client torture test that *sort of* covers this. It's not designed as an IRI test, but it includes invalid characters in a link line.

gemini://gemini.conman.org/test/torture/0031

That page contains a link line that looks like this:

=> <0032> "Beware the bad link"

And the Go stdlib will actually correct this link and output a correct absolute one. So in Amfora, it will go to the correct URL, which is gemini://gemini.conman.org/test/torture/%3C0032%3E

I've set up my own test that contains a more complex Unicode character: ?. It tests the path, as well as Unicode in the query strings. You can access it at: gemini://makeworld.gq/test/iri-link.gmi

Go also corrects the link in that one, and it works.

Allowing IRIs in link lines (maybe only for relative links to ease parsing) would solve all multi-lingual author problems. But this is still a somewhat-breaking change, as once authors start using these, other non-Go clients will likely begin to fail. And the correction that Go does is not even complete, because it will not work on query strings. And even if it did, it would not work in the Gemini way that doesn't allow pluses, etc.

We're almost there with this one, but I still think it's a mistake, and it'll make Gemini more complex. :/

makeworld
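The "pluses" remark refers to the difference between plain percent-encoding and HTML-form-style encoding of query strings. As a rough Python illustration (the function name `gemini_query` is made up, and this reflects only how the thread describes Gemini's convention, namely that a space is written as %20 and never as "+"):

```python
from urllib.parse import quote, quote_plus


def gemini_query(user_input: str) -> str:
    """Hypothetical helper: percent-encode user input for a Gemini query.

    safe="" makes quote() escape reserved characters too, so a literal
    "+" or "/" typed by the user cannot be mistaken for a delimiter.
    """
    return quote(user_input, safe="")


# Plain percent-encoding: a space becomes %20.
plain = gemini_query("hello world")

# Form-style encoding, shown only for contrast: a space becomes "+".
form_style = quote_plus("hello world")
```

Here `plain` is "hello%20world" while `form_style` is "hello+world"; a client library that silently applies the second convention would send something different from what the thread calls the Gemini way.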
> On Dec 8, 2020, at 18:55, colecmac at protonmail.com wrote: > > [2020-12-08T15:01:50.519Z] <makeworld> I don't see the point, other than trying to stir the pot. Please stop. > [2020-12-08T15:02:39.645Z] <makeworld> The fact that this channel is logged is in the topic, it is known. None of the comments you sent were rude, as well. Did I touch a nerve? Apologies. Back to our regular program then.
On Tue, Dec 8, 2020 at 2:09 PM <colecmac at protonmail.com> wrote:

> I think some people really were calling for a breaking change to the
> protocol. But I'm glad you're not, and I hope we can move on and stop
> talking about it.
>
> What you propose here is allowing IRIs in link lines only?

Yes.

> Or do you mean allowing only IRIs for relative references?

No.

> I'm unsure whether that would require an IRI parser or not,

It will not, because conversion can be done before parsing, other than the trivial parsing required to find the hostname and punycode it. Once that is done, converting an IRI reference to a URI reference is as straightforward as transcoding from one character set to another, and totally indifferent to the IRI format.

So my two steps for IRI->URI conversion become three:

1) NFC normalization.

2) Punycode conversion of the hostname.

3) Percent-encoding: find non-ASCII characters and convert them to %nn%nn, or %nn%nn%nn, or %nn%nn%nn%nn sequences, where nn is two hex digits.

It turns out that all of this is spelled out in more detail at <https://tools.ietf.org/html/rfc3987#section-3.1>. That section says not to normalize unless you have the IRI in non-digital or non-UTF* format, but since the world is not full of editors that normalize, I think Gemini clients need to do it themselves. That said, most keyboard drivers (even for hard cases like Vietnamese, which has way too many vowels to dedicate a key to each [*]) now deliver normalized text to applications.

It's good to know that some existing URI libraries support IRIs, but that section should be convincing evidence that you can change an IRI to a URI without parsing it (always excepting the domain name, which is trivial to find).

> But this is still a somewhat-breaking change, as once authors start using
> these, other non-Go clients will likely begin to fail. And the correction
> that Go does is not even complete, because it will not work on query
> strings. And even if it did, it would not work in the Gemini way that
> doesn't allow pluses, etc etc.

The above transformation will work, however. Sometimes DIY is the Right Thing.

> We're almost there with this one, but I still think it's a mistake, and
> it'll make Gemini more complex. :/

It will. But in the end, if Gemini succeeds even modestly there will be more authors than programmers.

[*] 72 lower-case vowel letters: 6 vowels without diacritics plus 6 vowels with vowel-quality diacritics, as in French, times 6 tone marks (one of which is "no mark") as in Chinese. And the same number in upper case.

John Cowan http://vrici.lojban.org/~cowan cowan at ccil.org
It's like if you meet an really old, really rich guy covered in liver spots and breathing with an oxygen tank, and you say, "I want to be rich, too, so I'm going to start walking with a cane and I'm going to act crotchety and I'm going to get liver disease. --Wil Shipley
> > We're almost there with this one, but I still think it's a mistake, and it'll
> > make Gemini more complex. :/
>
> It will. But in the end, if Gemini succeeds even modestly there will be more
> authors than programmers.

This is the point that sticks out to me. Perhaps I was wrong. The method you
outlined does not seem that complex, and it really would benefit authors. It's
still a breaking change, but all existing links would still work.

The most difficult part of what you outlined is the Unicode normalization,
which maybe not all languages have libraries for, and would also require
updating every so often. But it wouldn't be a requirement for clients at all,
just something nice to have.

However, it does raise a few questions:

I assume you mean NFC normalization? Any other option seems nonsensical to me,
but I'm also new to this in general. Would be happy to be corrected.

What if the user named a domain/file/folder in a non-NFC way? Now does the
server need to support NFC as well, and apply it to vhost recognition or local
file paths to correctly match requests? That seems wrong. But so does the user
entering something visually identical to what the sysadmin typed, and things
not working.

I'm not keen to muddle up the threads again, but it seems like this proposal
completely covers IDNs as well, which is handy.

Overall, I like it. The biggest thing holding me back is the fact that it will
break clients, over time. But perhaps that's worth it for the ease-of-writing
gain for non-English speakers.

I wouldn't mind updating Amfora to support this. As I explained in my previous
email, it sort of already does this by accident.

Cheers,
makeworld
It was thus said that the Great colecmac at protonmail.com once stated:
>
> I'm unsure whether that would require an IRI parser or not, but I'd feel more
> confident with one. However, there is already a client torture test that *sort of*
> covers this. It's not designed as an IRI test, but it includes invalid
> characters in a link line.
>
> gemini://gemini.conman.org/test/torture/0031
>
> That page contains a link line that looks like this:
>
> => <0032> "Beware the bad link"
>
> And the Go stdlib will actually correct this link and output a correct
> absolute one. So in Amfora, it will go to the correct URL, which is
> gemini://gemini.conman.org/test/torture/%3C0032%3E

What you failed to quote from that test is:

	I'm not entirely sure what the proper response should be ...

And it was a last minute thing to add the link to %3C0032%3E---I was
thinking it was more of an Easter Egg type of thing than what the actual
result should be.

> I've set up my own test that contains a more complex Unicode character: ?.
> It tests the path, as well as Unicode in the query strings.
> You can access it at: gemini://makeworld.gq/test/iri-link.gmi

I tried both the Gemini Client Torture Test 31, and your link with the
Gemini portal at portal.mozz.us. The results were interesting. It failed
the Gemini Client Torture Test, but loaded the page with the Unicode
character on your site. So at least it supports percent encoding of
characters outside the ASCII range.

-spc (So that's one more data point ... )
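For illustration, the outcome described above for torture test 31 (the link target `<0032>` ending up as `%3C0032%3E` next to the base URL) can be reproduced with a few lines of Python. This is only a sketch of the end result, not of what the Go standard library does internally: `resolve_sibling` is a hypothetical helper that handles nothing but a plain sibling target, with no `..`, query, or absolute forms.

```python
from urllib.parse import quote


def resolve_sibling(base: str, link_target: str) -> str:
    """Percent-encode a simple relative link target and place it beside `base`."""
    directory = base.rsplit("/", 1)[0]  # drop the last path segment of the base
    # quote() escapes characters such as "<" and ">" that may not appear raw in a URI.
    return directory + "/" + quote(link_target, safe="/")


print(resolve_sibling("gemini://gemini.conman.org/test/torture/0031", "<0032>"))
# gemini://gemini.conman.org/test/torture/%3C0032%3E
```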
It was thus said that the Great Petite Abeille once stated:
>
> [2020-12-08T12:52:08.982Z] <makeworld> I think spc is getting nerd sniped

I don't know if I'm being nerd sniped or not, but I do think this has
brought to my attention some encoding bugs I have---namely, I don't encode
data with non-US-ASCII characters. Fixing bugs is always A Good Thing (TM).

I'm also looking into just how hard it would be to support IRIs. Except
for the normalization thing, it looks to be fairly straightforward, but I
haven't worked on it that much yet. I've already had my preconceived notions
of DNS blown out of the water over this thread (and I've implemented a DNS
library [1] so that's saying something).

-spc (What it says, I don't know)

[1] https://github.com/spc476/SPCDNS
> 3) Percent-encoding: find non-ASCII characters and convert them to %nn%nn,
> or %nn%nn%nn, or %nn%nn%nn%nn sequences, where nn is two hex digits.

One extra thing: Gemini will need its own list of reserved characters. The
URI spec defines[1] this list:

    gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"

(It also defines a list called sub-delims, but that only applies to query
strings I believe, and is irrelevant to the way Gemini uses them.)

These characters are reserved because of their use in other parts of
a URI. But Gemini does not use all those parts, such as userinfo. I
believe a reserved character list for Gemini could look like this:

    ":" / "/" / "#" / "?" / "[" / "]"

I left fragments ("#") in, so that clients can add support for them later,
if/when a header-to-fragment algorithm is defined, like exists for Markdown.
But that character could be removed too, which would prevent it ever being
used in that manner.

1: https://tools.ietf.org/html/rfc3986#section-2.2

makeworld
On Tue, Dec 8, 2020 at 4:10 PM <colecmac at protonmail.com> wrote:

> The most difficult part of what you outlined is the Unicode normalization,
> which maybe not all languages have libraries for, and would also require
> updating every so often. But it wouldn't be a requirement for clients at
> all, just something nice to have.

If a client has an unnormalized IRI, it needs to normalize it before
sending it to the server. That said, a 2009 study looked at a sample of 700
million HTML documents, of which only 0.02% were not in NFC already, which
suggests that NFC text is already pretty dominant.

> I assume you mean NFC normalization?

Yes. When I speak of normalization, I mean NFC normalization exclusively.

> What if the user named a domain/file/folder in a non-NFC way? Now does the
> server need to support NFC as well, and apply it to vhost recognition or
> local file paths to correctly match requests? That seems wrong. But so does
> the user entering something visually identical to what the sysadmin typed,
> and things not working.

I'm okay with that just failing, as file names are not really part of
text/gemini content. The difference will be obvious to the admin by
checking the requested URIs from the server log against the %-encoded names
of the folders.

John Cowan http://vrici.lojban.org/~cowan cowan at ccil.org
I am a member of a civilization. --David Brin
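To make the "visually identical but not working" case concrete, here is a small Python sketch comparing the percent-encoded forms of one file name in NFC and in NFD. The file name `café.gmi` and the helper `encoded_forms` are made up for illustration; the point is only that the two spellings look the same on screen yet yield different request paths, which is the difference an admin would see when comparing the server log against the %-encoded folder names.

```python
import unicodedata
from urllib.parse import quote


def encoded_forms(name: str) -> tuple[str, str]:
    """Return the percent-encoded NFC and NFD spellings of `name`."""
    nfc = unicodedata.normalize("NFC", name)  # "é" as one code point, U+00E9
    nfd = unicodedata.normalize("NFD", name)  # "e" followed by U+0301 combining acute
    return quote(nfc), quote(nfd)


nfc_path, nfd_path = encoded_forms("caf\u00e9.gmi")
print(nfc_path)  # caf%C3%A9.gmi
print(nfd_path)  # cafe%CC%81.gmi
```

A client that normalizes to NFC before sending will always request the first form, so a file stored under the second spelling would not be found by a byte-for-byte lookup.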
On Tue, Dec 8, 2020 at 4:42 PM <colecmac at protonmail.com> wrote:

> (It also defines a list called sub-delims, but that only applies to query
> strings I believe, and is irrelevant to the way Gemini uses them.)

Gemini query strings can certainly be formatted like Web query strings if
the client knows that's what the server expects. Simple free text isn't
the only possibility. I'm going to talk about that in a posting at some
point.

> These characters are reserved because of their use in other parts of
> a URI. But Gemini does not use all those parts, such as userinfo. I
> believe a reserved character list for Gemini could look like this:
>
> ":" / "/" / "#" / "?" / "[" / "]"

We still need the square brackets for the rare case when the host-part is
an IPv6 address. The only character we could leave out with complete
safety is @, and I don't think that's worth special-casing for Gemini.
It's simpler and better to have the same rules for all URIs.

> I left fragments ("#") in, so that clients can add support for them later,
> if/when a header-to-fragment algorithm is defined, like exists for
> Markdown.

+1 to leaving # reserved, not only for that reason but for the same reason
as @; it's not worth making a special rule for Gemini to avoid a trivial
amount of %-encoding, especially given that most file names don't have
either one in their names.

John Cowan http://vrici.lojban.org/~cowan cowan at ccil.org
weirdo: When is R7RS coming out?
Riastradh: As soon as the top is a beautiful golden brown and if you stick
a toothpick in it, the toothpick comes out dry.
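One practical consequence of keeping the same reserved-character rules as every other URI is that a stock encoder can be used unchanged. The Python sketch below is an assumption-laden illustration, not something proposed in the thread: `encode_segment` is a hypothetical helper that escapes a single path segment (for example a file name) so that none of the RFC 3986 gen-delims survive in it. It is stricter than necessary, since `quote(..., safe="")` also escapes the sub-delims.

```python
from urllib.parse import quote, quote_plus

GEN_DELIMS = ":/?#[]@"  # the gen-delims list from RFC 3986 section 2.2


def encode_segment(name: str) -> str:
    """Percent-encode one path segment, leaving only unreserved characters raw."""
    return quote(name, safe="")


print(encode_segment("notes #1?.gmi"))  # notes%20%231%3F.gmi
print(encode_segment("a@b[c]"))         # a%40b%5Bc%5D

# Spaces become %20 here. The "+" convention used by Web forms is a different
# encoding, shown only for contrast with the no-pluses behaviour discussed above.
print(quote("a b"))       # a%20b
print(quote_plus("a b"))  # a+b
```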
> > The most difficult part of what you outlined is the Unicode normalization,
> > which maybe not all languages have libraries for, and would also require
> > updating every so often. But it wouldn't be a requirement for clients at all,
> > just something nice to have.
>
> If a client has an unnormalized IRI, it needs to normalize it before sending
> it to the server.

Could you justify this? It's a good thing to have, but it feels like a big ask,
as Unicode support and especially things like normalization are not
straightforward in all languages. I don't see why it can't just be recommended
and not required.

> > I assume you mean NFC normalization?
>
> Yes. When I speak of normalization, I mean NFC normalization exclusively.

Sounds good!

> > What if the user named a domain/file/folder in a non-NFC way? Now does the server
> > need to support NFC as well, and apply it to vhost recognition or local file paths
> > to correctly match requests? That seems wrong. But so does the user entering
> > something visually identical to what the sysadmin typed, and things not
> > working.
>
> I'm okay with that just failing, as file names are not really part of text/gemini
> content. The difference will be obvious to the admin by checking the requested
> URIs from the server log against the %-encoded names of the folders.

The issue is that admins are not the only ones who create folders and files.
Non-technical people will as well, and a bug like this will be very confusing.
Everything will look right, but it just won't work. However, I doubt this will
occur very often, and it's an acceptable tradeoff to supporting Unicode.

makeworld
It was thus said that the Great bie once stated:
> > I know Solderpunk wants to do a series of freezes then thaws as things are
> > worked on, but I think things progress a bit faster than he can deal with,
> > or wants to deal with, given his long absences on the list.
>
> I'd love to see a spec freeze, too. There are already a lot of gemini
> servers, clients and other tools out there and breaking changes should
> be avoided unless absolutely necessary.
>
> > For me personally, I think this should be worked out, and I'm working
> > towards that with my own server [1]. I've had to make changes to
> > GLV-1.12556 in the past when the protocol changed, I can change it again.
>
> How about waiting for a consensus to develop, *at the very least*?

If I waited for consensus, Gemini would not be where it is today [1].
Also, it brought out what I consider a bug in my code (generating links
from filenames): it doesn't properly URL encode data [2].

> If the protocol were to change to allow IRIs, that's a *major breaking*
> change that to me, as someone actually serving non-English content, is
> not only completely unnecessary but harmful.

I don't expect that an IRI will be allowed for a request, but that an IRI
could be in a Gemini text file and it's up to the client to do the
conversion. And it's that bit that I'm currently exploring.

> 3. I'm still not convinced this would help anyone - IRIs still have
> reserved characters that have to be properly encoded - so completely
> non-technical text/gemini authors will still have to rely on proper
> tooling.

And we won't know until somebody tries.

-spc

[1] There's a reason why GLV-1.12556 and gemini.conman.org were the
    first Gemini server software and server in existence: I just went
    ahead and implemented it while solderpunk was still talking about it.
    And I think the presence of GLV-1.12556 and gemini.conman.org sparked
    others to get busy.

    And GLV-1.12556 was *NOT* following the specification at the time, as
    I disagreed with parts of the specification.

[2] I don't have any non-ASCII file names, so it never crossed my mind
    to handle such things. That is a blind spot as far as I'm concerned.
colecmac at protonmail.com writes:

> The issue is that admins are not the only ones who create folders and files.
> Non-technical people will as well, and a bug like this will be very confusing.
> Everything will look right, but it just won't work. However, I doubt this will
> occur very often, and it's an acceptable tradeoff to supporting Unicode.

It's arguably worse than that; consider the case where your filesystem
doesn't store filenames in UTF-8; notably, Windows stores them in UCS2. If
you're treating filenames as Unicode strings and not byte arrays, and
your language provides good abstractions for that, you're okay, but the
upshot is that both the client and the server really do need to be
Unicode aware.

--
+-----------------------------------------------------------+
| Jason F. McBrayer jmcbray at carcosa.net |
| A flower falls, even though we love it; and a weed grows, |
| even though we do not love it. -- Dogen |
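The distinction between filenames as byte arrays and filenames as Unicode strings can be shown with a short Python sketch. Everything here is illustrative: the helper names are hypothetical, the file name is made up, and `utf-16-le` is used as a stand-in for a 16-bit filesystem encoding such as the one described above. The raw bytes differ between encodings, but once both the stored name and the percent-decoded request are turned into strings they compare equal.

```python
from urllib.parse import unquote


def name_from_request(segment: str) -> str:
    """Decode a percent-encoded path segment (UTF-8 on the wire) into a string."""
    return unquote(segment, encoding="utf-8", errors="strict")


def name_from_disk(raw: bytes, fs_encoding: str) -> str:
    """Decode a stored file name using the filesystem's own encoding."""
    return raw.decode(fs_encoding)


utf8_raw = "caf\u00e9.gmi".encode("utf-8")       # how a UTF-8 filesystem stores it
utf16_raw = "caf\u00e9.gmi".encode("utf-16-le")  # how a 16-bit filesystem stores it

requested = name_from_request("caf%C3%A9.gmi")
print(utf8_raw == utf16_raw)                              # False: bytes differ
print(requested == name_from_disk(utf8_raw, "utf-8"))     # True
print(requested == name_from_disk(utf16_raw, "utf-16-le"))  # True
```

This sketch does not normalize either side, so the NFC question raised earlier in the thread still applies on top of it.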