[In the spirit of Scott Pilgrim vs. the World]

There have been a handful of intertwingled* conversations about the topic. To recap:

2020-12-04 Stephane Bortzmeyer got the ball rolling with "IDN with Gemini?":
gemini://gemi.dev/gemini-mailing-list/messages/003788.gmi

2020-12-08 John Cowan followed with "Three possible uses for IRIs":
gemini://gemi.dev/gemini-mailing-list/messages/003873.gmi

2020-12-09 Jason McBrayer contributed "Some reading on IRIs and IDNs":
gemini://gemi.dev/gemini-mailing-list/messages/003923.gmi

To be charitable, we can also include Alex's self-described "shitpost" dated 2020-12-15:
gemini://gemi.dev/gemini-mailing-list/messages/004055.gmi

[2020-12-15T01:47:20.412Z] <nytpu> sending an message to the ML making fun of the long-running spec-changing threads. i'll probably regret it, but here goes
[2020-12-15T07:05:14.499Z] <nytpu> i've bitched about it but this is the first time i've really addressed the points other than in passing
[2020-12-15T07:05:42.682Z] <nytpu> and even then it's more a shitpost than a real rebuttal, don't take it too seriously

So what's the issue making Alex lose his marbles, thin skin aside? It boils down to this:

=> gemini://🐰.mozz.us/🐇.gmi Hoppity hop

What to do with such a construct? Possible? Not possible? Allowed? Not allowed? First-class citizen? Afterthought? How to deal with it, if at all?

Decisions, decisions, decisions.

Technically speaking, while text/gemini is Unicode-friendly by default, the links are not. The location part must be encoded, following idiosyncratic local customs, perhaps such as:

=> gemini://xn--4o8h.mozz.us/%F0%9F%90%87.gmi Hoppity hop

In other words, a bit of punycode + percent encoding + glossing over normalization + other niceties. Everything must be US-ASCII clean at the end of the day.

Some will make the distinction between "content" vs. "addressing":

[2020-12-15T07:35:09.590Z] <bie> also... this was never about internationalized content, but a lot of people like to pretend that it is
[2020-12-15T07:36:40.861Z] <bie> addressing != content

While there is some merit in such hair-splitting (it has to be handled at different levels of the stack), it distracts from the crux of the problem:

=> gemini://🐰.mozz.us/🐇.gmi Hoppity hop

vs.

=> gemini://xn--4o8h.mozz.us/%F0%9F%90%87.gmi Hoppity hop

As it stands, the first variant cannot be handled by gemini (neither in text/gemini nor in the protocol itself), with further technical gotchas such as address resolution and whatnot along the way. It must be converted to the second variant, the US-ASCII one.

So, what to do?

This is what these various conversations are about: exploring what the scope of the problem is, and what to do about it, if anything, so that one can eventually reach an informed decision.

For example:

[2020-12-14T22:12:14.914Z] <remyabel> I lurk this channel and the mailing lists and keep seeing people trying to extend gemini or make it web-like, there's just no point in arguing against it
[2020-12-14T22:12:28.578Z] <CoopDot> I used to be in the US-ASCII only camp but now it's more "do the bare mininum to not forbid UTF-8 'URLs' in the spec and make strong recommendations in best-practices.gmi"

^Those are the "cannot be arsed" camp: things are the way they are, and they cannot be bothered to change anything, technically speaking... we are done. The "not-my-problem" camp.

[2020-12-15T07:30:13.193Z] <khuxkm> honestly my issue with the iri thread was the whole "we NEED this" and "we MUST do this it's our MORAL DUTY"
[2020-12-15T07:30:52.931Z] <khuxkm> like forcing everybody to use IRIs or be non-compliant with the spec is somehow going to solve discrimination

^Those are the... hmmm... oh-so-fragile "entitled" camp.

To summarize: this is a genuine choice for gemini. And not so much a technical issue.
Tangentially unrelated, as always:

The Internet is for End Users
https://tools.ietf.org/html/rfc8890

Terminology, Power, and Inclusive Language in Internet-Drafts and RFCs
https://tools.ietf.org/id/draft-knodel-terminology-04.html
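
For readers who want to see the conversion spelled out, here is a minimal Python sketch of the "punycode + percent encoding" step described in the message above. It is an illustration only, not code from any client in this thread: the helper names to_ascii_label and iri_to_uri are hypothetical, it uses Python's raw "punycode" codec (so it glosses over normalization, exactly as the message warns), and it assumes the path has not already been percent-encoded.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def to_ascii_label(label: str) -> str:
    """Punycode a single DNS label; pure-ASCII labels pass through unchanged."""
    if label.isascii():
        return label
    # Assumption: no nameprep/normalization is applied before encoding.
    return "xn--" + label.encode("punycode").decode("ascii")


def iri_to_uri(iri: str) -> str:
    """Turn a Unicode link into its US-ASCII form (host punycoded, path percent-encoded)."""
    parts = urlsplit(iri)
    host = ".".join(to_ascii_label(label) for label in parts.hostname.split("."))
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    # quote() encodes the path as UTF-8 and leaves "/" alone by default.
    path = quote(parts.path)
    return urlunsplit((parts.scheme, host, path, parts.query, parts.fragment))


# U+1F430 is the rabbit face in the host, U+1F407 the rabbit in the path.
print(iri_to_uri("gemini://\U0001F430.mozz.us/\U0001F407.gmi"))
```

A real client would more likely rely on a full IDNA implementation than on the bare punycode codec; the sketch only shows which part of the link gets which encoding.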
Thank you so much for the summary! I lost track of the ML for a few days and... it was just too much >.<

My contention is this: I want us to support internationalization as best we can. And as far as I understand it the web has done this with punycoded domain names and percent encoded paths for years.

And after a few hours of fiddling with my own cli tool gemcall (https://notabug.org/tinyrabbit/gemcall/src/master/gemcall) it doesn't look like it's too hard. Check line 35:

parsed = up.urlparse(url).encode("idna")

As far as I can tell gemcall can now handle gemini://[rabbit emoji that I don't know how to include...].mozz.us/

As for the path that follows, that must be percent encoded in the gemtext document. There is no way for a client to know if a path is already percent encoded or not, and percent encoding twice breaks the link. Consider this:

=> gemini://example.com/why-space-is-%20-in-urls.gmi

We see that this needs to be percent encoded, but a tool can't reliably tell if it does or if it is already.

Requiring clients to punycode domain names will break existing clients. Sorry about that, but let's just fix them instead of complaining about it.

Cheers,
ew0k

Oh! Mandatory rabbit!

()_()
(^.^)
_(| |)_
This is the only time I'm going to reply to one of these threads, but I should actually say what I think: Supporting IDNs and IRIs is something I can get behind, but it simply doesn't require a spec change. Maybe a "Gemini Best Practices" change, or even (in the extreme) a companion spec detailing the basics of punycoding and percent encoding, but that's it. Firstly, on IRIs: there is literally no reason not to support them. You already percent-encode half of ascii anyways, why not just encode the rest? 100% on board. IDNs I also agree with, as long as it doesn't get too out of hand with what you're requiring from people, and as long as at least some consideration is given to people's more obscure languages that may or may not have various necessary libraries. I still think it should be supported, but I'd say to heed the robustness principle: "Be liberal in what you accept, and conservative in what you send." I support allowing people to write in unicode in gemtext, including in link lines, but the client should convert it (transparently or not) for the server, no spec change required. Look at what Lagrange did for v0.13! It even displays the unicode URL in the address bar, and deals with all the conversions so the content authors and users don't even have to think about it. That's the optimal change for an "advanced" client I'd say, and a "simple" client could just convert it to punycode/percent encoded once and display and work with that afterwards so they don't have to worry about the unicode version internally. https://gmi.skyjake.fi/lagrange/ My main complaint I have is how long these threads run, and how they completely overtake the mailing list, drowning out pretty much everything else. Even if they were arguing about something I passionately argue for, I'd still make fun of them because they're so long that they're farcical. 
They're full of people misreading everything that's being said (I'm within that group), people that argue about something else that's vaguely related but not really. The first thread was about IDNs and people immediately started talking about IRIs instead! (or maybe vice-versa? I can't keep the two terms straight).
On Tuesday 15 December 2020, 16:05:14 CET, Björn Wärmedal wrote:
> As for the path that follows, that must be percent encoded in the
> gemtext document. There is no way for a client to know if a path is
> already percent encoded or not, and percent encoding twice breaks the
> link. Consider this:
> => gemini://example.com/why-space-is-%20-in-urls.gmi
> We see that this
> needs to be percent encoded, but a tool can't reliably tell if it does
> or if it is already.

This is not true: even when using IRIs, reserved characters such as spaces HAVE to be percent-encoded.

So, if you see a "%", it is percent encoding. If you want to link to a path containing a percent, you have to percent encode the percent, resulting in %25.

As a result, percent encoding twice does not break the link, as you only percent encode what is not percent encoded already.

Côme
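
The rule stated in this message (a "%" always introduces an escape, so only encode what is not yet encoded) can be sketched in Python as follows. This is one possible reading of the rule, not an implementation taken from the thread; encode_path is a hypothetical name, and keeping "/" and "%" in the safe set is the assumption that makes the function idempotent.

```python
from urllib.parse import quote, unquote


def encode_path(path: str) -> str:
    """Percent-encode a path, treating every existing "%" as part of an escape."""
    # Assumption: "/" is a delimiter and "%" is never a literal character.
    return quote(path, safe="/%")


once = encode_path("/caf\u00e9/why-space-is-%20-in-urls.gmi")
twice = encode_path(once)

# Under this rule a second pass changes nothing.
print(once == twice)

# An author who wants a literal "%20" in the file name has to write "%2520".
print(unquote("/why-space-is-%2520-in-urls.gmi"))
```

The trade-off is the one raised in the thread: the function cannot tell a literal percent sign from an escape, so that burden moves to the author.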
On Tue, 15 Dec 2020 08:28:00 -0700 Alex // nytpu <alex at nytpu.com> wrote: > My main complaint I have is how long these threads run, and how they > completely overtake the mailing list, drowning out pretty much > everything else. Even if they were arguing about something I > passionately argue for, I'd still make fun of them because they're so > long that they're farcical. They're full of people misreading everything > that's being said (I'm within that group), people that argue about > something else that's vaguely related but not really. The first thread > was about IDNs and people immediately started talking about IRIs > instead! (or maybe vice-versa? I can't keep the two terms straight). I highly recommend using a client with a clearly threaded overview of incoming messages. For example, if I am not interested in a particular thread of discussion, or it has become hard to follow, I can just fold it away. We should perhaps be better at changing the subject line in cases where discussion delves into details or strays from the original topic. I believe that misreading fills an important function when discussing a specification; even a fundamentally bad-faith reading is useful. In my opinion a spec should leave as little room for interpretation as possible, and misreading exposes these little ambiguities at an earlier stage where they would otherwise later cause divergent implementations. It also makes it clear when ideas are more complex than anticipated. I think we hit an iceberg with IDN/IRI, and we've failed to properly separate discussion about its rationale from discussion about its implementation, possibly because Gemini is explicit in that its rationale also concerns its implementation details (paraphrased: it should be simple and easy to implement). > The main reason I wrote this is to clarify that I am firmly against > changing the spec, no matter how noble the causes, for anything other > than clarification or typos. 
There are lots of possibilities that don't > require a spec change, both in internationalization support and the > other common mailing list complaints that were long like this thread in > the past (usually about gemtext's "weaknesses"). I agree with this, at least to the point that changes, if any, should guarantee forward compatibility with older implementations. Breaking changes have a high cost and so should be reserved for breaking bugs, not nice-to-haves or workable practical shortcomings. There are many features that I think would improve the protocol when thought of as just features, but would be detrimental when considering the social burden of formalizing and implementing them. I'll happily trade those features for a stable and clear spec. -- Philip
On Tue, Dec 15, 2020 at 02:27:36PM +0100, Petite Abeille <petite.abeille at gmail.com> wrote a message of 122 lines which said: > To summarize: this is a genuine choice for gemini. And not so much a > technical issue. It is quite possible that there is unanimity on these two points. So, at least, we all agree on something.
On Tue, Dec 15, 2020 at 08:28:00AM -0700, Alex // nytpu <alex at nytpu.com> wrote a message of 100 lines which said: > My main complaint I have is how long these threads run, and how they > completely overtake the mailing list, drowning out pretty much > everything else. [I agree with Philip: reading email without a threaded email reader is a bad idea.] I'm relatively new here, so I used the mailing list but if there are other ways to discuss and follow the work on the specification (or in companion documents, such as robots.txt standard), I'd be happy to use them. Issues on a ticket tracker at a Gitlab?
On Tue, 15 Dec 2020 18:26:14 +0100, Stephane Bortzmeyer wrote:
> [I agree with Philip: reading email without a threaded email reader is a
> bad idea.]
>
> [...] other ways to discuss and follow the work on the specification [...]
> Issues on a ticket tracker at a Gitlab?

Not sure if Usenet newsgroups are still a thing to start these days, but I like that news does not clutter personal e-mail accounts. No idea how and where comp.protocols.gemini could be established.
On 12/15/20 4:41 PM, Côme Chilliet wrote:
> So, if you see a "%", it is percent encoding. If you want to link to
> a path containing a percent, you have to percent encode the percent,
> resulting in %25.
>
> As a result, percent encoding twice does not break the link, as you
> only percent encode what is not percent encoded already.

So that would define a special percent-encoding for clients, where they'd encode everything except percent signs, right? So in this link:
=>gemini://example.com/????-why-space-is-%20-in-urls.gmi
, a client would have to percent-encode the emojis, but leave the "%20" bit alone? This seems very confusing; it's also not one-to-one (encoding then decoding "%20" gives " " back)... And if you just skip percent-encoding when the only "encodable" characters in the path are percent signs, that's confusing too. That rule also doesn't work on
=>gemini://example.com/%XY.gmi

Also, if an author wants to link to "why-space-is-%20-in-urls.gmi" at example.com, the only option would be to write
=>gemini://example.com/why-space-is-%2520-in-urls.gmi
This introduces a pitfall for authors: they never have to think about percent-encoding, *except* when there are percent signs in the path.

How is this better than agreeing that link paths in gemtext are always completely percent-encoded? In that case, clients can percent-decode the path and display that. Authors could use a tool that 'fully' (as in, it also turns every "%" into "%25") percent-encodes a link for them.

Counterintuitively, in this way I think mandating completely percent-encoded paths in gemtext link lines might actually result in easier linking for authors.

The same (clients may/should display, authors use tool) could be done with internationalised domain names (could be the same tool that does the percent-encoding), but crucially there is no ambiguity there, because an ascii domain name with "xn--" is unrepresentable in punycode and disallowed (I think). On the other hand, allowing anything whatsoever in the domain name and nothing in the path would be strange and a bit inconsistent.

Assuming we don't do IRI paths in gemtext link lines, I don't really have a clear opinion regarding IDNs, the choice is between:
* all clients need to convert to punycode when following a link, authors can easily link to IDNs without a tool (though they're already using a tool for unicode paths), somewhat inconsistent/strange
* fancy clients will convert from punycode when displaying a link, authors need a tool to be able to easily make links to IDNs (though they're already using a tool for unicode paths)
On Tuesday 15 December 2020, 20:11:12 CET, PJ vM wrote:
> So that would define a special percent-encoding for clients, where
> they'd encode everything except percent signs, right? So in this link:
> =>gemini://example.com/????-why-space-is-%20-in-urls.gmi
> , a client would have to percent-encode the emojis, but leave the "%20"
> bit alone? This seems very confusing; it's also not one-to-one (encoding
> then decoding "%20" gives " " back)... And if you just skip
> percent-encoding when the only "encodable" characters in the path are
> percent signs, that's confusing too. That rule also doesn't work on
> =>gemini://example.com/%XY.gmi

Because this is not a valid link, neither URI nor IRI.

> Also, if an author wants to link to "why-space-is-%20-in-urls.gmi" at
> example.com, the only option would be to write
> =>gemini://example.com/why-space-is-%2520-in-urls.gmi
> This introduces a pitfall for authors: they never have to think about
> percent-encoding, *except* when there are percent signs in the path.

Yes, and spaces, and delimiter characters, such as "/".

> How is this better than agreeing that link paths in gemtext are always
> completely percent-encoded? In that case, clients can percent-decode the
> path and display that. Authors could use a tool that 'fully' (as in, it
> also turns every "%" into "%25") percent-encodes a link for them.

Because a completely percent encoded link is hell to read and to write, for instance:
gemini://gemini.circumlunar.space/%64%6f%63%73/%66%61%71%2e%67%6d%69

So I think you do not mean "completely percent-encoded", you mean percent encode non-ascii non-reserved text, and you feel like this is better because you are used to English and ascii. But you will always need to remember which chars you need to percent encode. You will never be able to use "/" in a file name without percent encoding. Or "?".

> Counterintuitively, in this way I think mandating completely
> percent-encoded paths in gemtext link lines might actually result in
> easier linking for authors.

No, it is just a different set of characters to percent encode.

> The same (clients may/should display, authors use tool) could be done
> with internationalised domain names (could be the same tool that does
> the percent-encoding), but crucially there is no ambiguity there,
> because an ascii domain name with "xn--" is unrepresentable in punycode
> and disallowed (I think). On the other hand, allowing anything
> whatsoever in the domain name and nothing in the path would be strange
> and a bit inconsistent.

Yes, IDNs are covered by punycode, but the question remains whether I am allowed to use the unicode form in a link line.

=> gemini://gémeaux.example.com

Is that legal?

> Assuming we don't do IRI paths in gemtext link lines, I don't really
> have an clear opinion regarding IDNs, the choice is between:
> * all clients need to convert to punycode when following a link, authors
> can easily link to IDNs without a tool (though they're already using a
> tool for unicode paths), somewhat inconsistent/strange
> * fancy clients will convert from punycode when displaying a link,
> authors need a tool to be able to easily make links to IDNs (though
> they're already using a tool for unicode paths)

Yes.
I am for IDN in link lines, but I am also in favor of IRI in link lines.
And I would be supportive of using IRI in request line also for that matter. And redirect responses.

Côme
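
Two points from this exchange are easy to check mechanically: that "%XY" is not a well-formed escape, and what the "completely percent-encoded" example actually decodes to. The Python sketch below is only an illustration; has_valid_escapes is a hypothetical helper that checks nothing beyond the shape of the escapes.

```python
import re
from urllib.parse import unquote

# A "%" that is not followed by exactly two hex digits is malformed.
_BAD_ESCAPE = re.compile(r"%(?![0-9A-Fa-f]{2})")


def has_valid_escapes(reference: str) -> bool:
    """Return True if every "%" in the reference starts a well-formed escape."""
    return _BAD_ESCAPE.search(reference) is None


print(has_valid_escapes("gemini://example.com/%XY.gmi"))
print(has_valid_escapes("gemini://example.com/why-space-is-%20-in-urls.gmi"))

# The fully encoded example from the message, decoded for readability.
print(unquote("%64%6f%63%73/%66%61%71%2e%67%6d%69"))
```

A client could use such a check to reject or warn about a link line before attempting any further encoding.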
It was thus said that the Great Alex // nytpu once stated:
>
> My main complaint I have is how long these threads run, and how they
> completely overtake the mailing list, drowning out pretty much
> everything else.

You haven't been here long, have you? Because for *months* this list talked almost exclusively about text/gemini. Just check the threaded archives [1] and look upon the threads, ye mighty, and despair! Take special note how long the thread "Text reflow woes" goes (from 2019 well into 2020).

It's also worth noting that different people have different expectations as to volume of email. I've been on lists where people freak out if they get more than 1 email per day. Personally, I consider the volume of this list to be low-to-mid levels of volume. One list I was on (it is no longer around) would typically get around double digits of email per day, and on one memorable day, hit 500 messages (yes, 500 email in a single day---that set my expectations on what a "high-volume mailing list" is).

-spc

[1] https://lists.orbitalfox.eu/archives/gemini/2019/thread.html
    https://lists.orbitalfox.eu/archives/gemini/2020/thread.html
> How is this better than agreeing that link paths in gemtext are always
> completely percent-encoded? In that case, clients can percent-decode the
> path and display that. Authors could use a tool that 'fully' (as in, it
> also turns every "%" into "%25") percent-encodes a link for them.
>
> Counterintuitively, in this way I think mandating completely
> percent-encoded paths in gemtext link lines might actually result in
> easier linking for authors.

This is -- as I read it -- what the spec requires now. I think that's the best solution. The wording in the spec can (and maybe should) be clarified, though.

> * all clients need to convert to punycode when following a link, authors
> can easily link to IDNs without a tool (though they're already using a
> tool for unicode paths), somewhat inconsistent/strange
> * fancy clients will convert from punycode when displaying a link,
> authors need a tool to be able to easily make links to IDNs (though
> they're already using a tool for unicode paths)

I think that all clients *should* convert links to punycode. If they did authors could write punycoded or unicode domains in their links and both would work. Right now authors can't expect clients to punycode for them, so the safest recourse is to punycode links yourself before publishing.

Note that none of this requires a spec change (except for maybe clarifying the percent encoding of links in gemtext). I think it's fair to assume that IDNs will just work, and if they don't work in a browser/client we can report that as a bug (or send a PR that fixes it). After all IDNs have existed for some years, and URL libs across languages are very likely to support it.

Cheers,
ew0k
Björn Wärmedal <bjorn.warmedal at gmail.com> writes:

>> How is this better than agreeing that link paths in gemtext are always
>> completely percent-encoded? In that case, clients can percent-decode the
>> path and display that. Authors could use a tool that 'fully' (as in, it
>> also turns every "%" into "%25") percent-encodes a link for them.
>>
>> Counterintuitively, in this way I think mandating completely
>> percent-encoded paths in gemtext link lines might actually result in
>> easier linking for authors.
>
> This is -- as I read it -- what the spec requires now. I think that's
> the best solution. The wording in the spec can (and maybe should) be
> clarified, though.

I don't think this is going to be acceptable for authors. It's unreasonable to ask authors to use a tool other than their favorite text editor to write gemtext. Why is it reasonable for the client to have to punycode the domain (an uncommon encoding for which not every common language has a library), but unreasonable for it to have to urlencode the path (a common encoding for which libraries are ubiquitous)? Why is it so hard to convince people to just do the right thing?

--
+-----------------------------------------------------------+
| Jason F. McBrayer                   jmcbray at carcosa.net |
| A flower falls, even though we love it; and a weed grows, |
| even though we do not love it.                   -- Dogen |
On 12/15/20 9:00 PM, Côme Chilliet wrote:
>> =>gemini://example.com/%XY.gmi
> Because this is not a valid link, neither URI nor IRI.

My thinking was that it is a valid link after percent-encoding. But OK, so the client would percent-encode exactly those characters that are neither reserved nor in ascii. That would indeed be unambiguous. It would not be one-to-one with percent-decoding, though, which is unavoidable with this approach to IRIs.

> a completely percent encoded link ...

Yes, that was a misuse of the word "completely" on my part.

> you feel like this is better because you are used to English and
> ascii.

That is a failed attempt at mind-reading.

> But you will always need to remember which chars you need to percent
> encode. You will never be able to use "/" in a file name without
> percent encoding. Or "?".

Yes, when someone wants to link to a resource with "?", "/" or "#" in the filename, that will basically always require manual intervention. One error in my previous email was that of course, you can also use a tool to percent-encode just spaces and percent signs for you. There's not much difference in what the author has to think about, then.

Still, with both IRI paths and IDNs, I'm not really seeing the "added value" of having them in the spec. I'm quite sure they will be there either way: if it doesn't get into the spec, it is still possible for clients to provide the same experience with (seemingly) about the same amount of programming effort (and it seems plenty of client authors would), and authors would not be much worse off if they use a tool. Meanwhile, the negatives are rather visible to me: they're breaking changes, and they increase the complexity that a client *must* have.

--
pjvm
> I don't think this is going to be acceptable for authors.

Maybe not. I don't really know.

> It's
> unreasonable to ask authors to use a tool other than their favorite text
> editor to write gemtext.

Is it? Unreasonable is a strong word here. I assume there would be some servers out there that would do this on the fly when serving gemtext, but I can't know that for sure. There could also be a CLI tool you can run on your file that fixes links. Or some other solution.

> Why is it reasonable for the client to have to
> punycode the domain (an uncommon encoding for which not every common
> language has a library),

I made the assumption that most languages dealing in stuff like URLs would have support for it. I may be in the wrong there. I also made the assumption that punycoding was common, but I may be in the wrong there too. Which method *is* common?

> but unreasonable for it to have to urlencode
> the path (a common encoding for which libraries are ubiquitous)?

Because, as I tried to point out, there is no reasonably simple heuristic for determining whether a URL is already percent encoded or not. And percent encoding a URL that is already percent encoded exchanges all % characters with %25. Attempting to punycode a domain name that is already punycoded, however, changes nothing at all. No heuristics are needed, the client can just punycode everything.

> Why is
> it so hard to convince people to just do the right thing?

Why are you so adamantly convinced that *you* are arguing for "the right thing"? Is there an objective measurement here that you may share with me?
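
The asymmetry claimed here (punycoding twice is harmless, percent-encoding twice is not) can be observed with a few lines of Python. This is a sketch for illustration only; it relies on Python's built-in "idna" codec, which implements the older IDNA 2003 rules, and on the default behaviour of urllib.parse.quote. The host and path are made-up examples.

```python
from urllib.parse import quote

# Converting a host name to its ASCII form twice gives the same result,
# because an all-ASCII label is passed through untouched.
host = "g\u00e9meaux.example.com"
host_once = host.encode("idna").decode("ascii")
host_twice = host_once.encode("idna").decode("ascii")
print(host_once == host_twice)

# A naive second percent-encoding pass rewrites every "%" as "%25".
path = "/foo bar.gmi"
path_once = quote(path)
path_twice = quote(path_once)
print(path_once)
print(path_twice)
```

This is why a client can blindly punycode every host it sees, while blind percent-encoding needs either a convention (all links are already encoded) or a rule about "%" such as the one proposed earlier in the thread.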
On Tue, Dec 15, 2020 at 2:11 PM PJ vM <pjvm742 at disroot.org> wrote:

> This introduces a pitfall for authors: they never have to think about
> percent-encoding, *except* when there are percent signs in the path.

Or spaces, because in a link line a space terminates the URI. So if the author wants to link to "gemini://example.com/foo bar", the author *must* write gemini://example.com/foo%20bar. In principle you have the same problem with wanting line endings in a URI, though they are much less likely to be an issue.

All this boils down to this question: "Who should pay the price for i18n in links, clients or authors?" Any third alternative is hacky at best (there is typically no library routine for "encode everything that needs to be encoded except in the sequences %20 and %25") and broken at worst.

> How is this better than agreeing that link paths in gemtext are always
> completely percent-encoded?

I don't understand. Do you mean "link paths are already %-encoded when you get them" (status quo) or "link paths must be %-encoded when you get them" (IRIs in link lines)?

On Tue, Dec 15, 2020 at 3:00 PM Côme Chilliet <come at chilliet.eu> wrote:

> Because a completely percent encoded link is hell to read and to write, for
> instance:
> gemini://gemini.circumlunar.space/%64%6f%63%73/%66%61%71%2e%67%6d%69

+1

> Yes.
> I am for IDN in link lines, but I am also in favor of IRI in link lines.

+1

> And I would be supportive of using IRI in request line also for that
> matter. And redirect responses.

-1. That's a change to the protocol, and only protocol agents (clients, servers) should see such lines; it doesn't matter how ugly they are. So "machines speak URIs, humans speak IRIs".

John Cowan          http://vrici.lojban.org/~cowan        cowan at ccil.org
Uneasy lies the head that wears the Editor's hat! --Eddie Foirbeis Climo
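
The point that a space terminates the URI in a link line follows directly from how a link line is split. The Python sketch below is a simplified reading of the gemtext link syntax ("=>", optional whitespace, URL, optional whitespace and label); parse_link_line is a hypothetical helper, not code from any client mentioned in the thread.

```python
from typing import Optional, Tuple


def parse_link_line(line: str) -> Optional[Tuple[str, Optional[str]]]:
    """Split a gemtext link line into (url, label); return None if it is not a link."""
    if not line.startswith("=>"):
        return None
    # The first whitespace-delimited token is the URL, the rest is the label.
    parts = line[2:].strip().split(maxsplit=1)
    if not parts:
        return None
    url = parts[0]
    label = parts[1] if len(parts) > 1 else None
    return url, label


# An unencoded space cuts the URL short and turns the rest into the label.
print(parse_link_line("=> gemini://example.com/foo bar"))
print(parse_link_line("=> gemini://example.com/foo%20bar A file with a space"))
```

Whatever is decided about non-ASCII characters, this split is why spaces in particular can never be left raw in a link line.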
Hello, Jason McBrayer writes: > It's unreasonable to ask authors to use a tool other than > their favorite text editor to write gemtext. Yepp. 1+ Cheers, Erich -- Keep it simple!
It was thus said that the Great Björn Wärmedal once stated:
>
> > but unreasonable for it to have to urlencode the path (a common encoding
> > for which libraries are ubiquitous)?
>
> Because, as I tried to point out, there is no reasonably simple
> heuristic for determining whether a URL is already percent encoded or not.
> And percent encoding a URL that is already percent encoded exchanges all %
> characters with %25. Attempting to punycode a domain name that is already
> punycoded, however, changes nothing at all. No heuristics are needed, the
> client can just punycode everything.

I can't say for certain what most clients do, but I'm under the impression that some (the majority?) use some existing library to parse links. The specification states that relative links are allowed in text/gemini:

=> ../%F0%9D%92%BB%F0%9D%92%B6%F0%9D%93%83%F0%9D%92%B8%F0%9D%93%8E.txt Some 𝒻𝒶𝓃𝒸𝓎 stuff here

but a full URI needs to be sent to the server, so some processing of the link is required (specifically, section 5.2 of RFC-3986). And existing libraries help here. The library I'm currently using will parse the above link into the following structure:

  {
    path = "../𝒻𝒶𝓃𝒸𝓎.txt"
  }

Note how the text has been translated and any percent encoding has been decoded. Next, the base URL of the page:

  gemini://example.com/files/others/

has previously been parsed (because it was needed to retrieve the page currently being viewed):

  {
    path = "/files/others/",
    port = 1965.000000,
    host = "example.com",
    scheme = "gemini",
  }

The two are then merged into a single reference:

  {
    path = "/files/𝒻𝒶𝓃𝒸𝓎.txt",
    port = 1965.000000,
    host = "example.com",
    scheme = "gemini",
  }

Then to make a request, this new link is converted into a URI:

  gemini://example.com/files/%F0%9D%92%BB%F0%9D%92%B6%F0%9D%93%83%F0%9D%92%B8%F0%9D%93%8E.txt

As you can see, that process has re-encoded the path, percent-encoding it. I would expect that some (the majority?) of clients are doing something similar to this---doing a conversion from percent-encoding, merging references, then converting to percent-encoding (except for the host, which needs to be converted to punycode).

It would be instructive to know how clients are handling this---do they decode percent-encoded data, merge the base link to the relative link and re-encode? Or something different?

-spc
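
For comparison, the decode, merge and re-encode round trip described above can be reproduced with Python's urllib.parse. This is a sketch under one notable assumption: urljoin() only resolves relative references for schemes it knows about, so the example first appends "gemini" to the module's uses_relative and uses_netloc lists. It is not the library used in the message above.

```python
import urllib.parse as up

# Assumption: register the scheme so urljoin() applies relative resolution to it.
for registry in (up.uses_relative, up.uses_netloc):
    if "gemini" not in registry:
        registry.append("gemini")

base = "gemini://example.com/files/others/"
link = "../%F0%9D%92%BB%F0%9D%92%B6%F0%9D%93%83%F0%9D%92%B8%F0%9D%93%8E.txt"

# urljoin() merges the references without touching the existing escapes.
resolved = up.urljoin(base, link)
print(resolved)

# Decoding gives the human-readable path; encoding it again restores the escapes.
decoded_path = up.unquote(up.urlsplit(resolved).path)
reencoded_path = up.quote(decoded_path)
print(reencoded_path == up.urlsplit(resolved).path)
```

Here the round trip is lossless because the link contained no literal "%"; the ambiguous cases discussed elsewhere in the thread are exactly the ones where decoding and re-encoding would not give back the original text.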
> It would be instructive to know how clients are handling this---do they > decode percent-encoded data, merge the base link to the relative link and > re-encode? Or something different? > > -spc My clients (gemget, Amfora) are in Go, so I just `Parse` both the base link and the relative link, and then use `base.ResolveReference(rel)`. This means I don't have to do any decoding or anything at all. URL.Path and URL.RawPath can be used to get the decoded and encoded path respectively, although I have no need in this context. https://golang.org/pkg/net/url/#URL https://golang.org/pkg/net/url/#Parse https://golang.org/pkg/net/url/#URL.ResolveReference makeworld
How does a client handle a link like the following: => essays/why-spaces-are-%20-in-URLs.gmi The assumption here is that the author has not percent encoded themselves -- this is the actual filename, %20 and all. How can the client tell if it's percent encoded or not? If you start by decoding it you distort the filename. If you just assume it isn't percent encoded and go ahead and do that you will handle this link correctly but break any links that are already percent encoded. I've only done this in python, using the urllib.parse library. I can tell that to encode or decode, but it will do what I tell it to without exception. It's up to me to build logic that avoids breaking the edge cases. We can decide to *always* percent encode links in gemtext (as the spec states now) or to *never* do it, but I don't see how we can reasonably have both. And never doing it means we can never link to a file with spaces in the URL, and will have to percent decode anything we copy paste from web browser's address bar. There will be extra work for authors either way. Consider another hypothetical case: => teddybearoftheyear.com/vote?ew0k%20The%20Great Vote for me! How would you solve that? However much I *want* to have IRIs and IDNs in gemtext and leave the work to clients and servers, I don't have a solution for that as an implementer. Cheers, ew0k
It was thus said that the Great Björn Wärmedal once stated: > How does a client handle a link like the following: > => essays/why-spaces-are-%20-in-URLs.gmi > > The assumption here is that the author has not percent encoded > themselves -- this is the actual filename, %20 and all. > > How can the client tell if it's percent encoded or not? If you start > by decoding it you distort the filename. If you just assume it isn't > percent encoded and go ahead and do that you will handle this link > correctly but break any links that are already percent encoded. I've > only done this in python, using the urllib.parse library. I can tell > that to encode or decode, but it will do what I tell it to without > exception. It's up to me to build logic that avoids breaking the edge > cases. > > We can decide to *always* percent encode links in gemtext (as the spec > states now) or to *never* do it, but I don't see how we can reasonably > have both. And never doing it means we can never link to a file with > spaces in the URL, and will have to percent decode anything we copy > paste from web browser's address bar. There will be extra work for > authors either way. > > Consider another hypothetical case: > => teddybearoftheyear.com/vote?ew0k%20The%20Great Vote for me! > > How would you solve that? > > However much I *want* to have IRIs and IDNs in gemtext and leave the > work to clients and servers, I don't have a solution for that as an > implementer. I don't have a solution either, and while trying to nail down every possible corner case is admirable, sometimes, you just have to say, "don't do that!" (or in other words, document or warn about the corner case). 
It's already the case on Unix systems where a file name can technically have any character other than '/' (because it's the path separator) and NUL (marks the end of the string), but I doubt you'll find any filenames with control characters [1] or even "problematic characters because of the shell" like "&", "?", or "*" in them. People just kind of learn what they can and can't use for filenames over time. In fact, that might be an interesting thing for Lupa [2] or GUS to report on---characters found in filenames [3]. I'm not sure how apropos this is, but years ago, when I was at university studying Computer Science, I was writing a program (for a friend, not course related) where I wanted to log errors so they would later be seen (as the program would run unattended, and any messages to the display would not be seen). I could log to a file, but the disk could fill up. Okay, if that happened, I could log to the printer, but there might not be a printer (or it could be turned off---this was back when printers were hooked directly to a computer). I asked one of my instructors (who worked at IBM, and was on the team for one of the first Fortran compilers for IBM) what I should do. His advice was (and as sad as this is, it's pretty true), if you don't know how to handle an error, don't bother looking for it. -spc > Cheers, > ew0k [1] Unless it's for pranking someone, not that I would know that. [2] Stéphane's new research crawler for Gemini. [3] This reminds me, I have a new feature on my own server that allows one to dive into a ZIP file: gemini://gemini.conman.org/test/UCSD-Pascal-source.zip/ vs. gemini://gemini.conman.org/test/UCSD-Pascal-source.zip Right now it's not much of an issue since the filenames for the "proof-of-concept" file are just plain ASCII, but in the general case, I suppose I should support conversion of filenames to UTF-8, but that might be a hard case as well, as character encodings aren't readily recorded in ZIP files.
It was thus said that the Great Björn Wärmedal once stated: > How does a client handle a link like the following: > => essays/why-spaces-are-%20-in-URLs.gmi > > The assumption here is that the author has not percent encoded > themselves -- this is the actual filename, %20 and all. And speaking of this, test #31 of the Gemini Client Torture Test [1] has this exact case---the link contains characters that should be encoded but aren't. It's been interesting to see which clients get an error, and which ones encode the bad characters. And for this test, there is no right answer---it's there to inform implementors that you'll encounter wrong stuff all the time, and you had better be prepared to do *something* [2]. -spc [1] gemini://gemini.conman.org/test/torture/0031 [2] Notwithstanding the advice I presented in my previous reply to this. Sometimes, crashing *is* a valid response to some unknown state, but it really depends upon the context of the program [3]. [3] I can expand on this if anyone cares.
Björn Wärmedal <bjorn.warmedal at gmail.com> writes: > Because -- as I tried to point out -- there is no reasonably simple > heuristic for determining whether a URL is already percent encoded or > not. And percent encoding a URL that is already percent encoded > exchanges all % characters with %25. It's not that hard. All you have to do is percent decode the path *first*, then percent encode it. Consider this URL, which is a worst-case for what you're talking about: gemini://example.com/🐇%20🥕.gmi Unquoting the path gives you 'gemini://example.com/🐇 🥕.gmi', of course. And then quoting it gives you 'gemini://example.com/%F0%9F%90%87%20%F0%9F%A5%95.gmi' which decodes correctly. Unquoting a path that is already plain ASCII does nothing to it. -- Jason McBrayer | "Strange is the night where black stars rise, jmcbray at carcosa.net | and strange moons circle through the skies, | but stranger still is lost Carcosa." | -- Robert W. Chambers, The King in Yellow
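The decode-first-then-encode idea can be sketched with urllib.parse, whose unquote/quote names match the wording above. normalize_path() is a made-up helper name, and the sketch only looks at the path, not at a whole URL.

```python
from urllib.parse import quote, unquote


def normalize_path(path):
    """Decode first, then encode, so mixed input ends up fully encoded."""
    return quote(unquote(path), safe="/")


# Rabbit, an author-written %20, then carrot: the worst case quoted above.
mixed = "/\U0001F407%20\U0001F955.gmi"
encoded = normalize_path(mixed)

print(encoded)
print(normalize_path(encoded) == encoded)  # a second pass changes nothing
```

The round trip is stable, but it still reads a bare "%20" as a space, so a filename that really contains a percent sign has to be written with %25 by the author.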
Björn Wärmedal <bjorn.warmedal at gmail.com> writes: > How does a client handle a link like the following: > => essays/why-spaces-are-%20-in-URLs.gmi > > The assumption here is that the author has not percent encoded > themselves -- this is the actual filename, %20 and all. This doesn't work in HTML/HTTP, either. Go to https://jfm.carcosa.net/testme.html, look at the source, see what happens with each link. The web server is Apache. The upshot is that to include %, or any other reserved character in the link, you do need to pre-encode it in your source. That's obvious for ' ', because of the syntax of links in gemtext. But it's also true of %, etc. -- Jason McBrayer
On Thu, Dec 17, 2020 at 2:39 AM Björn Wärmedal <bjorn.warmedal at gmail.com> wrote: > How can the client tell if it's percent encoded or not? If you start > by decoding it you distort the filename. If you just assume it isn't > percent encoded and go ahead and do that you will handle this link > correctly but break any links that are already percent encoded. Exactly. To make things worse, space is a protocol element in link lines and *can't* be left unencoded by the author, whichever way we choose. > We can decide to *always* percent encode links in gemtext (as the spec > states now) or to *never* do it, but I don't see how we can reasonably > have both. I agree. But what we can have (and it's messy, but not as messy as the alternatives) is "authors encode percent and space" and "clients encode all other reserved and non-ASCII characters." > Consider another hypothetical case: > => teddybearoftheyear.com/vote?ew0k%20The%20Great Vote for me! That's the best you can do. But in the case where the link line is => teddybearoftheyear.com/vote?Иван%20Грозный Голосуй за меня! [1] [2] [3] then the client must translate it for sending over the wire into gemini://teddybearoftheyear.com/vote?%D0%98%D0%B2%D0%B0%D0%BD%20%D0%93%D1%80%D0%BE%D0%B7%D0%BD%D1%8B%D0%B9 because making the author type all that is wholly abominable. Online URL-encoders are not that helpful, because they give you + instead of %20. [1] This is Ivan the Terrible, who for most of his life was actually a quite effective tsar despite his (occupational) paranoia and a serious outbreak of madness just before he died; a better translation would be "Ivan the Formidable". Still, nobody would call him a teddy bear (and so his ukase "Vote for me!" would probably be in vain). 
[2] The latest spec change makes this line incorrect unless "teddybearoftheyear.com/vote" is to be interpreted as a relative path. It needs to be prefixed by "gemini://" or at the very least "//". [3] If the space had not been %-encoded by the author, the Tsar's second name would be part of the link name and not part of the IRI. John Cowan http://vrici.lojban.org/~cowan cowan at ccil.org 'My young friend, if you do not now, immediately and instantly, pull as hard as ever you can, it is my opinion that your acquaintance in the large-pattern leather ulster' (and by this he meant the Crocodile) 'will jerk you into yonder limpid stream before you can say Jack Robinson.' --the Bi-Coloured-Python-Rock-Snake
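The proposed split of labour ("authors encode percent and space", clients encode the rest) can be approximated in Python by telling quote() to leave '%' alone. This is a rough sketch of the client half only: client_encode() is a made-up name, it takes one already-separated component, and the input is the name that the wire form quoted above decodes to.

```python
from urllib.parse import quote


def client_encode(component):
    """Escape non-ASCII and other unsafe characters in one URL component,
    but leave '%' alone so the author's own escapes survive."""
    return quote(component, safe="%")


# The author wrote the space as %20; the client escapes the Cyrillic.
wire = client_encode("Иван%20Грозный")
print(wire)
```

Unlike the online encoders mentioned above, quote() writes a space as %20 and never as "+", although in this example the author has already taken care of the space.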
Help! I'm getting pulled in! On IRC, I wrote: [2020-12-14T21:26:44Z] <CoopDot> I'm staying out of debating IDN/IRI on the ML. What I've had to say has already been said more than once. My position has even shifted a bit since the threads started Some discussion happened and then I wrote something that got quoted here: Petite Abeille <petite.abeille at gmail.com> wrote: For example: [2020-12-14T22:12:14.914Z] <remyabel> I lurk this channel and the mailing lists and keep seeing people trying to extend gemini or make it web-like, there's just no point in arguing against it [2020-12-14T22:12:28.578Z] <CoopDot> I used to be in the US-ASCII only camp but now it's more "do the bare mininum to not forbid UTF-8 'URLs' in the spec and make strong recommendations in best-practices.gmi" ^Those are the "cannot be arsed" camp: things are the way they are, and cannot be bothered to changed anything, technically speaking... we are done. The "not-my-problem" camp. I'm assuming including me here was intentional. I truly can't tell if that is an accurate description of my position. "I used to be in the US-ASCII only camp" refers to me no longer thinking requiring everything to be encoded to pass as US-ASCII is the best idea. This is me moving away from the status quo towards a possible compromise. Or am I missing where we're going? "Do [...] not forbid UTF-8 'URLs' in the spec". Not forbidding is almost like allowing. We should attempt to not paint ourselves into a corner or bet on the wrong horse. "Make strong recommendations in best-practices.gmi" because we have to address it somewhere. Earlier in the same email, Petite Abeille <petite.abeille at gmail.com> wrote: It boils down to this: => gemini://🐰.mozz.us/🐇.gmi ?Hoppity hop? What do do with such a construct? Possible? Not possible? Allowed? Not allowed? First class citizen? Afterthought? How do deal with it, if at all? [...] => gemini://🐰.mozz.us/🐇.gmi ?Hoppity hop? vs. 
=> gemini://xn--4o8h.mozz.us/%F0%9F%90%87.gmi ?Hoppity hop? As it stands, the first variant cannot be handled by gemini -neither in text/gemini, nor in the protocol itself- with further technical gotchas such as address resolution and what not along the way. It must be converted to the second variant, the US-ASCII one. Let's examine the situation: The capsule author writes this link line in their text editor: => gemini://🐰.mozz.us/🐇.gmi ?Hoppity hop? The text editor may or may not change the syntax highlight to indicate an error with the URL. When saving the file, the text editor has an opportunity to "correct" the error by itself. Let's say the text editor is oblivious and the capsule author doesn't run the file through a linter. The file is ready to be served. A visitor requests the file. The server has an opportunity to scan the file before serving, but that would in most cases be a complete waste of resources, so it doesn't. The client parses the file. It has a choice to render the link line as a link or as text. (It could also break at the first sight of a bunny, but let's assume it doesn't.) The link is only a problem if the visitor is following it. At this point, it doesn't matter if the visitor follows a link or writes the URL in the address bar. The client has a choice to translate or not translate the URL before making the request. Domain name resolution is outside of the scope of the Gemini specification, we don't know if it can handle UTF-8 or not. If the visitor's network administrator has set up name resolution to accept UTF-8, they should probably also accept the punycoded version for compatibility. Let's assume "always punycode" is a safe option, the client has a choice of being proactive and doing the translation, or ignoring it and letting it fail if it will. I say both options are valid and the Gemini specification should at most refer to other specifications on this. (The third option to just refuse to connect is bad.) Moving on: (We will go back later.) 
We have the IP address and the request has reached the server. Let's assume this is over the regular internet and a punycoded domain is a must. The server compares "xn--4o8h.mozz.us" with whatever virtual hosts the server administrator has set up in the configuration file. Is it unreasonable for the administrator to expect the server software to match "🐰.mozz.us" in the configuration file to "xn--4o8h.mozz.us" coming in over the wire? How about the other way around? It's a local network, ASCII non-conforming bunnies hop into the server, and the administrator has only specified the punycode in the configuration file. Is it unreasonable to expect it to match? Reasonable or not, let's assume the virtual host is set up properly and go back in time to the client making the request. What do we do about the path? Should the client "help" the visitor by %-encoding non-ASCII bytes or send it as is and hope for the best? Should the client %-encode reserved characters the visitor writes in the address bar or let them fail? Anyway, the request reaches the server. "%20" becomes space and "%2b" becomes plus. I see no reason why it would be hard to also convert "%F0%9F%90%87" into bytes, so I will assume it isn't and wait for server software programmers to tell me how wrong I am. So now we have a string of bytes that we can use to fetch the bunny file. Wait. What happened with the case where the bunny isn't %-encoded? Why can't servers just blindly accept non-ASCII bytes as is? Is it a library thing? Anyway, I really should test this in a bunch of languages but I'm writing this on my phone on my way to work, so instead I present you this pseudo code:
```
"%F0%9F%90%87".url_decode() == "\xF0\x9F\x90\x87".url_decode()
"%F0%9F%90%87".url_decode() == "\xF0\x9F\x90\x87"
"\xF0\x9F\x90\x87" == "🐇"
```
If these 3 lines are all true for the server software, I see no reason to %-encode those non-ASCII bytes in the client or anywhere else. Surely I have missed something obvious somewhere. Can anyone help me? Maybe I just need coffee... -- Katarina (Please regard these ramblings as non-rhetorical)
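For what it is worth, the three pseudo-code comparisons can be checked directly in Python, taking urllib.parse.unquote_to_bytes as a stand-in for url_decode(). This says nothing about any particular server, only about what one standard library does with these bytes.

```python
from urllib.parse import unquote_to_bytes

escaped = "%F0%9F%90%87"
raw = b"\xf0\x9f\x90\x87"

# Line 1: decoding the escaped form and the raw bytes gives the same result.
print(unquote_to_bytes(escaped) == unquote_to_bytes(raw))
# Line 2: decoding the escaped form yields exactly the raw bytes.
print(unquote_to_bytes(escaped) == raw)
# Line 3: the raw bytes are the UTF-8 encoding of the rabbit (U+1F407).
print(raw.decode("utf-8") == "\U0001F407")
```

All three hold here, so at the byte level the decoder is indifferent; whether a URI parser in front of it accepts the raw bytes at all is a separate question.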
> On Dec 18, 2020, at 07:13, Katarina Eriksson <gmym at coopdot.com> wrote: > > Help! I'm getting pulled in! Katarina! Thanks for dropping by! Welcome to the party! > I'm assuming including me here was intentional. I truly can't tell if that is an accurate description of my position. Thanks for noticing. Timing is everything. See Cunningham's Law. > "I used to be in the US-ASCII only camp" refers to me no longer thinking requiring everything to be encoded to pass as US-ASCII is the best idea. This is me moving away from the status quo towards a possible compromise. Or am I missing where we're going? Indeed, this is the crux of the issue, the notorious IRI vs. URI chasm: native UTF vs ASCII encoded. > I see no reason to %-encode those non-ASCII bytes in the client or anywhere else. Surely I have missed something obvious somewhere. Can anyone help me? Exactly. As it stands, the spec mandates URIs -therefore ASCII only- making UTF IRIs V E R B O T E N! NICHT GUT! NOT COMPLIANT! Now that we all took time to survey the lay of the land, the question is: should the specification be amended to refer to IRI (urn:ietf:rfc:3987), instead of URI (urn:ietf:rfc:3986)? As simple as that. That's all folks!
>> => teddybearoftheyear.com/vote?Иван%20Грозный Голосуй за меня! [1] [2] [3] On a technical note: in some libraries you may have to split the URL and encode the path, fragment, query string and parameters separately. Otherwise the separators (#, ?, ;) may be encoded as part of the path. ... For me as an implementer this is starting to look a bit frustrating. I want to please people, but I also don't want to have all the lifejoy sucked out of me because I have to twist myself into knots in order to properly understand and implement the protocol. I'll bow out of this discussion now and follow it from a distance, hoping that whatever decision is reached isn't too complicated to implement. Cheers, ew0k
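The separator problem is easy to reproduce with urllib.parse: quoting a whole URL escapes ':' and '?' along with everything else, while splitting first keeps them intact. A minimal sketch follows; encode_url() and the example URL are made up, and the sets of "safe" characters per component are a simplification rather than a full reading of RFC 3986.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def encode_url(url):
    """Split a URL, escape each component on its own, then reassemble."""
    parts = urlsplit(url)
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    fragment = quote(parts.fragment, safe="%")
    return urlunsplit((parts.scheme, parts.netloc, path, query, fragment))


url = "gemini://example.org/caf\u00e9.gmi?q=\u00e9t\u00e9#d\u00e9but"

print(quote(url))       # naive: the scheme colon and '?' get escaped too
print(encode_url(url))  # per component: separators survive
```

The host is passed through unchanged here; internationalized hostnames need punycode rather than percent escapes.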
> On Dec 18, 2020, at 10:59, Björn Wärmedal <bjorn.warmedal at gmail.com> wrote: > > I'll bow out of this discussion now and follow it from a distance, > hoping that whatever decision is reached isn't too complicated to > implement. The /mild/ complexity arises from URI escaping rules. If anything, IRIs simplify things a bit for all concerned. Consider the following URIs (as per the current spec, which all MUST support one way or another): => gemini://rabbit.hole/bunny%20%26%20carrot.gmi Bunny & Carrot: Down The Rabbit Hole, a journey. => gemini://rabbit.hole/%F0%9F%90%B0%20%26%20%F0%9F%A5%95.gmi 🐰 & 🥕: Down The Rabbit Hole, a journey. => gemini://xn--yn8h.hole/%F0%9F%90%B0%20%26%20%F0%9F%A5%95.gmi 🐰 & 🥕: Down The Rabbit Hole, a journey. vs. IRIs: => gemini://rabbit.hole/bunny%20%26%20carrot.gmi Bunny & Carrot: Down The Rabbit Hole, a journey. => gemini://rabbit.hole/🐰%20%26%20🥕.gmi 🐰 & 🥕: Down The Rabbit Hole, a journey. => gemini://🐇.hole/🐰%20%26%20🥕.gmi 🐰 & 🥕: Down The Rabbit Hole, a journey.
Katarina Eriksson <gmym at coopdot.com> writes: > > Anyway, the request reaches the server. "%20" become space and "%2b" become > plus. I see no reason why it would be hard to also convert > "%F0%9F%90%87" into bytes, so I will assume it isn't and wait for server > software programmers to tell me how wrong I am. > > So now we have a string of bytes that we can use to fetch the bunny file. > Wait. What happened with the case where the bunny isn't %-encoded? Why > can't servers just blindly accept non-ASCII bytes as is? Is it a library > thing? Anyway, I really should test this in a bunch of languages but I'm > writing this on my phone on my way to work, so instead I present you this > pseudo code: > > *ELIDED TEXT HERE* > > If these 3 lines are all true for the server software, I see no reason to > %-encode those non-ASCII bytes in the client or anywhere else. Surely I > have missed something obvious somewhere. Can anyone help me? The Space Age server uses java.net.URI to parse incoming URI strings into their component parts. It can accept URIs with unencoded UTF-8 path, query, and fragment parts (except that spaces must be percent-encoded as %20). Unicode is not allowed in the hostname part. One more data point for you, Gary -- GPG Key ID: 7BC158ED Use `gpg --search-keys lambdatronic' to find me Protect yourself from surveillance: https://emailselfdefense.fsf.org ======================================================================= () ascii ribbon campaign - against html e-mail /\ www.asciiribbon.org - against proprietary attachments Why is HTML email a security nightmare? See https://useplaintext.email/ Please avoid sending me MS-Office attachments. See http://www.gnu.org/philosophy/no-word-attachments.html
> On Dec 18, 2020, at 18:16, Gary Johnson <lambdatronic at disroot.org> wrote: > > The Space Age server uses java.net.URI to parse incoming URI strings > into their component parts. It can accept URIs with unencoded UTF-8 > path, query, and fragment parts (except that spaces must be > percent-encoded as %20). Unicode is not allowed in the hostname part. Perhaps of interest: xbib/net: Sane URL, URI, IRI implementations for Java https://github.com/xbib/net
Petite Abeille <petite.abeille at gmail.com> writes: > Perhaps of interest: > xbib/net: Sane URL, URI, IRI implementations for Java > https://github.com/xbib/net Thanks for the link. I gave it a shot, but it appears to be buggy and doesn't have any documentation. I ended up reading through the source code on Github to figure out how to call its API, but sadly it looks like it can't correctly identify the host part of the incoming string. Instead, it thinks it is part of the path, which is obviously no good. space-age.requests> (parse-url "gemini://🐇.mozz.us/%20🐇.gmi?some-key=🐇&🐇=some-value#🐇-fragment") {:path "/🐇.mozz.us/ 🐇.gmi", :raw-query "some-key=%F0%9F%90%87&%F0%9F%90%87=some-value", :fragment "🐇-fragment", :params ["some-key=🐇" "🐇=some-value"], :port 1965, :host "", :raw-fragment "%F0%9F%90%87-fragment", :uri "gemini://🐇.mozz.us/%20🐇.gmi?some-key=🐇&🐇=some-value#🐇-fragment", :query "some-key=🐇&🐇=some-value", :raw-path "/%F0%9F%90%87.mozz.us/%20%F0%9F%90%87.gmi", :raw-host "", :scheme "gemini"} Oh well, I guess I'll stick with java.net.URI for now. Cheers, Gary
> On Dec 19, 2020, at 00:38, Gary Johnson <lambdatronic at disroot.org> wrote: > > Thanks for the link. I gave it a shot, but it appears to be buggy Right. It appears to know a fixed list of schemes. See SchemeRegistry: https://github.com/xbib/net/blob/master/net-url/src/main/java/org/xbib/net/scheme/SchemeRegistry.java#L18 Perhaps one needs to register one's own to extend it. Perhaps something similar to HttpScheme: https://github.com/xbib/net/blob/master/net-url/src/main/java/org/xbib/net/scheme/HttpScheme.java#L27
On Fri, 18 Dec 2020 07:13:24 +0100 Katarina Eriksson <gmym at coopdot.com> wrote: > Domain name resolution is outside of the scope of the Gemini specification, > we don't know if it can handle UTF-8 or not. If the visitor's network > administrator has set up name resolution to accept UTF-8, they should > probably also accept the punycoded version for compatibility. IDNA moves what is ideally part of DNS into the application layer, which is what the A stands for. It was somehow decided when adopting this standard that it was better that every application that wants to use a hostname should implement IDNA than to fix the underlying problem in DNS. This probably helped adoption early on because ISPs could largely leave the cards in their card houses as they were, but creates more of a burden for application developers, which in the long run is more expensive. So no, at least IDNA has to be supported by the application. > Why can't servers just blindly accept non-ASCII bytes as is? A fully compliant RFC 3986 implementation can't accept non-ASCII characters. If that's what you have, you'll have to rewrite or replace it. RFC 3987 covers this, but it's a bit more specific than blindly accepting non-ASCII bytes. The sections on the comparison ladder are a good read for an overview of what may need to be implemented to avoid false negative matching. -- Philip
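Since the hostname conversion lands on the application, the virtual-host question raised earlier in the thread (does a configured Unicode name match the punycoded name on the wire?) can be sketched as "convert both sides to the ASCII form, then compare". The sketch below uses Python's built-in punycode codec per label. to_a_label() and same_host() are made-up names, and the mapping and validation steps that real IDNA requires are deliberately skipped, so treat it as an illustration only.

```python
def to_a_label(host):
    """Turn each non-ASCII label into its xn-- form.

    Simplification: no IDNA mapping, normalization or validation is done.
    """
    labels = []
    for label in host.split("."):
        if label.isascii():
            labels.append(label.lower())
        else:
            labels.append("xn--" + label.encode("punycode").decode("ascii"))
    return ".".join(labels)


def same_host(configured, requested):
    """Compare a configured virtual host with the host in a request."""
    return to_a_label(configured) == to_a_label(requested)


print(to_a_label("\U0001F430.mozz.us"))
print(same_host("\U0001F430.mozz.us", "xn--4o8h.mozz.us"))
```

Comparing in the ASCII form works in both directions, whichever spelling the administrator put in the configuration file.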
---