💾 Archived View for geminiprotocol.net › history › phlog › gemini-maps-3.gmi captured on 2024-07-08 at 23:48:11. Gemini links have been rewritten to link to archived content

Gemini maps 3

(originally posted in Gopherspace on 2019-06-25)

Whew! Simplicity ain't simple, folks.

Sean is not happy with the newly proposed [text|link] syntax. Nevertheless, there are now items using that syntax up at gemini.conman.org. In fact, both the new syntax and the old tab-based syntax are used there, which is smart! Clients written for one will just display the other as raw text without any kind of error, so this is a wonderful way to maintain compatibility during the transitional period. I really appreciate Sean being a good sport and including the new links even though he's not a fan.

The objection to the new syntax is that [, ] and | are ASCII printable characters which a user might reasonably want to include in the text portion of a link, or which could even conceivably need to appear in the link part, if they were used in, e.g. a filename. It would be easy to say "those are dumb characters to use in a filename, don't do that", but arbitrary restrictions like this are unattractive and it would be nice to avoid them where possible.

It seems obvious that *any* proposed syntax is going to have some disadvantage or limitation which means some won't like it. The sensible thing to do is to consider the severity and frequency of these problems and choose a syntax whose inevitable shortcomings are as mild and as rare as possible.

I'm still very convinced that tabs are a bad idea due to the impossibility of unambiguously parsing an intended link with your eyeballs. They're also problematic because when using a mouse to copy and paste text in a terminal environment, the tab/space distinction can easily get lost. I can very easily see these kinds of issues causing confusion and frustration and broken links time and time again. I think that syntax would ultimately cause more pain more often than a limitation like "thou shalt not use these three slightly unusual characters in your link text or filenames", so I still think this latest step has been in the right direction.

And actually, I'm not sure the problem is as bad as "thou shalt not...".

The new spec says that a link line must begin with a [ and end with a ]. That doesn't conflict at all with arbitrary additional instances of either character. My two Python clients recognise and parse links like this:

if line[0] == "[" and line[-1] == "]" and line.count("|") == 1:
	text, link = line[1:-1].split("|")

That code will happily handle a text (or link) component of "[[]][lolz!][[]]", no worries. And not because I was smart and wrote clever code that could handle it. I wrote the most straightforward code possible for this and it just happens to be totally robust to [ and ] characters appearing anywhere else. So, I think the problem is in fact limited entirely to |.

Using a | anywhere in the text or link component will result in my code above not recognising a line as a link, so we need to think about that.
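
To make both behaviours concrete, here is a minimal sketch that wraps the check above in a function. The name parse_link_strict and the example.org URLs are made up for illustration, and startswith/endswith are used instead of indexing only so that an empty line doesn't raise an error.

```python
def parse_link_strict(line):
    """Return (text, link) if line looks like [text|link], else None.

    Hypothetical helper for illustration: exactly one | is allowed.
    """
    if line.startswith("[") and line.endswith("]") and line.count("|") == 1:
        text, link = line[1:-1].split("|")
        return text, link
    return None

# Extra brackets inside the text are harmless...
print(parse_link_strict("[[[]][lolz!][[]]|gemini://example.org/]"))
# ...but a second | means the line is not recognised as a link at all.
print(parse_link_strict("[cat | grep|gemini://example.org/]"))
```

The first call returns the text and link as a pair, the second returns None.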

The "link" part of [text|link] is supposed to be a URL. Previously I said that absolute URLs were definitely okay and I would think about relative ones. Now that I've written a 100 line client which handles relative URLs, I'm convinced that supporting them is not really very difficult at all, so let's say they're allowed.
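
For what it's worth, here is a rough sketch of how relative links might be resolved in Python. It is not the code from the client mentioned above, and the host example.org and the helper name resolve are invented for the example. The one wrinkle is that urllib.parse.urljoin only resolves relative references for schemes it has been told about, so "gemini" is registered first.

```python
import urllib.parse

# urljoin() only resolves relative references for schemes listed in these
# module-level lists; "gemini" is not there by default, so register it.
for known in (urllib.parse.uses_relative, urllib.parse.uses_netloc):
    if "gemini" not in known:
        known.append("gemini")

def resolve(base_url, link):
    """Resolve a link found in a document that was fetched from base_url."""
    return urllib.parse.urljoin(base_url, link)

base = "gemini://example.org/docs/index.gmi"  # hypothetical current document
print(resolve(base, "faq.gmi"))                  # sibling file
print(resolve(base, "../about.gmi"))             # one directory up
print(resolve(base, "gemini://other.example/"))  # absolute URLs pass through
```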

Absolute URLs are defined in RFC1738, and the | character (and, incidentally, the [ and ] characters) are specified there as being "unsafe" characters which must be escaped in URLs. Relative URLs have their own spec in RFC1808, which doesn't explicitly discuss "unsafe" characters, but based on a cursory skim (and expectation of reasonable sanity in these RFCs) I don't think those characters suddenly become safe in a relative context. So, it seems to me that it is actually totally invalid to use a | in the right-hand part of a link with the new syntax.

In passing - I'm not thrilled that invoking those URL RFCs means that a really robust Gemini client/server is now probably going to have to do all sorts of fiddly escaping/unescaping to cover edge cases. But, using a non-standard definition of URLs is a sure road to madness, and there will be existing libraries for this stuff anyway.
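
As an illustration of leaning on an existing library for this, Python's standard urllib.parse module already does the escaping in both directions. This is only a sketch, and the filename below is a made-up worst case containing all three troublesome characters.

```python
from urllib.parse import quote, unquote

# A hypothetical filename containing [, ] and |.
filename = "notes[draft]|v2.txt"

escaped = quote(filename)   # percent-encodes the unsafe characters
print(escaped)

# Unescaping gives back exactly the original name.
print(unquote(escaped) == filename)
```

The escaped form contains no literal [, ] or |, so it can sit in the link part of a line without confusing a parser.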

Now, strictly speaking, if there are guaranteed to be no |s in the right-hand part of a link, this totally disambiguates any use of them in the left-hand part: you just take everything after the *last* | as the URL and everything before it as the text. The following code:

if line[0] == "[" and line[-1] == "]" and "|" in line:
	text, link = line[1:-1].rsplit("|", 1)

should allow arbitrary use of [, ] and | in the text part of a link without problems, as long as the link part is a valid URL with any occurrence of | escaped.
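
A quick sketch to check that claim, with the condition wrapped in a function. The name parse_link and the example.org URLs are hypothetical, and the link parts are assumed to be valid URLs with any | already escaped as %7C.

```python
def parse_link(line):
    """Hypothetical helper: split [text|link] on the last | only."""
    if line.startswith("[") and line.endswith("]") and "|" in line:
        text, link = line[1:-1].rsplit("|", 1)
        return text, link
    return None

# A | in the text part no longer prevents recognition.
print(parse_link("[cat file | grep foo|gemini://example.org/pipes.txt]"))
# Several of them are fine too, and an escaped | in the URL stays put.
print(parse_link("[a|b|c|gemini://example.org/a%7Cb]"))
```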

But I'm not really thrilled about this, because it means that you need to be just a little bit careful when parsing links. The above makes it look easier than I think it will be in the average case - Python's rsplit method makes this trivial but some languages lack an equivalent and in general this approach is probably going to add a line of code or three.
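
To give a feel for the extra work, here is a sketch of the same split written without rsplit, using only a search for the last | and two slices. Most languages have some equivalent of that search (lastIndexOf in JavaScript, strrchr in C). The function name is made up for the example.

```python
def split_on_last_pipe(body):
    """Split body on its final |, or return None if there is no | at all."""
    i = body.rfind("|")   # index of the last |, or -1 if none
    if i == -1:
        return None
    return body[:i], body[i + 1:]

body = "cat file | grep foo|gemini://example.org/pipes.txt"
# Same result as the one-line rsplit version.
print(split_on_last_pipe(body) == tuple(body.rsplit("|", 1)))
```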

It's still, though, much better than saying "if you want to use a | in your link text, escape it as \|". I surely don't want to go down that route.

The alternative is to say "thou shalt not use |s in your link names", which would allow a simple "ordinary split" on the one and only | character. This would be kind of a shame, and I'd rather not have this kind of limitation in place. That said, I don't think this particular limitation can be considered deal-breakingly bad. The use of | as a separator in linky contexts is not uncommon, so people using geomyidae and MediaWiki are living with that restriction, and they don't precisely seem miserable about it.

I don't know which of these is the lesser of two evils. Opinions very welcome!

It's true that the original tab-based syntax was easier to parse - and also much easier to write. It's certainly a selling point that the tab key is *right there* on the keyboard, and even non-technical folk who don't know that "pipe" means | know exactly where tab is. Just writing that makes me feel bad about the new syntax! But one of the lessons I think that gopher has for us is that there's such a thing as what ratfactor called "the wrong kind of simplicity" - simplicity which results in ambiguity and/or disproportionate extra work. I think that a link syntax which is very quick and easy to write and parse but also enables the user to very quickly and easily make mistakes which are hard to spot and fix might just be the wrong kind of simplicity.

I regret immediately implementing the new syntax just because I personally found the argument for it very compelling, without taking longer to check in with the other people with a stake in this, assuming that it would be unobjectionable. In the spirit of my recent "slowing down" post, I'm not going to immediately scramble to adopt a third syntax. Instead, let's talk about it and try to build consensus. But I don't want us to get bogged down forever on this one detail. Sean, if the above discussion has adequately addressed your concerns, let me know and we'll stick with the new syntax. If you really still don't like it, we can consider yet a third option (I'm not willing to go back to the first tab option due to its severe shortcomings), but I insist that the third one will be the last one, so that we can move on to more interesting and important questions.

If anybody has any proposals, please write them up and let me know. If you phlog about it, please share the link with me via email or Mastodon because this conversation has extended beyond the small subspace of gopherspace which I routinely keep tabs on and I may not see it. I will not consider any proposal which has the whitespace ambiguity problem of our first syntax, or which is even a little bit more difficult to parse correctly than the second syntax. Please don't propose a change just because you think different syntax would "look nicer". A new proposal should be motivated entirely by practical considerations regarding ease of writing (which I think probably has to rule out otherwise interesting options like the "separator" ASCII control characters), parsing (please, nothing which makes people even *consider* using a regex), lack of ambiguity, etc.

Here's something to ponder: using a printable ASCII character like | to separate the text from the link actually isn't necessary, as unencoded whitespace is forbidden in valid URLs. So, once a line has been identified as being a link, it's possible to just split it on whitespace and take the last component as the URL. The following is unambiguous, right?

[Mare Tranquillitatis People's Circumlunar Zaibatsu gemini://zaibatsu.circumlunar.space]

This would allow | and indeed anything else to appear in the text part (or even, RFC-breakingly, in the URL), no ambiguity problems. Given this insight, in fact, the only thing we need on top of <NAME><WHITESPACE><LINK> is some "garnish" to aid in recognition as a link. This could be something bracketing that whole construct, as above, or just a distinctive character combination in front. Anything below seems like it would work?
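
A minimal sketch of the whitespace idea applied to the bracketed form, assuming the URL contains no unencoded whitespace. The function name is hypothetical, and returning None for a line with a URL but no text is just one possible choice.

```python
def parse_ws_link(line):
    """Hypothetical parser for [text<whitespace>url]: the last field is the URL."""
    if not (line.startswith("[") and line.endswith("]")):
        return None
    # Split once, from the right, on the last run of whitespace.
    parts = line[1:-1].rsplit(None, 1)
    if len(parts) != 2:
        return None   # only one field, or nothing at all
    text, url = parts
    return text, url

print(parse_ws_link(
    "[Mare Tranquillitatis People's Circumlunar Zaibatsu "
    "gemini://zaibatsu.circumlunar.space]"
))
# A | in the text is now just another character.
print(parse_ws_link("[cat file | grep foo gemini://example.org/pipes.txt]"))
```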

#! Mare Tranquillitatis People's Circumlunar Zaibatsu gemini://zaibatsu.circumlunar.space
@~ Mare Tranquillitatis People's Circumlunar Zaibatsu gemini://zaibatsu.circumlunar.space
=> Mare Tranquillitatis People's Circumlunar Zaibatsu gemini://zaibatsu.circumlunar.space

This is very easy to recognise. Instead of:

if line[0] == "[" and line[-1] == "]" and line.count("|") == 1:

one would need just, e.g.:

if line.startswith("#! "):

That's actually *much* nicer to look at, IMHO. This syntax is also very obviously not intended to be used in-line, which is another advantage over [text|link]. I have to say, to me this kind of syntax seems to avoid the worst problems of both the earlier proposals. It also feels "lighter" somehow, I guess because all the actual syntax is concentrated in one place on the left, instead of being spread all along the line. I'm not attached to any particular pair of characters at the start. Can people see real problems with this approach?
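
For anyone who wants to experiment, here is a sketch of pulling links out of a whole document written this way. Everything in it is an assumption made for illustration: the "=> " prefix is an arbitrary pick from the candidates above, the function name is invented, and lines consisting of a prefix plus a bare URL with no text are simply skipped.

```python
def extract_links(document, prefix="=> "):
    """Return a list of (text, url) pairs for every link line in document."""
    links = []
    for line in document.splitlines():
        if not line.startswith(prefix):
            continue  # ordinary text line
        # The last whitespace-separated field is the URL, the rest is the text.
        parts = line[len(prefix):].rsplit(None, 1)
        if len(parts) == 2:
            links.append((parts[0], parts[1]))
    return links

sample = (
    "Welcome to the moon.\n"
    "=> Mare Tranquillitatis People's Circumlunar Zaibatsu "
    "gemini://zaibatsu.circumlunar.space\n"
    "Pipes | and [brackets] in ordinary text are left alone.\n"
    "=> Notes [draft] | v2 gemini://example.org/notes.txt\n"
)
for text, url in extract_links(sample):
    print(url, "<-", text)
```

Swapping in a different marker only means passing a different prefix, e.g. extract_links(sample, prefix="#! ").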