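I'm still enjoying working with Go. I'm still working on Oddµ. Here's what's been on my mind this morning: How do I link fediverse accounts to their profile pages? Example: @alex@alexschroeder.ch. There are two problems I can see: recognizing where an account name ends when the domain name is internationalized, and finding out where the profile page actually lives.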
Let's take a look.
The first problem is the one that I am unable to crack. I think what would need to happen is for me to not look at bytes but do actual UTF-8 decoding. Right now, once I know that we're looking at an `@`, what I'm doing is this: I skip forward over `a`-`z`, `A`-`Z`, `0`-`9`, `.` and `-` and exactly one other `@`, then I skip backwards over any trailing `.`, and I assume that this is correct. In the case of the account name above, this makes sure that the `.` ending the sentence is not included. For international domains, that doesn't work: the bytes for `schröder.ch` include 0xc3 and 0xb6, which encode the `ö` in UTF-8.
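Roughly, the byte-based scan described above could be sketched like this. It is a minimal sketch and not the actual Oddµ code; the names `isAccountByte` and `accountEnd` are made up for illustration, and the checks for an empty user or domain part are an extra assumption.

```go
package main

import "fmt"

// isAccountByte reports whether b is one of the ASCII bytes that the
// scan skips over: a-z, A-Z, 0-9, '.' and '-'.
func isAccountByte(b byte) bool {
	return b >= 'a' && b <= 'z' ||
		b >= 'A' && b <= 'Z' ||
		b >= '0' && b <= '9' ||
		b == '.' || b == '-'
}

// accountEnd expects s to start with '@' and returns the length of the
// account name at the start of s, or 0 if there is none.
func accountEnd(s []byte) int {
	if len(s) == 0 || s[0] != '@' {
		return 0
	}
	i, at := 1, 0 // at is the index of the second '@', 0 if not seen yet
	for ; i < len(s); i++ {
		if s[i] == '@' && at == 0 {
			at = i
		} else if !isAccountByte(s[i]) {
			break
		}
	}
	// Skip backwards over trailing periods, such as the end of a sentence.
	for i > 0 && s[i-1] == '.' {
		i--
	}
	// Require a non-empty user and a non-empty domain.
	if at < 2 || i <= at+1 {
		return 0
	}
	return i
}

func main() {
	for _, s := range []string{"@alex@alexschroeder.ch. There", "@alex@schröder.ch"} {
		fmt.Printf("%q\n", s[:accountEnd([]byte(s))])
	}
}
```

The first input yields `@alex@alexschroeder.ch` without the trailing period. The second one stops at the first byte of the `ö` and yields `@alex@schr`, which is exactly the problem with international domains.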
I guess I could add "skip forward over any byte that's above 127" to my list. This would work for `schröder.ch` but it wouldn't always end correctly. Does `schröder.ch…` include the ellipsis? I'd say it does not. The ellipsis is encoded by the bytes 0xe2, 0x80 and 0xa6, all of which are bigger than 127. In other words, I need to decode the bytes and then figure out whether the resulting characters are valid in a domain name.
This task is nearly impossible to solve. If you want to stare into the abyss, read RFC 5890 or RFC 5892 Appendix B. I think that trusted (big ask!) wiki authors wanting to link to a profile page would expect any letters and digits plus the period and the hyphen to be valid domain name constituents.
So what I really need is Unicode categories. I'm guessing that for a more-or-less valid domain name, as entered by a nice person copying and pasting a valid account name from somewhere, the rules are: "letters, digits, period and hyphen"? I know you can't register some of these domains, since the allowed set can vary by registrar and those rules can change over time; some characters are disallowed because they can be used for phishing attacks; and some characters may not be mixed for the same reason. But in our case, none of that matters: if the resulting Internationalized Resource Identifier (IRI) doesn't point to a valid profile, that's not a problem.
But of course there are exceptions. 😭
ZERO WIDTH NON-JOINER "may occur in a formally cursive script (such as Arabic)" … ZERO WIDTH JOINER "may occur in Indic scripts in a consonant-conjunct context" … and so on. It gets very specific! MIDDLE DOT "Between 'l' (U+006C) characters only, used to permit the Catalan character ela geminada to be expressed." – RFC 5892, Appendix A
There really is no easy way out.
Perhaps all I can hope for is to improve this slowly, over time.
So the smallest next improvement should be the inclusion of all the Unicode letters. Let's see how far that gets us.
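Here is what that next step might look like: decode runes instead of looking at bytes, and accept anything that Unicode classifies as a letter or digit, plus the period and the hyphen. This is only a sketch under those assumptions. The names `isAccountRune` and `accountEnd` are hypothetical, the same character set is applied to the user part and the domain part for simplicity, and none of the contextual rules from RFC 5892 are handled.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// isAccountRune accepts Unicode letters and digits plus '.' and '-'.
func isAccountRune(r rune) bool {
	return unicode.IsLetter(r) || unicode.IsDigit(r) || r == '.' || r == '-'
}

// accountEnd expects s to start with '@' and returns the length in bytes
// of the account name at the start of s, or 0 if there is none.
func accountEnd(s string) int {
	if len(s) == 0 || s[0] != '@' {
		return 0
	}
	i, at := 1, 0 // at is the byte index of the second '@', 0 if not seen yet
	for i < len(s) {
		// Invalid UTF-8 decodes to utf8.RuneError, which is not a letter,
		// so the scan simply stops there.
		r, size := utf8.DecodeRuneInString(s[i:])
		if r == '@' && at == 0 {
			at = i
		} else if !isAccountRune(r) {
			break
		}
		i += size
	}
	// Skip backwards over trailing periods, such as the end of a sentence.
	for i > 0 && s[i-1] == '.' {
		i--
	}
	// Require a non-empty user and a non-empty domain.
	if at < 2 || i <= at+1 {
		return 0
	}
	return i
}

func main() {
	for _, s := range []string{"@alex@schröder.ch… and so on", "@alex@alexschroeder.ch."} {
		fmt.Printf("%q\n", s[:accountEnd(s)])
	}
}
```

With this, `@alex@schröder.ch…` ends before the ellipsis, since `…` is punctuation and not a letter. Characters like ZERO WIDTH JOINER or MIDDLE DOT would still end the account name too early.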
Ah. Now I remember. I wanted to talk about two things.
The second thing I wanted to talk about is Webfinger. The problem is this: If you have an account like @alex@alexschroeder.ch then you can't just turn that into https://alexschroeder.ch/users/alex. It works most of the time, for sure. The popular servers implement that because Mastodon does it. The form https://alexschroeder.ch/@alex is currently the redirect target of many servers. Sadly, there are servers that support only the first form and not the second, and there are servers that don't accept either of them.
https://alexschroeder.ch/users/alex
https://alexschroeder.ch/@alex
What you are supposed to do is slow and requires caching in some form. You are supposed to take the domain part of the account name, and do a webfinger query for the account resource. The response tells you where the actual profile page is. Looking at https://alexschroeder.ch/.well-known/webfinger?resource=acct:alex@alexschroeder.ch gets you the JSON document that tells you where the profile page resides: https://social.alexschroeder.ch/@alex.
https://alexschroeder.ch/.well-known/webfinger?resource=acct:alex@alexschroeder.ch
https://social.alexschroeder.ch/@alex
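A lookup along these lines could be sketched as follows. The function names `webfingerURL`, `profilePage` and `lookup` are made up for this example, and the JSON in `main` is a shortened, hand-written sample shaped like a Webfinger response, not a copy of what the server actually returns. The sketch assumes that the profile page is the link with the relation `http://webfinger.net/rel/profile-page`, which is what Mastodon-style servers use.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"net/http"
	"net/url"
	"strings"
)

// profilePageRel is the link relation that marks the human-readable profile.
const profilePageRel = "http://webfinger.net/rel/profile-page"

// jrd holds the only part of the Webfinger JSON document we care about.
type jrd struct {
	Links []struct {
		Rel  string `json:"rel"`
		Href string `json:"href"`
	} `json:"links"`
}

// webfingerURL builds the query URL for an account like "alex@alexschroeder.ch".
// A leading "@" is accepted and dropped.
func webfingerURL(account string) (string, error) {
	account = strings.TrimPrefix(account, "@")
	_, domain, ok := strings.Cut(account, "@")
	if !ok || domain == "" {
		return "", fmt.Errorf("no domain in account %q", account)
	}
	return "https://" + domain + "/.well-known/webfinger?resource=" +
		url.QueryEscape("acct:"+account), nil
}

// profilePage extracts the profile page link from a Webfinger response body.
func profilePage(body []byte) (string, error) {
	var doc jrd
	if err := json.Unmarshal(body, &doc); err != nil {
		return "", err
	}
	for _, l := range doc.Links {
		if l.Rel == profilePageRel && l.Href != "" {
			return l.Href, nil
		}
	}
	return "", errors.New("no profile page link found")
}

// lookup does the actual network request. It is not called in main.
func lookup(client *http.Client, account string) (string, error) {
	u, err := webfingerURL(account)
	if err != nil {
		return "", err
	}
	resp, err := client.Get(u)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("webfinger query failed: %s", resp.Status)
	}
	// Limit how much we are willing to read from a server we don't know.
	body, err := io.ReadAll(io.LimitReader(resp.Body, 1<<20))
	if err != nil {
		return "", err
	}
	return profilePage(body)
}

func main() {
	// Hand-written sample, shaped like a Webfinger response.
	sample := []byte(`{"subject":"acct:alex@alexschroeder.ch","links":[
		{"rel":"http://webfinger.net/rel/profile-page","type":"text/html",
		 "href":"https://social.alexschroeder.ch/@alex"}]}`)
	u, _ := webfingerURL("@alex@alexschroeder.ch")
	fmt.Println(u)
	fmt.Println(profilePage(sample))
}
```

The generated query URL percent-encodes the `:` and the `@` in the resource parameter. That is equivalent to the unescaped URL shown above.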
I am loath to do these lookups because then we get into the topic of cache expiry and cache storage. As a first step, I could just keep it in memory and rely on the occasional restart to clear the cache, and at a later stage start to expire cache entries older than a few days. I can't say I like it.
I also don't want to do this while rendering a page because the page loading will be slow every time you have a new account name. So what I'm doing right now is this: I use the guess I mentioned above, a URI of the form https://alexschroeder.ch/users/alex, and I start a background process to do the lookup. The lookup then populates a map of accounts to URIs and overwrites the guess. A subsequent reload of the page then has the improved URI.
https://alexschroeder.ch/users/alex
On a small site without a lot of traffic, I think it's OK not to persist this map. Restarts automatically clear the map.
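The guess-first, fix-later idea can be sketched like this. It is an illustration under assumptions, not the actual Oddµ code: the type `accounts` and the functions `guess`, `uri` and `lookup` are hypothetical names, `resolve` stands in for the Webfinger query, and the `sync.WaitGroup` exists only so that the demo can wait for the background lookup.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// accounts maps account names like "alex@alexschroeder.ch" to profile URIs.
// Nothing is persisted: a restart starts with an empty map.
type accounts struct {
	mu      sync.Mutex
	uris    map[string]string
	resolve func(account string) (string, error) // stand-in for the Webfinger lookup
	wg      sync.WaitGroup                       // lets the demo wait; a server would not need it
}

func newAccounts(resolve func(string) (string, error)) *accounts {
	return &accounts{uris: make(map[string]string), resolve: resolve}
}

// guess turns "user@domain" into https://domain/users/user.
// It assumes the account has already been recognized as well-formed.
func guess(account string) string {
	user, domain, _ := strings.Cut(account, "@")
	return "https://" + domain + "/users/" + user
}

// uri returns the best URI known right now. The first time an account is
// seen, it stores the guess and starts a lookup in the background.
func (a *accounts) uri(account string) string {
	a.mu.Lock()
	defer a.mu.Unlock()
	if u, ok := a.uris[account]; ok {
		return u
	}
	u := guess(account)
	a.uris[account] = u
	a.wg.Add(1)
	go a.lookup(account)
	return u
}

// lookup overwrites the guess if the lookup succeeds; on failure the
// guess stays in place.
func (a *accounts) lookup(account string) {
	defer a.wg.Done()
	u, err := a.resolve(account)
	if err != nil {
		return
	}
	a.mu.Lock()
	a.uris[account] = u
	a.mu.Unlock()
}

func main() {
	a := newAccounts(func(string) (string, error) {
		// A real resolver would do the Webfinger query here.
		return "https://social.alexschroeder.ch/@alex", nil
	})
	fmt.Println(a.uri("alex@alexschroeder.ch")) // the guess
	a.wg.Wait()
	fmt.Println(a.uri("alex@alexschroeder.ch")) // the looked-up URI
}
```

The mutex matters because page rendering and background lookups touch the map from different goroutines. Since an account is only looked up when it is not in the map yet, a failed lookup is not retried until the next restart.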
Let's see if this works. This page has an account where the guess is wrong: https://alexschroeder.ch/users/alex doesn't exist. But within a few seconds, this should be fixed.
https://alexschroeder.ch/users/alex
#Oddµ #Software #Programming