Adventures in Utext

There is one point on the ASCII (American Standard Code for Information Interchange) ↔︎ JS (JavaScript) spectrum that I haven’t seen, and it’s one that, as I use Unicode in more complex ways on Gwern.net and have learned how many obscure features or characters Unicode has, I increasingly think has been neglected: only UTF (Unicode Transformation Format)-8 text rendered by a monospace font. Not ASCII, not a weird subset of SGML (Standard Generalized Markup Language), not troff, not raw terminal codes, not bitmaps encoded in ASCII—just UTF-8. This document format does only what pure Unicode text can do—but does everything that pure Unicode can do, which turns out to be a lot. What if we take Unicode literally, but not seriously?
Your typical plain text output strips all formatting. At the most ambitious, it might have a Unicode superscript or fraction. But we can do so much more!

“Utext: Rich Unicode Documents · Gwern.net [1]”

That was an interesting read (your mileage may vary).

To generate the gopher and Gemini versions of my blog, I parse the HTML (HyperText Markup Language) [2] and generate either plain text (for gopher) or Gemtext (for Gemini). And I'm still not entirely happy with the output. For emphasized text, I would translate that to “*emphasized*”, which is … okay, I guess? And for [DELETED-deleted-DELETED] text, that was harder to deal with, and I ended up with “[DELETED-deleted-DELETED]” text.

There's no excuse for that.

But after reading about Utext, and Unicode's COMBINING SHORT STROKE OVERLAY [3] and COMBINING LOW LINE [4], I thought I might try using those for some typographical niceties that you don't normally get with plain text. And that's when I learned that not all virtual terminals support all of Unicode all that well. And wrapping text is … not that trivial anymore [5].
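
To make the idea concrete, here is a minimal sketch in Python. It is not the code that generates this blog, and the names overlay and display_width are made up for illustration. It appends a combining mark after each character, then shows why wrapping gets harder: the string now has twice as many code points as visible columns, so anything that measures width by code-point count (Python's textwrap uses len(), for instance) will break lines too early. The width estimate is an approximation that only accounts for nonspacing marks and East Asian wide characters; it does not implement the line breaking algorithm in [5].

```python
import unicodedata

STRIKE = "\u0335"     # COMBINING SHORT STROKE OVERLAY
UNDERLINE = "\u0332"  # COMBINING LOW LINE

# General categories that occupy no column of their own:
# Mn = nonspacing mark, Me = enclosing mark.
ZERO_WIDTH = ("Mn", "Me")


def overlay(text: str, mark: str) -> str:
    """Append `mark` after every character that is not itself a
    nonspacing/enclosing mark, leaving newlines untouched."""
    out = []
    for ch in text:
        out.append(ch)
        if ch != "\n" and unicodedata.category(ch) not in ZERO_WIDTH:
            out.append(mark)
    return "".join(out)


def display_width(text: str) -> int:
    """Rough column count for a monospace terminal (an approximation:
    marks count as 0, East Asian wide/fullwidth as 2, the rest as 1)."""
    width = 0
    for ch in text:
        if unicodedata.category(ch) in ZERO_WIDTH:
            continue
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width


if __name__ == "__main__":
    struck = overlay("deleted", STRIKE)
    underlined = overlay("emphasized", UNDERLINE)
    print(struck, underlined)
    # Code points versus estimated columns for the struck-through word.
    print(len(struck), display_width(struck))
```

Whether the result actually looks struck through or underlined depends entirely on the terminal and font, which is exactly the problem described above.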

Ah well. For now, it seems to be working, but it remains to be seen if I like the results.

Update on Friday, December 8th, 2023

I reverted this change due to issues [6].

[1] https://gwern.net/utext

[2] /boston/2021/12/06.2

[3] https://en.wikipedia.org/wiki/Strikethrough#Unicode

[4] https://en.wikipedia.org/wiki/Underscore#Unicode

[5] https://www.unicode.org/reports/tr14/

[6] /boston/2023/12/08.1
