💾 Archived View for tilde.team › ~bp › text.gmi captured on 2024-07-08 at 23:54:30. Gemini links have been rewritten to link to archived content

-=-=-=-=-=-=-

This completely unreadable document aims to be a "just enough" guide on how to understand how computers handle text and why your computer is probably failing you right at this second.

You can probably safely ignore all of the side ramblings. They look like this.

Part the Zeroeth, of What I Assume You Already Know

numbers can be written in base 10, base 2 and base 16
0x10 is sixteen written in base 16 (mathematicians would write that 10₁₆)
0b10 is two written in base 2 (mathematicians would write that 10₂)
some very basic Python
the mental fortitude to either already know stuff about perl 5, to learn stuff about perl 5, or to skip the perl 5 section.

Making sure your gemini client isn't failing you right at this second

Gemini might be a protocol for simple text, but Unicode itself isn't simple. Please make sure this looks at least vaguely like a bunch of symbols arranged in a slashed square arrangement, without holes or question marks:

POTATO
o⟍  🐱
w ⟍  o
o  ⟍ w
각  ⟍o
文字ﾄﾘ

If it doesn't, your gemini client probably cannot render this page properly.

Geminaut, for example, doesn't support enough Unicode to render this page. It looks like it's using some kind of Internet Explorer 8-kinda webview internally, so it's not terribly surprising that emojis don't work.

Worst case scenario, try this:

echo gemini://tilde.team/~bp/text.gmi | ncat -C --ssl tilde.team 1695

A reasonably modern terminal emulator should do a good enough job of rendering this, probably?

...or view this page from the discomfort of your web browser.

Part the First, or Codepages

You've probably seen one of these tables before:

    0 1 2 3 4 5 6 7 8 9  A B C
 
0x    1 2 3 4 5 6 7 8 9  0 = -
1x  / S T U V W X Y Z ⟍  , ( ⟍
2x  - J K L M N O P Q R –0 $ *
3x  + A B C D E F G H I +0 . )
 
The CDC 1604 six-bit code.
⟍ = invalid.

source

It should resemble one of those ASCII tables, except of course it's much smaller, only supports uppercase characters, and the ordering of letters seems... backwards? And why are there not one, not two, but THREE zeroes?

Here's a hint if you're confused on how to read it: the letter A, at the intersection of the row "3x" and the column "1" corresponds to be byte 0x31.

Ah, this is a fun one. Let's say you're writing a wonderful program in COBOL, and you want to represents the price 23.10 in memory. The first step is to ask your coworkers in accounting how many digits before and after the comma they need. Let's say they tell you that you'll never have to worry about representing numbers bigger than 999,999.99, and that two digits after the comma are enough for now. You would then represent 23.10 as the byte string "0 0 0 0 2 3 1 0" (yup: one byte per digit.) Ah, but you need to keep track of whether this is an income or an expense: the number needs to have a sign! In that case, you either want +23.10 or –23.10, which you would represent as either "0 0 0 0 2 3 1 +0" or "0 0 0 0 2 3 1 –0". Indeed, COBOL will print out by default these amounts as " 23.10+" or " 23.10-", with the sign at the end instead of the start of the number. This is a delightful source of hilarity as you try to write modern comptuer code to siphon data out of legacy systems and into modern web pages. Implementing support for arithmetics on numbers represented like this is left as an exercise to the reader, although COBOL does support more computing-friendly ways to represent numbers in memory.

As it happens, this is one of the older ways humans used to represent text for computers to consume: on punch cards. In order to punch in an "A", code 0x31 (binary 0011 0001) you would punch in... a bunch of rectangular holes into an overlong airplane boarding pass-sized card that typically could only fit 80 characters.

This is, incidentally, the historical reason why some people will get upset if you put more than 80 characters in a line of code. Most terminals in the 1980s by default could only show 80 characters per line of text. The actual reason to introduce a limit is readability, but this is why the limit is specifically 80 for code and 72 for git commit messages.

In fact, all the way up until the late 1990s, we were drowning in tables like this: arbitrary tables of assignments of bytes to characters, also known as "codepages." If you open a document with the wrong codepage, hilarity ensues; this was such a common problem in Japan, they coined a name for it: 文字化け (a.k.a. mojibake).

I tried typesetting this document in LyX until I got to 文字化け. Turns out getting LaTeX and friends to mix latin and Japanese in the same document is... basically just not worth the effort? womp.

Part of the problem is that these tables go directly from "character you want to type" to "sequence of bytes that represent that character." As it often happens in computer science, you're coupling those two things too tightly, and introducing an intermediate layer smooths things over and makes everything better. Enter Unicode.

Part the Second: Unicode

The simple idea behind Unicode is it maps characters to numbers, also known as codepoints. How those numbers are then stored to disk is a separate concern. The table starts like it does in seven bit ASCII:

       0 1 2 3 4 5 6 7 8 9 A B C D E F
00000x ------control characters-------
00001x ------control characters-------
00002x   ! " # $ % & ' ( ) * + , - . /
00003x 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
00004x @ A B C D E F G H I J K L M N O
00005x P Q R S T U V W X Y Z [ \ ] ^ _
       ............. etc .............

The very top of a very long table of symbols.

Don't let false familiarity confuse you. The previous table maps the *byte* 0x31 to the character 'A'. This table maps the *number* 0x41 (a.k.a. the number 65) to the character 'A':

$ python3
>>> char(65)
'A'

So while the previous table said the word "POTATO" is encoded with bytes 0x27 0x26 0x12 0x31 0x12 0x26, this other table says the word "POTATO" is made from numbers 0x50 (eighty), 0x4F (seventy-nine), 0x54 (eighty-four), 0x41 (sixty-five), 0x54 (eighty-four again), 0x4F (seventy-nine). Don't be confused just because I'm writing bytes and numbers both in hexadecimal form!

$ python3
>>> [ord(x) for x in "POTATO"]
[80, 79, 84, 65, 84, 79]

As it happens, in earlier versions of the Unicode standard (a.k.a. this table mapping characters to numbers), the objective was to cover only modern scripts. "2^16 characters should be enough for everyone," we thought.

Foreshadowing is a literary device in which a writer gives an advance hint of what is to come later in the story.

Part the Third: UTF

We still need to store those numbers on disk. There's a bunch of ways you can do that:

UTF-8, which maintains backwards compatibility with seven bit ASCII
UTF-16, for those nerds who thought being able to seek to the n-th character of a file is important and useful
UTF-32, for those nerds who thought being able to seek to the n-th character of a file is important and useful, except they got to make this decision AFTER the Unicode consortium decided to use numbers bigger than 2^16
punycode, which maintains backwards compatibility with browsers, DNS records, et al.

Let's go through each one of these options:

UTF-32

UTF-32 basically stores every number as a four-byte bit string. So POTATO becomes:

Symbol            P           O           T           A           T           O
Number      0x00050    0x00004f     0x00054     0x00041     0x00054     0x0004f
Bytes   00 00 00 50 00 00 00 4f 00 00 00 54 00 00 00 41 00 00 00 54 00 00 00 4f

Symbols:       6
Bytes:        24
Bytes/symbol:  4

That's simple, and boring, and as far as English text goes, it's clearly quite inefficient. Let's see how it handles the Japanese word "文字化け" from before:

Symbol          文          字          化          け
Number     0x06587     0x05b57     0x05316     0x03051
Bytes  00 00 65 87 00 00 5b 57 00 00 53 16 00 00 30 51

Symbols:       4
Bytes:        16
Bytes/symbol:  4

Hrm, that's better but not good: half of our bytes are still 00s! "Surely we can do better!"

UTF-16

UTF-16 basically stores every number as a two-byte* bit string. Here's what that looks like:

Symbol        P        O       T       A       T       O
Number  0x00050 0x00004f 0x00054 0x00041 0x00054 0x0004f
Bytes     00 50    00 4f  00 54    00 41   00 54   00 4f

Symbol      文      字      化      け
Number 0x06587 0x05b57 0x05316 0x03051
Bytes    65 87   5b 57   53 16   30 51

             POTATO 文字化け
Symbols:          6        4      
Bytes:           12        8
Bytes/symbol:     2        2

Oh what a joy! Oh what a thing of beauty! Sure, for English text half of our bytes are still zeros, but the Japanese text gets represented in such a clear, cristalline, simple fashion! Surely nothing will go wrong.

Foreshadowing is a literary device in which a writer gives an advance hint of what is to come later in the story.

Indeed, the demo above seems to have been convincing enough a bunch of undoubtedly smart people decided that UTF-16 was just a plain good idea. Windows picked it up as its internal representation of text. Java picked it up: its "char" type only goes up to 2^16. Every single computer that ever choked and broke on an emoji? It was probably using UTF-16 internally. Whoops!

Yeah, it turns out that soon enough a bunch of stuff needed to be encoded with numbers larger than 65,536. For a long time, this stuff was relatively niche, so developers got to bury their heads in the sand and pretend handling this corner case wasn't a priority. Unrelated: have you ever wondered why some technobros are still bitter about emojis?

Symbol       o       w       o      🐱
Number 0x0006f 0x00077 0x0006f 0x1f431
Bytes    00 6f   00 77   00 6f  whoops

Turns out the explanation above is insufficient to explain how UTF-16 works. The handling of cases like this is actually quite complex:

1. Start with the number

2. Take 0x10000 off

3. Split the binary representation of the result in two

4. The first byte is 0xD800 + the top 10 bits of the result

5. The second byte is 0xDC00 + the bottom 10 bits of the result

https://en.wikipedia.org/wiki/UTF-16#Code_points_from_U+010000_to_U+10FFFF

The two resulting pairs of bytes are known as "surrogate pairs."

So in our case:

Symbol: 🐱 
Step 1.  0x1f431         0b0001 1111 0100 0011 0001
Step 2.  0x0f431         0b0000 1111 0100 0011 0001
Step 3.  0x03d           0b0000 1111 01
           0x031                     0b00 0011 0001
Step 4. 0xd83d   0b11 0110 0000 1111 01
Step 5.   0xdc31              0b1101 1100 0011 0001

and thus finally:

Symbol       o       w       o          🐱
Number 0x0006f 0x00077 0x0006f     0x1f431
Bytes    00 6f   00 77   00 6f d8 3d dc 31

Symbols:       4
Bytes:        10
Bytes/symbol:  2.5

This *extremely* elegant algorithm has a bunch of consequences:

If an even byte in an UTF-16-encoded text is 0xd8, 0xd9, 0xda or 0xdb, it represents the first half of a number bigger than 65,536
If an even byte in an UTF-16-encoded text is 0xdc, 0xdd, 0xde or 0xdf, it represents the second half of a number bigger than 65,536
This means Unicode numbers don't get to be four-byte numbers that start with 0xd8, 0xd9, 0xda, 0xdb, 0xdc, 0xdd, 0xde or 0xdf
A byte sequence such as d8 f4 00 6f where you have just the first 16 bits of a surrogate pair is illegal UTF-16, so now you have to be careful about that too
Similarly, you can't have 00 6f dc 31, where the last 16 bits of a surrogate pair don't have the first 16 bits.
Windows, presumably because it was built before surrogate pairs were a thing, will let you use these byte sequences, probably for backwards compatibility.
Java will tell you with a straight face that "owo🐱" is made of 5 chars. The fact 2 of those chars actually only form one symbol is basically your problem to deal with.

Aren't computers so normal and fine?

Oh, you haven't seen the half of it. We're really getting into the weeds here, but I really want to impress how bad UTF-16 turns out to be.

>>> [hex(int(x)) for x in "owo🐱".encode("utf-16")]
['0xff', '0xfe', '0x6f', '0x0', '0x77', '0x0', '0x6f', '0x0', '0x3d', '0xd8', '0x31', '0xdc']

that maps to this interpretation of the bytes:

Symbol               o       w       o          🐱
Number         0x0006f 0x00077 0x0006f     0x1f431
Bytes   ff fe    6f 00   77 00   6f 00 3d d8 31 dc

Symbols:        4
Bytes:         12
Bytes/symbol:   3

0xfffe is a byte order mark (BOM), saying that the bytes are to be interpreted in little endian, i.e. byte-wise inverted (so the number 0x01234567 gets represented by the bytes 0x67452301). For reasons™, many computer architectures are little endian (as opposed to big endian, where no such swapping occurs), whereas networks use big endian. Some people decided that it would be really convenient to be able to dump bytes out in either little or big endian; in the former case, the string gets prefixed with 0xfffe; in the latter case, the string gets prefixed with 0xfeff.

UTF-8

UTF-16 stores every number as a one-byte bit string. Okay, clearly that can't be the whole story... but so long as the numbers stay under 128, it is!

Symbol        P        O       T       A       T       O
Number  0x00050 0x00004f 0x00054 0x00041 0x00054 0x0004f
Bytes        50       4f      54      41      54      4f

For other numbers the scheme looks like this:

0. Determine how many bytes you need

1. The first byte starts with 0b11

2. Add an extra 1 for how many extra byte you end up needing after the second

3. Add a 0 bit as some kind of separator

4. Use the remaining bits to start representing the bits that make up the number

5. Every extra byte starts with 0b10 and uses the other six bits for the number

So let's go back to our smiley cat and see what happens:

Symbol: 🐱
Number: 0x1f431 or 0b1 1111 0100 0011 0001 (17-bits)

Will it fit in two UTF-8 bytes? That would be:

 0b110xxxxx 10xxxxxx
 (16 bits, 11 bits for the number, 5 overhead)

(...where every 'x' is storage allocated for the Unicode number of a given symbol.) That's no good. What about three UTF-8 bytes?

 0b1110xxxx 10xxxxxx 10xxxxxx
 (24 bits, 16 bits for the number, 8 overhead)

Still nope, just by a little. Will four be enough?

 0b11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
 (32 bits, 21 bits for the number, 11 overhead)

That'll work. Time to fill in the gaps:

Symbol: 🐱
Number:      0b000   011111   010000   110001
Format: 0b11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
Result: 0b11110000 10011111 10010000 10110001
              0xf0     0x9f     0x90     0xb1

Let's check our homework:

$ python
>>> "🐱".encode('utf-8')
b'\xf0\x9f\x90\xb1'

So how do we fare on our other test strings?

Symbol       文       字       化       け
Number  0x06587  0x05b57  0x05316  0x03051
Bytes  e6 96 87 e5 ad 97 e5 8c 96 e3 81 91

Symbol       o       w       o          🐱
Number 0x0006f 0x00077 0x0006f     0x1f431
Bytes       6f      77      6f f0 9f 90 b1

             POTATO 文字化け owo🐱
Symbols:          6        4     4
Bytes:            6       12     7
Bytes/symbol:     1        3     1.75

Consequences of this encoding scheme:

If a byte starts with a '0' bit, it's backwards compatibility with seven bit ascii
If a byte starts with the '11' bits, it's the start of a multi-byte sequence
If a byte starts with the '10' bits, it's a continuation of a multi-byte sequence
That means that it's possible to have illegal UTF-8 strings, either because you have '10' bytes without a leading '11' byte, or because you have a different amount of '10' bytes than the '11' byte indicated, or because you have '11' bytes back to back, or an '11' byte followed by a '0' byte, etc..

As you can see the result is kind of a mixed bag. This is a triumph of eurocentrism, but Japanese text doesn't do so hot compared to UTF-16, for example. However, UTF-8 is explicitly a variable-length text encoding. When emojis showed up, software written in UTF-8 just worked, whereas stuff written UTF-16 had its assumptions broken and found itself to be brittle and complicated, to the point where stuff like Java passes the problems onto you, good luck and have fun.

So this is probably part of the reason why UTF-8 ended up being the de-facto standard encoding of the internet.

What also helps is that the Unix/Linux/BSD world bet on UTF-8 over UTF-16, and that's a majority of servers in the world run.

However, you might think, 3 bytes per symbol is not THAT much worse than 4 bytes per symbol. Being able to grab a chunk of 4 kilobytes of UTF-32 text and assert that will certainly decode to 1,024 valid symbols has some kind of intrinsic value, right?

Nevermind the part where UTF-16 created this 0xd8xx-0xdfxx no-go zone for Unicode numbers...

More byte order mark trivia

One legitimately useful use of byte order marks is for strings to embed information about their encoding. As explained before, if a sequence of bytes starts with 0xfeff or 0xfffe, you can reasonably guess that your document is encoded with UTF-16. Other UTF encodings also support byte order marks, although they typically aren't in use:

* The byte order marks in UTF-32 are 0x0000feff and 0xfffe0000.

* The only supported "byte order mark" in UTF-8 is 0xefbbbf. It's not really a byte order mark, but rather a sequence of magic bytes that does nothing but scream "this is UTF-8".

Magic bytes, also known as "file signatures" are tell-tale signs that a given binary blob can be interpreted in a certain way. For example, all PDF files start with "%PDF-".

More about magic bytes.

Part the Fourth, or the Eurocentrism strikes back

In the bad old days of codepages, European languages tried desperately to fit into one byte per symbol.

This was a doomed effort for languages like Chinese and Japanese, which developed their own microcosm of ways to deal with the richness of their rune sets, ways that required dedicated specialized versions of Windows or DOS. Western Windows XP for example can't run Japanese games from that era without installing a bunch of additional software.

Googling "japanese games western windows xp" automatically disables safe search. Google assumes that, if you're doing this, you must be trying to play obscure hentai games ("eroge"s).

This meant that western european text (code page 1252 or Latin-1), central european text (code page 1250 or* Latin-2), greek text (code page 1253), turkish text (code page 1254), etc. was all mutually incompatible. After all, Latin-2 languages needed to be able to represent a Ř, Latin-1 languages needed to be able to represent a "Ë", and the reverse wasn't true.

Is this a good point to mention that code page 1250 is Windows' idea of "Latin-2", separate and different from the standardized ISO/IEC 8859-2 character encoding? Code pages 1252 and ISO/IEC 8859-2 "Latin-1" don't have these disagreements. Again, computers have their problems, but a lot of them can be traced to the people behind them.

Oh, also a bunch of these codepages had to update in 2001 to make room for the Euro, so what's commonly known as "codepage 1250" actually refers to Coded Character Set Identifiers 5346 and... yeah those are enough tangents even for me.

One wart of this system is that, while Latin-1 gets to enjoy freebies like ¼ or ², Latin-2 actually doesn't have enough room for every character-diacritic combination. Not to worry: the codepage has room for standalone diacritics, so you can combine the ˛ diacritic (0xB2) with the A symbol (good old 0x41) to make Ą (0xB2 0x41)... a character that's also available on its own as 0xA5. Oh no!

Suddenly there's no longer just one way to turn characters into symbols, a characteristic that Unicode maintains thanks to backwards compatibility (and other similar issues that result from combining CJK characters, I don't understand this part well enough). You end up in situations where you might need two symbols to make one character, or where the same character can be represented in multiple different ways!

In order to keep sanity, Unicode introduces two important concepts: "normalization" and "grapheme clusters."

"Normalization" acknowledges that there's more than one way to turn text into numbers; if this bothers you, you can pick your favourite way and computer code you don't have to write will either turn all the ˛A's into Ą's or viceversa.

"Grapheme clusters" is a more interesting concept: it basically tries to formalize the idea of character, or what should happen when you're in a text editor and you press the left or right button. The user wants to move forward to the next "character", but where is that?

The concept of grapheme clusters ruins the only good thing UTF-32 had going for it: yeah, you can grab 4 kilobytes of text to grab 1,024 symbols, but doing so might split a grapheme cluster apart, resulting in broken text in both chunks. Unfortunately reality isn't so accomodating for developers!

In other news, you now know that the Correct answer to "what is the length of a string?" is now "trick question! do you want to know how many grapheme clusters there are, how many unicode symbols, or how many bytes?" Most programming languages will give you the easy answer, and leave the accurate answer to dedicated libraries.

That also means you can't just check if two binary strings are equal to conclude that their text is equal, or even that they're using the same unicode numbers.

Let's look for example at the "Hangul Syllable Gag", 각. That character is made by the symbols ㄱ, ㅏ and ㄱ arranged together into the same square like syllables into a word.

$ pip install grapheme unicodedata
$ python3
>>> import grapheme, unicodedata
>>> x = "각" # composed character
>>> y = unicodedata.normalize("NFD", x) # Normal Form Decomposed
>>> y
'각'
>>> [hex(ord(i)) for i in x]
['0xac01']
>>> [hex(ord(i)) for i in y]
['0x1100', '0x1161', '0x11a8']
>>> len(x), grapheme.length(x)
(1, 1)
>>> len(y), grapheme.length(y)
(3, 1)
>>> x == y
False

Not only do computers handle 각 and 각 differently, on Windows they render differently, and the terminal emulator I'm using to write this staunchly refuses to render the latter text correctly. For bonus points, note that the denormalized 각 version and the symbols "ㄱㅏㄱ" from above are made by DIFFERENT unicode numbers that do NOT normalize to one another.

>>> z = 'ㄱㅏㄱ'
>>> [hex(ord(i)) for i in z]
['0x3131', '0x314f', '0x3131']
>>> unicodedata.normalize("NFC", z) # Normal Form Composed, the opposite of NFD
'ㄱㅏㄱ'
>>> _ == z
True

I'm sure this makes perfect sense to Korean writers, but it did trip me up while writing this section. :)

source for this 각 diversion

Part the Fourth, or Just Enough Japanese to be Dangerous

Written Japanese is a language that's sometimes phonetic (= what you say* is what you write) and sometimes ideographic (= a concept maps to one or a few runes).

That's not 100% true; you're still missing information about intonation, stress, etc. but I'm not aware of a phonetic writing system that does encode that information always. Italian uses diacritics for this; for example, 'dà" (he gives) vs "da" (from); for a more old-fashioned example, ancòra (still) vs àncora (anchor).)

Phonetically written Japanese maps relatively cleanly to the Latin alphabet, resulting in Romaji, Japanese you can type on your QWERTY keyboard. As you type every word, your computer asks you if you want an ideogram or a phonetic transcription.

I mean QWERTY keyboard not to say that you can't type Japanese on a DVORAK keyboard but to distinguish it from other input methods that either only work on a touch screen (like the flick keyboard)) or map a physical key to a specific syllable (the thumb-shift method).

I'll teach you this one test sentence — the only one I know:

"I am American"

This is how you go from English to Romaji (what you type) to Japanese (and whether you're using ideographic or phonetic writing)

I                                  → watashi → 私       (ideogram)
[, the topic of this conversation] → ha      → は       (phonetic*)
America-                           → amerika → アメリカ (alternative phonetic)
-n (a person)                      → jin     → 人       (ideogram)
am                                 → desu    → です     (phonetic)

the expected output is like so, without any spaces:

私はアメリカ人です

and you can get it by typing:

watashi ha amerika jin desu

but at every step there you're gonna have to stop and tell the computer what it is you actually want to write.

Thus:

Install a Japanese input method and switch to it. Whatever comes by default on your target OS is fine; just make sure you pick the QWERTY keyboard
Type in watashi, and you'll see the text "わたし" either underlined or highlighted. That's the word in phonetic writing; a dropdown menu or autocompletion should suggest the rune 私 instead. Pick it. On a computer, you should be able to navigate the dropdown with the cursor and then confirm with space.
Type in ha. A dropdown is gonna open again, but you'll have the correct thing "は" already there. On the computer, space should confirm your input. On mobile, hit enter.
Type in amerika. You'll get あめりか which is not what you want. The dropdown should suggest the alternative phonetic writing specific for foreign words: アメリカ. Use that.
Type in jin and pick 人
Type in "desu" and accept the default です
You're done.

With this knowledge now you get to:

Make sure your thing is compatible with an input method to begin with
Type that in your app, hit submit, make sure you get it back exactly right
Check that the input method dropdown doesn't conflict with your website design (are you trying to provide your own autocompletion?)
Check that your user experience plays nice with the input method user experience (depending on the mode of interaction, you might receive an input event for EVERY action the user takes, or you might get the whole string all at once.)
Make sure you can handle both the space the English keyboard uses (' ', 0x20) and the double-width space the Japanese keyboard uses ('　', 0x3000)
Oh yeah, the runes come in both full-width (とり）and in half-width (ﾄﾘ) flavours so, if that makes a difference for you, be ready for that! This is also a problem for Chinese, Korean, emoji, ...
if you're using a terminal and are trying to print a table, make sure you afford double the space for full-width characters (and emojis, and...)

If at any time you can't find the completion you're looking for, that means you typed the wrong thing, or you tried to pick a completion at the wrong time. Even on mobile, there's little tolerance for typoes, and the home row on Android is a little wider than on the English keyboard layout, something that made me typo a lot while writing this. Sometimes, though, you can get away with a lot: I just typed in the whole thing without spaces, "watashiwaamerikajindesu", on Windows 11 and the Microsoft input method managed to get it perfectly right: 私はアメリカ人です.

Thanks to @Lumidaub@social.social.tchncs.de for the corrections.

Part the Fifth, or self-cancelling bugs in unicode-unaware programs

Caution: this is the section with perl 5 in it.

Perl is, vaguely speaking, an evolution of awk and ed. I could write dozens of paragraphs about Perl, but the main purpose here is to impress upon you the kind of issues that can arise when you get Unicode stuff really wrong.

Without worrying too much about the syntax, this is how you would write cat in perl:

$ cat cat.pl
while(my $line = <STDIN>) {
  print $line;
}

Let's put it in action!

$ echo 私はアメリカ人です | perl cat.pl
私はアメリカ人です

Damn, we're good at this unicode stuff, aren't we?

Foreshadowing is...

Well, let's say that, emboldened by our success, we want to implement tac, a program that gives you your text backwards. We already know this is far from a trivial endeavour but let's see how far away from success we are. The code is very similar, we just need to juggle the newlines around a little:

$ cat tac.pl
while(my $line = <STDIN>) {
  chomp $line;
  print scalar reverse $line;
  print "\n";
}

Let's try it in the simple case:

$ echo hello | perl tac.pl
olleh

...and now with japanese!

$ echo 私はアメリカ人です | perl tac.pl
㧁㺺䫂㪃㡃㢂㯁で

Not only did we get garbage back, but we got the WRONG amount of garbage back: 9 runes went in, eight came out. Indeed:

$ echo 㧁㺺䫂㪃㡃㢂㯁で | perl tac.pl
はアメリカ人で

...a whole bunch of stuff went missing! Where did the 私 go?

Clearly something's funky with the terminal, or at least the way the program interfaces with it. Let's see if this makes any difference:

$ cat wtf.pl
my $line = "私はアメリカ人です";
print scalar reverse $line;
print "\n";

$ perl wtf.pl
㧁㺺䫂㪃㡃㢂㯁で

Hrm, same garbage as before... but some very casual googling says that the incantation "use utf8;" might help?

$ cat wtf.pl
use utf8;
my $line = "私はアメリカ人です";
print scalar reverse $line;
print "\n";

$ perl wtf.pl
Wide character in print at wtf.pl line 3.
すで人カリメアは私

oh hey! That's what we want! But there's also some odd warning about wide characters? What's going on with that?

Google says you can get rid of that one with this incantation: "binmode STDOUT, 'utf8';" and sure enough:

$ cat wtf.pl
use utf8;
binmode STDOUT, 'utf8';
my $line = "私はアメリカ人です";
print scalar reverse $line;
print "\n";

$ perl wtf.pl
すで人カリメアは私

The binmode line basically tells Perl that STDOUT is not Latin-1, but UTF-8*.

Let's patch the original program:

$ cat tac.pl
use utf8;
binmode STDOUT, 'utf8';
while(my $line = <STDIN>) {
  chomp $line;
  print scalar reverse $line;
  print "\n";
}

$ echo 私はアメリカ人です | perl tac.pl
ã§ãººä«ãªã¡ã¢ã¯ã§ç

Okay, that's actually worse than what we started with. Oh, and while the previous program turned few characters into fewer characters, this one does the opposite in a mojibake explosion:

$ echo ã§ãººä«ãªã¡ã¢ã¯ã§ç | perl tac.pl
§Ã§Â£Ã¯Â£Ã¢Â£Ã¡Â£ÃªÂ£Ã«Â¤ÃºÂºÂ£Ã§Â£Ã

It's time to unbury the lede:

Perl 5 expects by default STDIN to be in Latin-1
Perl 5 expects by default STDOUT to be in Latin-1
Perl 5 expects by default your source code to be in Latin-1

You can tell Perl 5 to expect UTF-8 in STDIN with "binmode STDIN, 'utf8';""
You can tell Perl 5 to expect UTF-8 in STDOUT with "binmode STDOUT, 'utf8';""
You can tell Perl 5 to expect UTF-8 in your source code with "use utf8;"

Sure enough, applying ALL the fixes results in a correct program:

$ cat tac.pl
use utf8; # now unnecessary
binmode STDOUT, 'utf8';
binmode STDIN, 'utf8';
while(my $line = <STDIN>) {
  chomp $line;
  print scalar reverse $line;
  print "\n";
}

$ echo 私はアメリカ人です | perl tac.pl
すで人カリメアは私

Well, a correct-er program. As we already know, what we really want isn't revering each Unicode symbol, but reversing every Unicode grapheme cluster. Perl has facilities to handle this, but that's outside of the scope of this conversation.

But this begs the question: if our original program was so wrong, how did it even work?

$ cat cat.pl
while(my $line = <STDIN>) {
  print $line;
}

$ echo 私はアメリカ人です | perl tac.pl
私はアメリカ人です

Based on what I told you, This program clearly expects to be dealing with nothing but eight-bit ASCII, but seems to handle Unicode like a champ! What gives? What was the deal with ã§ãººä«ãªã¡ã¢ã¯ã§ç and 㧁㺺䫂㪃㡃㢂㯁で?

It turns out the bugs in cat.pl cancel each other out:

1. The terminal sends UTF-8 in: 私はアメリカ人です

2. The program takes in garbage Latin-1: ç§ã¯ã¢ã¡ã��ã«äººã§ã

3. The program spits out garbage Latin-1: ç§ã¯ã¢ã¡ãªã«äººã§ã

4. The terminal takes the garbage and interprets it as UTF-8: 私はアメリカ人です

Since the program didn't change the byte stream at all, the fact perl didn't touch it at all means it got away with misunderstanding things completely: the bugs cancelled each other out.

In order to have cat.pl do what it's doing correctly, you have to fix BOTH the input and the output. If only apply just one of the fixes, the program misbehaves. Nasty!

As a complicating factor, let's look at what happened when we got 㧁㺺䫂㪃㡃㢂㯁で. To recap:

$ cat broken.pl
while(my $line = <STDIN>) {
  chomp $line;
  print scalar reverse $line;
  print "\n";
}

$ echo 私はアメリカ人です | perl broken.pl
㧁㺺䫂㪃㡃㢂㯁で
$ echo 㧁㺺䫂㪃㡃㢂㯁で | perl broken.pl
はアメリカ人で

As you can see the errors compound much like they did with ç§ã¯ã¢ã¡ãªã«äººã§ã and §Ã§Â£Ã¯Â£Ã¢Â£Ã¡Â£ÃªÂ£Ã«Â¤ÃºÂºÂ£Ã§Â£Ã. However!

$ echo 私はアメリカ人です | perl broken.pl | perl broken.pl
私はアメリカ人です

How is this possible? How did we get the original string back if we call the program like this?

Well, this is what just happened:

1. The terminal sends UTF-8 in: 私はアメリカ人です

2. The first program takes in garbage Latin-1: ç§ã¯ã¢ã¡ãªã«äººã§ã

3. The first program spits out garbage Latin-1: ã§ãººä«ãªã¡ã¢ã¯ã§ç

4. The second program takes in garbage Latin-1: ã§ãººä«ãªã¡ã¢ã¯ã§ç

5. The second program takes in garbage Latin-1: ç§ã¯ã¢ã¡ãªã«äººã§ã

6. The terminal takes the garbage and interprets it as UTF-8: 私はアメリカ人です

Basically running broken.pl twice is the same as the broken cat.pl, just with more steps. But then what's with 㧁㺺䫂㪃㡃㢂㯁で?

Let's analyze what happens when we run broken.pl just once:

1. The terminal sends UTF-8 in: 私はアメリカ人です

2. The first program takes in garbage Latin-1: ç§ã¯ã¢ã¡ãªã«äººã§ã

3. The first program spits out garbage Latin-1: ã§ãººä«ãªã¡ã¢ã¯ã§ç

4. The terminal tries to interpret this as UTF-8, but the byte stream doesn't quite work as UTF-8 text. The terminal skips what it can't understand and prints the rest.

We can verify this is what's going on by using hexdump to print out the actual byte sequence:

$ echo 私はアメリカ人です | perl broken.pl | hexdump -C
00000000  99 81 e3 a7 81 e3 ba ba  e4 ab 82 e3 aa 83 e3 a1  |................|
00000010  83 e3 a2 82 e3 af 81 e3  81 a7 e7 0a              |............|
0000001c

Byte 0x99 is written as 0b10011001. As we now know, that byte starts with 0b10, meaning it's a sequence continuation byte with no corresponding 0b11 character to start the sequence. That's invalid UTF-8. If it was to Python this would already be game over:

$ echo 私はアメリカ人です | perl broken.pl | python -c 'import sys; print(sys.stdin.read())'
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x99 in position 0: invalid start byte

The terminal is more forgiving than that, however. So what if0x99 starts with 0b10? The terminal just ignores it. The next byte, 0x81 also starts with 0b10: that gets discarded too. After that, we get three bytes that happen to start with 0b1110, 0b10 and 0b10: 0xe3a781 looks like valid UTF-8, so the termianl spits out the corresponding character, which happens to be 㧁. Let's check our homework:

$ python3
>>> '㧁'.encode('utf-8')
b'\xe3\xa7\x81'

The terminal is discaring bytes left and right, so when we try and use what the terminal shows us to reverse the result... it should be no surprise that text goes missing and we only get はアメリカ人で.

Take-aways:

Make sure your program treats incoming text with the expected encoding
Make sure your program spits out text in the expected encoding
Make sure your terminal showss text in the expected encoding
UTF-8 means programs get to confuse bytes and text. This is the source of much confusion. For example, Linux doesn't really check that filenames are valid UTF-8 strings... hilarity ensues.
When multiple things are wrong, problems sometimes become worse as you make steps towards fixing them. Don't be discouraged!

Being able to confuse bytes and text, and the bugs that result from that, is the main reason for all of the gnashing of teeth that happened when Python 3 was released and it tried to force developers to get this stuff right, whereas Python 2 wallowed in this confusion.

[...]