💾 Archived View for tilde.team › ~bp › text.gmi captured on 2024-12-17 at 10:06:37. Gemini links have been rewritten to link to archived content
-=-=-=-=-=-=-
This completely unreadable document aims to be a "just enough" guide on how to understand how computers handle text and why your computer is probably failing you right at this second.
You can probably safely ignore all of the side ramblings. They look like this.
Gemini might be a protocol for simple text, but Unicode itself isn't simple. Please make sure this looks at least vaguely like a bunch of symbols arranged in a slashed square arrangement, without holes or question marks:
POTATO o⟍ 🐱 w ⟍ o o ⟍ w 각 ⟍o 文字トリ
If it doesn't, your gemini client probably cannot render this page properly.
Geminaut, for example, doesn't support enough Unicode to render this page. It looks like it's using some kind of Internet Explorer 8-kinda webview internally, so it's not terribly surprising that emojis don't work.
Worst case scenario, try this:
echo gemini://tilde.team/~bp/text.gmi | ncat -C --ssl tilde.team 1965
A reasonably modern terminal emulator should do a good enough job of rendering this, probably?
...or view this page from the discomfort of your web browser.
You've probably seen one of these tables before:
     0   1   2   3   4   5   6   7   8   9   A   B   C
0x       1   2   3   4   5   6   7   8   9   0   =   -
1x   /   S   T   U   V   W   X   Y   Z   ⟍   ,   (   ⟍
2x   -   J   K   L   M   N   O   P   Q   R  –0   $   *
3x   +   A   B   C   D   E   F   G   H   I  +0   .   )

The CDC 1604 six-bit code. ⟍ = invalid.
It should resemble one of those ASCII tables, except of course it's much smaller, only supports uppercase characters, and the ordering of letters seems... backwards? And why are there not one, not two, but THREE zeroes?
Here's a hint if you're confused on how to read it: the letter A, at the intersection of the row "3x" and the column "1", corresponds to the byte 0x31.
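To make the row/column reading concrete, here is a small Python sketch that looks characters up in the table above. Only the letter rows are transcribed (plus whatever sits in their column 0), and the names LETTER_ROWS and cdc1604_encode are made up for this illustration.

```python
# Part of the six-bit table above, keyed by row digit ("1x", "2x", "3x").
# Each string starts at column 0, so the punctuation in column 0 comes first.
LETTER_ROWS = {
    0x1: "/STUVWXYZ",
    0x2: "-JKLMNOPQR",
    0x3: "+ABCDEFGHI",
}

def cdc1604_encode(text):
    """Return the list of six-bit codes for text, using the rows above."""
    codes = []
    for ch in text:
        for row, cells in LETTER_ROWS.items():
            col = cells.find(ch)
            if col != -1:
                codes.append((row << 4) | col)  # row digit, then column digit
                break
        else:
            raise ValueError(f"{ch!r} is not in the transcribed part of the table")
    return codes

print([hex(c) for c in cdc1604_encode("POTATO")])
```

The lookup is nothing more than "find the row, find the column, glue the two hex digits together", which is all a codepage ever does.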
Ah, this is a fun one. Let's say you're writing a wonderful program in COBOL, and you want to represent the price 23.10 in memory. The first step is to ask your coworkers in accounting how many digits before and after the decimal point they need. Let's say they tell you that you'll never have to worry about representing numbers bigger than 999,999.99, and that two digits after the decimal point are enough for now. You would then represent 23.10 as the byte string "0 0 0 0 2 3 1 0" (yup: one byte per digit.) Ah, but you need to keep track of whether this is an income or an expense: the number needs to have a sign! In that case, you either want +23.10 or –23.10, which you would represent as either "0 0 0 0 2 3 1 +0" or "0 0 0 0 2 3 1 –0". Indeed, COBOL will by default print these amounts as " 23.10+" or " 23.10-", with the sign at the end instead of the start of the number. This is a delightful source of hilarity as you try to write modern computer code to siphon data out of legacy systems and into modern web pages. Implementing support for arithmetic on numbers represented like this is left as an exercise to the reader, although COBOL does support more computing-friendly ways to represent numbers in memory.
As it happens, this is one of the older ways humans used to represent text for computers to consume: on punch cards. In order to punch in an "A", code 0x31 (binary 0011 0001) you would punch in... a bunch of rectangular holes into an overlong airplane boarding pass-sized card that typically could only fit 80 characters.
This is, incidentally, the historical reason why some people will get upset if you put more than 80 characters in a line of code. Most terminals in the 1980s by default could only show 80 characters per line of text. The actual reason to introduce a limit is readability, but this is why the limit is specifically 80 for code and 72 for git commit messages.
In fact, all the way up until the late 1990s, we were drowning in tables like this: arbitrary tables of assignments of bytes to characters, also known as "codepages." If you open a document with the wrong codepage, hilarity ensues; this was such a common problem in Japan, they coined a name for it: 文字化け (a.k.a. mojibake).
I tried typesetting this document in LyX until I got to 文字化け. Turns out getting LaTeX and friends to mix Latin and Japanese in the same document is... basically just not worth the effort? womp.
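If you want to produce some mojibake of your own, this is roughly what opening a document with the wrong codepage looks like in Python. It's only a sketch: Shift JIS and Latin-1 are picked here as example codepages, not because the story above names them.

```python
original = "文字化け"

# Written to disk by a program that uses the Shift JIS codepage...
raw = original.encode("shift_jis")

# ...and opened by a program that assumes Latin-1, where every byte is "valid".
garbled = raw.decode("latin-1")
print(repr(garbled))

# The bytes themselves were never damaged: decoding them with the right
# codepage recovers the text.
recovered = garbled.encode("latin-1").decode("shift_jis")
print(recovered)
```

The garbage is purely a matter of interpretation, which is why guessing the right codepage "fixes" the file without changing a single byte.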
Part of the problem is that these tables go directly from "character you want to type" to "sequence of bytes that represent that character." As it often happens in computer science, you're coupling those two things too tightly, and introducing an intermediate layer smooths things over and makes everything better. Enter Unicode.
The simple idea behind Unicode is it maps characters to numbers, also known as codepoints. How those numbers are then stored to disk is a separate concern. The table starts like it does in seven bit ASCII:
        0 1 2 3 4 5 6 7 8 9 A B C D E F
00000x  ------control characters-------
00001x  ------control characters-------
00002x    ! " # $ % & ' ( ) * + , - . /
00003x  0 1 2 3 4 5 6 7 8 9 : ; < = > ?
00004x  @ A B C D E F G H I J K L M N O
00005x  P Q R S T U V W X Y Z [ \ ] ^ _
        ............. etc .............

The very top of a very long table of symbols.
Don't let false familiarity confuse you. The previous table maps the *byte* 0x31 to the character 'A'. This table maps the *number* 0x41 (a.k.a. the number 65) to the character 'A':
$ python3
>>> chr(65)
'A'
So while the previous table said the word "POTATO" is encoded with bytes 0x27 0x26 0x12 0x31 0x12 0x26, this other table says the word "POTATO" is made from numbers 0x50 (eighty), 0x4F (seventy-nine), 0x54 (eighty-four), 0x41 (sixty-five), 0x54 (eighty-four again), 0x4F (seventy-nine). Don't be confused just because I'm writing bytes and numbers both in hexadecimal form!
$ python3
>>> [ord(x) for x in "POTATO"]
[80, 79, 84, 65, 84, 79]
As it happens, in earlier versions of the Unicode standard (a.k.a. this table mapping characters to numbers), the objective was to cover only modern scripts. "2^16 characters should be enough for everyone," we thought.
Foreshadowing is a literary device in which a writer gives an advance hint of what is to come later in the story.
We still need to store those numbers on disk. There's a bunch of ways you can do that:
* UTF-32
* UTF-16
* UTF-8
Let's go through each one of these options:
UTF-32 basically stores every number as a four-byte bit string. So POTATO becomes:
Symbol  P            O            T            A            T            O
Number  0x00050      0x0004f      0x00054      0x00041      0x00054      0x0004f
Bytes   00 00 00 50  00 00 00 4f  00 00 00 54  00 00 00 41  00 00 00 54  00 00 00 4f

Symbols: 6
Bytes: 24
Bytes/symbol: 4
That's simple, and boring, and as far as English text goes, it's clearly quite inefficient. Let's see how it handles the Japanese word "文字化け" from before:
Symbol  文           字           化           け
Number  0x06587      0x05b57      0x05316      0x03051
Bytes   00 00 65 87  00 00 5b 57  00 00 53 16  00 00 30 51

Symbols: 4
Bytes: 16
Bytes/symbol: 4
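The two tables above can be reproduced with a few lines of Python. This is a sketch of the big-endian flavour only, and utf32_be is a name invented for the example; Python's own "utf-32-be" codec does the same job.

```python
def utf32_be(text):
    """Encode text as big-endian UTF-32: four bytes per Unicode number."""
    return b"".join(ord(ch).to_bytes(4, "big") for ch in text)

for word in ("POTATO", "文字化け"):
    encoded = utf32_be(word)
    # bytes.hex(sep) needs Python 3.8 or newer
    print(word, len(encoded), encoded.hex(" "))
```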
Hrm, that's better but not good: half of our bytes are still 00s! "Surely we can do better!"
UTF-16 basically stores every number as a two-byte* bit string. Here's what that looks like:
Symbol  P        O        T        A        T        O
Number  0x00050  0x0004f  0x00054  0x00041  0x00054  0x0004f
Bytes   00 50    00 4f    00 54    00 41    00 54    00 4f

Symbol  文       字       化       け
Number  0x06587  0x05b57  0x05316  0x03051
Bytes   65 87    5b 57    53 16    30 51

               POTATO  文字化け
Symbols:       6       4
Bytes:         12      8
Bytes/symbol:  2       2

Oh what a joy! Oh what a thing of beauty! Sure, for English text half of our bytes are still zeros, but the Japanese text gets represented in such a clear, crystalline, simple fashion! Surely nothing will go wrong.
Foreshadowing is a literary device in which a writer gives an advance hint of what is to come later in the story.
Indeed, the demo above seems to have been convincing enough that a bunch of undoubtedly smart people decided that UTF-16 was just a plain good idea. Windows picked it up as its internal representation of text. Java picked it up: its "char" type only goes up to 2^16. Every single computer that ever choked and broke on an emoji? It was probably using UTF-16 internally. Whoops!
Yeah, it turns out that soon enough a bunch of stuff needed to be encoded with numbers larger than 65,535. For a long time, this stuff was relatively niche, so developers got to bury their heads in the sand and pretend handling this corner case wasn't a priority. Unrelated: have you ever wondered why some technobros are still bitter about emojis?
Symbol  o        w        o        🐱
Number  0x0006f  0x00077  0x0006f  0x1f431
Bytes   00 6f    00 77    00 6f    whoops
Turns out the explanation above is insufficient to explain how UTF-16 works. The handling of cases like this is actually quite complex:
1. Start with the number
2. Take 0x10000 off
3. Split the 20-bit binary representation of the result in two halves of 10 bits each
4. The first pair of bytes is 0xD800 + the top 10 bits of the result
5. The second pair of bytes is 0xDC00 + the bottom 10 bits of the result
https://en.wikipedia.org/wiki/UTF-16#Code_points_from_U+010000_to_U+10FFFF
Together, the two resulting pairs of bytes are known as a "surrogate pair."
So in our case:
Symbol: 🐱
Step 1.  0x1f431  0b0001 1111 0100 0011 0001
Step 2.  0x0f431  0b0000 1111 0100 0011 0001
Step 3.  0x03d    0b0000 1111 01
         0x031    0b00 0011 0001
Step 4.  0xd83d   0b11 0110 0000 1111 01
Step 5.  0xdc31   0b1101 1100 0011 0001
and thus finally:
Symbol  o        w        o        🐱
Number  0x0006f  0x00077  0x0006f  0x1f431
Bytes   00 6f    00 77    00 6f    d8 3d dc 31

Symbols: 4
Bytes: 10
Bytes/symbol: 2.5
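The five steps above fit in a handful of lines of Python. Treat this as a sketch rather than a reference implementation; utf16_surrogates is a name made up for it.

```python
def utf16_surrogates(codepoint):
    """Split a Unicode number above 0xFFFF into its two UTF-16 code units."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("only numbers from 0x10000 to 0x10FFFF need surrogates")
    offset = codepoint - 0x10000      # step 2: take 0x10000 off
    top = offset >> 10                # step 3: top 10 bits...
    bottom = offset & 0x3FF           # ...and bottom 10 bits
    return 0xD800 + top, 0xDC00 + bottom  # steps 4 and 5

high, low = utf16_surrogates(ord("🐱"))
print(hex(high), hex(low))
```

Writing each of the two results out as two big-endian bytes should match what Python's own "utf-16-be" codec produces for the same character.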
This *extremely* elegant algorithm has a bunch of consequences:
Aren't computers so normal and fine?
Oh, you haven't seen the half of it. We're really getting into the weeds here, but I really want to impress upon you how bad UTF-16 turns out to be.
>>> [hex(int(x)) for x in "owo🐱".encode("utf-16")]
['0xff', '0xfe', '0x6f', '0x0', '0x77', '0x0', '0x6f', '0x0', '0x3d', '0xd8', '0x31', '0xdc']
that maps to this interpretation of the bytes:
Symbol         o        w        o        🐱
Number         0x0006f  0x00077  0x0006f  0x1f431
Bytes   ff fe  6f 00    77 00    6f 00    3d d8 31 dc

Symbols: 4
Bytes: 12
Bytes/symbol: 3
0xfffe is a byte order mark (BOM), saying that the bytes are to be interpreted in little endian, i.e. byte-wise inverted (so the number 0x01234567 gets represented by the bytes 0x67452301). For reasons™, many computer architectures are little endian (as opposed to big endian, where no such swapping occurs), whereas networks use big endian. Some people decided that it would be really convenient to be able to dump bytes out in either little or big endian; in the former case, the string gets prefixed with 0xfffe; in the latter case, the string gets prefixed with 0xfeff.
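Here is a short Python sketch of the byte order business described above. It uses the explicit "utf-16-be" and "utf-16-le" codecs (which never write a byte order mark) and then glues the mark on by hand, just to show where those first two bytes come from.

```python
import codecs

number = 0x01234567
print(number.to_bytes(4, "big").hex())     # big endian: no swapping
print(number.to_bytes(4, "little").hex())  # little endian: byte-wise inverted

text = "owo🐱"
big = text.encode("utf-16-be")     # explicit byte order, no BOM is written
little = text.encode("utf-16-le")
print(big.hex(" "))
print(little.hex(" "))

# Prefixing the matching mark gives what a BOM-aware writer would produce.
with_bom = codecs.BOM_UTF16_LE + little
print(with_bom.hex(" "))

# The generic "utf-16" decoder reads the mark and picks the byte order itself.
print(with_bom.decode("utf-16"))
```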
UTF-8 stores every number as a one-byte bit string. Okay, clearly that can't be the whole story... but so long as the numbers stay under 128, it is!
Symbol  P        O        T        A        T        O
Number  0x00050  0x0004f  0x00054  0x00041  0x00054  0x0004f
Bytes   50       4f       54       41       54       4f
For other numbers the scheme looks like this:
0. Determine how many bytes you need
1. The first byte starts with 0b11
2. Add an extra 1 for every extra byte you end up needing after the second
3. Add a 0 bit as some kind of separator
4. Use the remaining bits to start representing the bits that make up the number
5. Every extra byte starts with 0b10 and uses the other six bits for the number
So let's go back to our smiley cat and see what happens:
Symbol: 🐱
Number: 0x1f431 or 0b1 1111 0100 0011 0001 (17 bits)
Will it fit in two UTF-8 bytes? That would be:
0b110xxxxx 10xxxxxx (16 bits, 11 bits for the number, 5 overhead)
(...where every 'x' is storage allocated for the Unicode number of a given symbol.) That's no good. What about three UTF-8 bytes?
0b1110xxxx 10xxxxxx 10xxxxxx (24 bits, 16 bits for the number, 8 overhead)
Still nope, just by a little. Will four be enough?
0b11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (32 bits, 21 bits for the number, 11 overhead)
That'll work. Time to fill in the gaps:
Symbol:  🐱
Number:  0b000 011111 010000 110001
Format:  0b11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
Result:  0b11110000 10011111 10010000 10110001
         0xf0       0x9f     0x90     0xb1
Let's check our homework:
$ python
>>> "🐱".encode('utf-8')
b'\xf0\x9f\x90\xb1'
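For the curious, the whole scheme can be sketched as a Python function. utf8_encode is a made-up name, and the sketch cuts one corner: unlike a real encoder, it doesn't reject numbers in the 0xD800-0xDFFF range.

```python
def utf8_encode(codepoint):
    """Encode one Unicode number as UTF-8, following the steps above."""
    if codepoint < 0x80:
        return bytes([codepoint])            # one byte, same as ASCII
    # How many extra bytes we need, and the marker bits of the first byte.
    if codepoint < 0x800:
        extra, marker = 1, 0b11000000
    elif codepoint < 0x10000:
        extra, marker = 2, 0b11100000
    elif codepoint <= 0x10FFFF:
        extra, marker = 3, 0b11110000
    else:
        raise ValueError("not a Unicode number")
    # First byte: marker bits plus the topmost bits of the number.
    out = [marker | (codepoint >> (6 * extra))]
    # Every extra byte: 0b10 followed by the next six bits of the number.
    for i in range(extra - 1, -1, -1):
        out.append(0b10000000 | ((codepoint >> (6 * i)) & 0b111111))
    return bytes(out)

print(utf8_encode(0x1F431).hex(" "))
```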
So how do we fare on our other test strings?
Symbol  文        字        化        け
Number  0x06587   0x05b57   0x05316   0x03051
Bytes   e6 96 87  e5 ad 97  e5 8c 96  e3 81 91

Symbol  o        w        o        🐱
Number  0x0006f  0x00077  0x0006f  0x1f431
Bytes   6f       77       6f       f0 9f 90 b1

               POTATO  文字化け  owo🐱
Symbols:       6       4         4
Bytes:         6       12        7
Bytes/symbol:  1       3         1.75
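If you'd rather not count bytes by hand, a quick Python sketch can compare the three encodings on the same test strings. bytes_per_symbol is a name invented here, and the "-be" codecs are used so that no byte order mark gets counted.

```python
def bytes_per_symbol(text, encoding):
    """Average number of encoded bytes per Unicode number in text."""
    return len(text.encode(encoding)) / len(text)

for word in ("POTATO", "文字化け", "owo🐱"):
    row = {enc: bytes_per_symbol(word, enc)
           for enc in ("utf-8", "utf-16-be", "utf-32-be")}
    print(word, row)
```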
Consequences of this encoding scheme:
As you can see the result is kind of a mixed bag. This is a triumph of eurocentrism, but Japanese text doesn't do so hot compared to UTF-16, for example. However, UTF-8 is explicitly a variable-length text encoding. When emojis showed up, software written for UTF-8 just worked, whereas stuff written for UTF-16 had its assumptions broken and found itself to be brittle and complicated, to the point where stuff like Java passes the problems onto you, good luck and have fun.
So this is probably part of the reason why UTF-8 ended up being the de-facto standard encoding of the internet.
What also helps is that the Unix/Linux/BSD world bet on UTF-8 over UTF-16, and that's what a majority of servers in the world run.
However, you might think, 4 bytes per symbol is not THAT much worse than 3 bytes per symbol. Being able to grab a chunk of 4 kilobytes of UTF-32 text and assert that will certainly decode to 1,024 valid symbols has some kind of intrinsic value, right?
Never mind the part where UTF-16 created this 0xd8xx-0xdfxx no-go zone for Unicode numbers...
One legitimately useful use of byte order marks is for strings to embed information about their encoding. As explained before, if a sequence of bytes starts with 0xfeff or 0xfffe, you can reasonably guess that your document is encoded with UTF-16. Other UTF encodings also support byte order marks, although they typically aren't in use:
* The byte order marks in UTF-32 are 0x0000feff and 0xfffe0000.
* The only supported "byte order mark" in UTF-8 is 0xefbbbf. It's not really a byte order mark, but rather a sequence of magic bytes that does nothing but scream "this is UTF-8".
Magic bytes, also known as "file signatures", are tell-tale signs that a given binary blob can be interpreted in a certain way. For example, all PDF files start with "%PDF-".
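Sniffing an encoding from its byte order mark could look something like this in Python. It's a sketch: sniff_bom and BOMS are names made up for the example, and a document without any mark simply gets no answer.

```python
import codecs

# Longest marks first: the UTF-32 little-endian mark starts with the same
# two bytes as the UTF-16 little-endian one.
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

def sniff_bom(data):
    """Guess an encoding from a leading byte order mark, or return None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None

print(sniff_bom("owo".encode("utf-16")))     # byte order depends on the machine
print(sniff_bom("owo".encode("utf-8-sig")))  # UTF-8 with the magic bytes
print(sniff_bom(b"plain ASCII"))
```

Note that this is still a guess: UTF-16 little-endian text that starts with a mark followed by a NUL character is indistinguishable from UTF-32 little-endian here.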
In the bad old days of codepages, European languages tried desperately to fit into one byte per symbol.
This was a doomed effort for languages like Chinese and Japanese, which developed their own microcosm of ways to deal with the richness of their rune sets, ways that required dedicated specialized versions of Windows or DOS. Western Windows XP for example can't run Japanese games from that era without installing a bunch of additional software.
Googling "japanese games western windows xp" automatically disables safe search. Google assumes that, if you're doing this, you must be trying to play obscure hentai games ("eroge"s).
This meant that Western European text (code page 1252 or Latin-1), Central European text (code page 1250 or* Latin-2), Greek text (code page 1253), Turkish text (code page 1254), etc. was all mutually incompatible. After all, Latin-2 languages needed to be able to represent a "Ř", Latin-1 languages needed to be able to represent a "Ë", and the reverse wasn't true.
Is this a good point to mention that code page 1250 is Windows' idea of "Latin-2", separate and different from the standardized ISO/IEC 8859-2 character encoding? Code page 1252 and ISO/IEC 8859-1 "Latin-1" mostly don't have these disagreements: they only differ in the 0x80-0x9F range, which ISO/IEC 8859-1 reserves for control characters. Again, computers have their problems, but a lot of them can be traced to the people behind them.
Oh, also a bunch of these codepages had to update in 2001 to make room for the Euro, so what's commonly known as "codepage 1250" actually refers to Coded Character Set Identifiers 5346 and... yeah those are enough tangents even for me.
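To see how incompatible the code pages mentioned above are, you can ask Python to decode the very same byte with each of them. The byte 0xD8 is an arbitrary pick for this sketch.

```python
byte = b"\xd8"

# One byte, four code pages, several different characters.
for codepage in ("cp1250", "cp1252", "cp1253", "cp1254"):
    print(codepage, byte.decode(codepage))

# And the other direction: code page 1252 simply has no byte for Ř.
try:
    "Ř".encode("cp1252")
except UnicodeEncodeError:
    print("cp1252 cannot represent Ř")
```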
One wart of this system is that, while Latin-1 gets to enjoy freebies like ¼ or ², Latin-2 actually doesn't have enough room for every character-diacritic combination. Not to worry: the codepage has room for standalone diacritics, so you can combine the ˛ diacritic (0xB2) with the A symbol (good old 0x41) to make Ą (0xB2 0x41)... a character that's also available on its own as 0xA5. Oh no!
Suddenly there's no longer just one way to turn characters into symbols, a characteristic that Unicode maintains thanks to backwards compatibility (and other similar issues that result from combining CJK characters, I don't understand this part well enough). You end up in situations where you might need two symbols to make one character, or where the same character can be represented in multiple different ways!
In order to keep sanity, Unicode introduces two important concepts: "normalization" and "grapheme clusters."
"Normalization" acknowledges that there's more than one way to turn text into numbers; if this bothers you, you can pick your favourite way and computer code you don't have to write will either turn all the ˛A's into Ą's or vice versa.
"Grapheme clusters" is a more interesting concept: it basically tries to formalize the idea of character, or what should happen when you're in a text editor and you press the left or right button. The user wants to move forward to the next "character", but where is that?
The concept of grapheme clusters ruins the only good thing UTF-32 had going for it: yeah, you can grab 4 kilobytes of text to grab 1,024 symbols, but doing so might split a grapheme cluster apart, resulting in broken text in both chunks. Unfortunately reality isn't so accommodating for developers!
In other news, you now know that the correct answer to "what is the length of a string?" is "trick question! Do you want to know how many grapheme clusters there are, how many Unicode symbols, or how many bytes?" Most programming languages will give you the easy answer, and leave the accurate answer to dedicated libraries.
That also means you can't just check if two binary strings are equal to conclude that their text is equal, or even that they're using the same unicode numbers.
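Here's a small Python sketch of both points, using Ą as the guinea pig. One assumption worth flagging: in Unicode the combining ogonek (U+0328) goes after the letter it attaches to. same_text is a name invented for the example.

```python
import unicodedata

precomposed = "\u0104"   # Ą as a single Unicode number
combined = "A\u0328"     # A followed by a combining ogonek

def same_text(a, b):
    """Compare two strings after bringing both to the same normal form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(precomposed == combined)           # raw comparison of Unicode numbers
print(same_text(precomposed, combined))  # comparison after normalization

for s in (precomposed, combined):
    # Unicode numbers vs. UTF-8 bytes. Counting grapheme clusters needs a
    # dedicated library, which this sketch does not use.
    print(len(s), len(s.encode("utf-8")))
```

Which normal form you pick matters less than picking the same one on both sides of the comparison.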
Let's look for example at the "Hangul Syllable Gag", 각. That character is made by the symbols ㄱ, ㅏ and ㄱ arranged together into the same square like syllables into a word.
$ pip install grapheme
$ python3
>>> import grapheme, unicodedata
>>> x = "각"  # composed character
>>> y = unicodedata.normalize("NFD", x)  # Normal Form Decomposed
>>> y
'각'
>>> [hex(ord(i)) for i in x]
['0xac01']
>>> [hex(ord(i)) for i in y]
['0x1100', '0x1161', '0x11a8']
>>> len(x), grapheme.length(x)
(1, 1)
>>> len(y), grapheme.length(y)
(3, 1)
>>> x == y
False
Not only do computers handle 각 and 각 differently, on Windows they render differently, and the terminal emulator I'm using to write this staunchly refuses to render the latter text correctly. For bonus points, note that the denormalized 각 version and the symbols "ㄱㅏㄱ" from above are made by DIFFERENT unicode numbers that do NOT normalize to one another.
>>> z = 'ㄱㅏㄱ'
>>> [hex(ord(i)) for i in z]
['0x3131', '0x314f', '0x3131']
>>> unicodedata.normalize("NFC", z)  # Normal Form Composed, the opposite of NFD
'ㄱㅏㄱ'
>>> _ == z
True
I'm sure this makes perfect sense to Korean writers, but it did trip me up while writing this section. :)
Written Japanese is a language that's sometimes phonetic (= what you say* is what you write) and sometimes ideographic (= a concept maps to one or a few runes).
That's not 100% true; you're still missing information about intonation, stress, etc. but I'm not aware of a phonetic writing system that always encodes that information. Italian uses diacritics for this; for example, "dà" (he gives) vs "da" (from); for a more old-fashioned example, ancòra (still) vs àncora (anchor).
Phonetically written Japanese maps relatively cleanly to the Latin alphabet, resulting in Romaji, Japanese you can type on your QWERTY keyboard. As you type every word, your computer asks you if you want an ideogram or a phonetic transcription.
I mean QWERTY keyboard not to say that you can't type Japanese on a DVORAK keyboard but to distinguish it from other input methods that either only work on a touch screen (like the flick keyboard) or map a physical key to a specific syllable (the thumb-shift method).
I'll teach you this one test sentence — the only one I know:
"I am American"
This is how you go from English to Romaji (what you type) to Japanese (and whether you're using ideographic or phonetic writing):
I → watashi → 私 (ideogram)
[, the topic of this conversation] → ha → は (phonetic*)
America- → amerika → アメリカ (alternative phonetic)
-n (a person) → jin → 人 (ideogram)
am → desu → です (phonetic)
The expected output is like so, without any spaces:
私はアメリカ人です
and you can get it by typing:
watashi ha amerika jin desu
but at every step there you're gonna have to stop and tell the computer what it is you actually want to write.
Thus:
With this knowledge now you get to:
If at any time you can't find the completion you're looking for, that means you typed the wrong thing, or you tried to pick a completion at the wrong time. Even on mobile, there's little tolerance for typos, and the home row on Android is a little wider than on the English keyboard layout, something that made me typo a lot while writing this. Sometimes, though, you can get away with a lot: I just typed in the whole thing without spaces, "watashiwaamerikajindesu", on Windows 11 and the Microsoft input method managed to get it perfectly right: 私はアメリカ人です.
Thanks to @Lumidaub@social.social.tchncs.de for the corrections.
Caution: this is the section with perl 5 in it.
Perl is, vaguely speaking, an evolution of awk and ed. I could write dozens of paragraphs about Perl, but the main purpose here is to impress upon you the kind of issues that can arise when you get Unicode stuff really wrong.
Without worrying too much about the syntax, this is how you would write cat in perl:
$ cat cat.pl
while(my $line = <STDIN>) {
    print $line;
}
Let's put it in action!
$ echo 私はアメリカ人です | perl cat.pl
私はアメリカ人です
Damn, we're good at this unicode stuff, aren't we?
Foreshadowing is...
Well, let's say that, emboldened by our success, we want to implement tac, a program that gives you your text backwards. We already know this is far from a trivial endeavour but let's see how far away from success we are. The code is very similar, we just need to juggle the newlines around a little:
$ cat tac.pl
while(my $line = <STDIN>) {
    chomp $line;
    print scalar reverse $line;
    print "\n";
}
Let's try it in the simple case:
$ echo hello | perl tac.pl
olleh
...and now with japanese!
$ echo 私はアメリカ人です | perl tac.pl
㧁㺺䫂㪃㡃㢂㯁で
Not only did we get garbage back, but we got the WRONG amount of garbage back: nine runes went in, eight came out. Indeed:
$ echo 㧁㺺䫂㪃㡃㢂㯁で | perl tac.pl
はアメリカ人で
...a whole bunch of stuff went missing! Where did the 私 go?
Clearly something's funky with the terminal, or at least the way the program interfaces with it. Let's see if this makes any difference:
$ cat wtf.pl
my $line = "私はアメリカ人です";
print scalar reverse $line;
print "\n";
$ perl wtf.pl
㧁㺺䫂㪃㡃㢂㯁で
Hrm, same garbage as before... but some very casual googling says that the incantation "use utf8;" might help?
$ cat wtf.pl
use utf8;
my $line = "私はアメリカ人です";
print scalar reverse $line;
print "\n";
$ perl wtf.pl
Wide character in print at wtf.pl line 3.
すで人カリメアは私
oh hey! That's what we want! But there's also some odd warning about wide characters? What's going on with that?
Google says you can get rid of that one with this incantation: "binmode STDOUT, 'utf8';" and sure enough:
$ cat wtf.pl
use utf8;
binmode STDOUT, 'utf8';
my $line = "私はアメリカ人です";
print scalar reverse $line;
print "\n";
$ perl wtf.pl
すで人カリメアは私
The binmode line basically tells Perl that STDOUT is not Latin-1, but UTF-8*.
Let's patch the original program:
$ cat tac.pl
use utf8;
binmode STDOUT, 'utf8';
while(my $line = <STDIN>) {
    chomp $line;
    print scalar reverse $line;
    print "\n";
}
$ echo 私はアメリカ人です | perl tac.pl
ã§ãººä«ãªã¡ã¢ã¯ã§ç
Okay, that's actually worse than what we started with. Oh, and while the previous program turned few characters into fewer characters, this one does the opposite in a mojibake explosion:
$ echo ã§ãººä«ãªã¡ã¢ã¯ã§ç | perl tac.pl
§Ã§Â£Ã¯Â£Ã¢Â£Ã¡Â£ÃªÂ£Ã«Â¤ÃºÂºÂ£Ã§Â£Ã
It's time to unbury the lede:
Sure enough, applying ALL the fixes results in a correct program:
$ cat tac.pl
use utf8; # now unnecessary
binmode STDOUT, 'utf8';
binmode STDIN, 'utf8';
while(my $line = <STDIN>) {
    chomp $line;
    print scalar reverse $line;
    print "\n";
}
$ echo 私はアメリカ人です | perl tac.pl
すで人カリメアは私
Well, a correct-er program. As we already know, what we really want isn't reversing each Unicode symbol, but reversing every Unicode grapheme cluster. Perl has facilities to handle this, but that's outside of the scope of this conversation.
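For comparison, here's the same lesson as a Python sketch (Python instead of Perl purely to keep the example short): decode the bytes into text first, reverse the text, and only then turn it back into bytes. Like the Perl version, this reverses Unicode symbols, not grapheme clusters.

```python
line = "私はアメリカ人です"
raw = line.encode("utf-8")   # what actually travels through the pipe

# Decode first, reverse the Unicode numbers, then print the text.
print(raw.decode("utf-8")[::-1])

# Reversing the raw bytes instead tears every multi-byte sequence apart.
try:
    raw[::-1].decode("utf-8")
except UnicodeDecodeError as error:
    print("not valid UTF-8 any more:", error)
```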
But this begs the question: if our original program was so wrong, how did it even work?
$ cat cat.pl
while(my $line = <STDIN>) {
    print $line;
}
$ echo 私はアメリカ人です | perl cat.pl
私はアメリカ人です
Based on what I told you, this program clearly expects to be dealing with nothing but eight-bit Latin-1, but seems to handle Unicode like a champ! What gives? What was the deal with ã§ãººä«ãªã¡ã¢ã¯ã§ç and 㧁㺺䫂㪃㡃㢂㯁で?
It turns out the bugs in cat.pl cancel each other out:
1. The terminal sends UTF-8 in: 私はアメリカ人です
2. The program takes in garbage Latin-1: ç§ã¯ã¢ã¡ãªã«äººã§ã
3. The program spits out garbage Latin-1: ç§ã¯ã¢ã¡ãªã«äººã§ã
4. The terminal takes the garbage and interprets it as UTF-8: 私はアメリカ人です
Since the program didn't change the byte stream at all, Perl got away with misunderstanding things completely: the bugs cancelled each other out.
In order to have cat.pl do what it's doing correctly, you have to fix BOTH the input and the output. If you apply just one of the fixes, the program misbehaves. Nasty!
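The cancelling-out can be modelled in a few lines of Python. This is a sketch under one assumption: that Perl's default behaviour amounts to treating every byte as one Latin-1 character, which is what Python's "latin-1" codec does.

```python
sent = "私はアメリカ人です".encode("utf-8")   # bytes the terminal sends

# Both bugs: input misread as Latin-1, output written back as Latin-1.
misread = sent.decode("latin-1")
print(misread.encode("latin-1") == sent)

# Only the output fixed: the misread text is now encoded as UTF-8, so every
# non-ASCII Latin-1 character turns into two bytes.
half_fixed = misread.encode("utf-8")
print(len(sent), len(half_fixed))

# Both fixed: decode as UTF-8, encode as UTF-8.
print(sent.decode("utf-8").encode("utf-8") == sent)
```

The half-fixed case is where the doubling in size comes from: 27 bytes go in, twice as many come out.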
As a complicating factor, let's look at what happened when we got 㧁㺺䫂㪃㡃㢂㯁で. To recap:
$ cat broken.pl
while(my $line = <STDIN>) {
    chomp $line;
    print scalar reverse $line;
    print "\n";
}
$ echo 私はアメリカ人です | perl broken.pl
㧁㺺䫂㪃㡃㢂㯁で
$ echo 㧁㺺䫂㪃㡃㢂㯁で | perl broken.pl
はアメリカ人で
As you can see the errors compound much like they did with ç§ã¯ã¢ã¡ãªã«äººã§ã and §Ã§Â£Ã¯Â£Ã¢Â£Ã¡Â£ÃªÂ£Ã«Â¤ÃºÂºÂ£Ã§Â£Ã. However!
$ echo 私はアメリカ人です | perl broken.pl | perl broken.pl
私はアメリカ人です
How is this possible? How did we get the original string back if we call the program like this?
Well, this is what just happened:
1. The terminal sends UTF-8 in: 私はアメリカ人です
2. The first program takes in garbage Latin-1: ç§ã¯ã¢ã¡ãªã«äººã§ã
3. The first program spits out garbage Latin-1: ã§ãººä«ãªã¡ã¢ã¯ã§ç
4. The second program takes in garbage Latin-1: ã§ãººä«ãªã¡ã¢ã¯ã§ç
5. The second program spits out garbage Latin-1: ç§ã¯ã¢ã¡ãªã«äººã§ã
6. The terminal takes the garbage and interprets it as UTF-8: 私はアメリカ人です
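A minimal Python model of the six steps above, assuming broken.pl boils down to reversing the bytes of each line (newline handling is left out, and broken_reverse is a name made up for this sketch):

```python
def broken_reverse(data):
    """Model of broken.pl on one line: reverse bytes, not characters."""
    return data[::-1]

sent = "私はアメリカ人です".encode("utf-8")
once = broken_reverse(sent)    # what travels through the pipe
twice = broken_reverse(once)   # what reaches the terminal

print(once.hex(" "))
print(twice.decode("utf-8"))
```

Since no terminal gets to look at the bytes in between, nothing is lost and the second reversal undoes the first.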
Basically running broken.pl twice is the same as the broken cat.pl, just with more steps. But then what's with 㧁㺺䫂㪃㡃㢂㯁で?
Let's analyze what happens when we run broken.pl just once:
1. The terminal sends UTF-8 in: 私はアメリカ人です
2. The first program takes in garbage Latin-1: ç§ã¯ã¢ã¡ãªã«äººã§ã
3. The first program spits out garbage Latin-1: ã§ãººä«ãªã¡ã¢ã¯ã§ç
4. The terminal tries to interpret this as UTF-8, but the byte stream doesn't quite work as UTF-8 text. The terminal skips what it can't understand and prints the rest.
We can verify this is what's going on by using hexdump to print out the actual byte sequence:
$ echo 私はアメリカ人です | perl broken.pl | hexdump -C
00000000  99 81 e3 a7 81 e3 ba ba  e4 ab 82 e3 aa 83 e3 a1  |................|
00000010  83 e3 a2 82 e3 af 81 e3  81 a7 e7 0a              |............|
0000001c
Byte 0x99 is written as 0b10011001. As we now know, that byte starts with 0b10, meaning it's a sequence continuation byte with no corresponding 0b11 byte to start the sequence. That's invalid UTF-8. If it were up to Python this would already be game over:
$ echo 私はアメリカ人です | perl broken.pl | python -c 'import sys; print(sys.stdin.read())'
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x99 in position 0: invalid start byte
The terminal is more forgiving than that, however. So what if 0x99 starts with 0b10? The terminal just ignores it. The next byte, 0x81, also starts with 0b10: that gets discarded too. After that, we get three bytes that happen to start with 0b1110, 0b10 and 0b10: 0xe3a781 looks like valid UTF-8, so the terminal spits out the corresponding character, which happens to be 㧁. Let's check our homework:
$ python3 >>> '㧁'.encode('utf-8') b'\xe3\xa7\x81'
The terminal is discarding bytes left and right, so when we try and use what the terminal shows us to reverse the result... it should be no surprise that text goes missing and we only get はアメリカ人で.
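The forgiving terminal can be approximated in Python by decoding with errors="ignore". That's an assumption about how this particular terminal behaves (many would print a replacement character instead of dropping the byte), and lenient_terminal is a made-up name.

```python
def lenient_terminal(data):
    """Model of a forgiving terminal: silently drop bytes that are not valid UTF-8."""
    return data.decode("utf-8", errors="ignore")

sent = "私はアメリカ人です".encode("utf-8")

shown = lenient_terminal(sent[::-1])   # what we see after one run
print(shown, len(shown))

# Copying what the terminal showed and feeding it back in: the dropped
# bytes are gone for good.
shown_again = lenient_terminal(shown.encode("utf-8")[::-1])
print(shown_again)
```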
Take-aways:
Being able to confuse bytes and text, and the bugs that result from that, is the main reason for all of the gnashing of teeth that happened when Python 3 was released and it tried to force developers to get this stuff right, whereas Python 2 wallowed in this confusion.
[...]