I was wondering: how likely is it that random bytes happen to be valid UTF-8? Out of curiosity, I decided to try to work it out.
First of all, what exactly does "valid UTF-8" mean? Looking at the Wikipedia article, we can see the precise definition. All ASCII bytes (i.e. 0x00 to 0x7f) are valid single-byte sequences in UTF-8, and multibyte sequences of two to four bytes are also allowed. The possible forms are:
0bbb bbbb
110b bbbb  10bb bbbb
1110 bbbb  10bb bbbb  10bb bbbb
1111 0bbb  10bb bbbb  10bb bbbb  10bb bbbb
`b` means that the bit in that position may vary, whereas the `0` and `1` bits are fixed by the rules of UTF-8.
So, a valid UTF-8 string must consist of a sequence of these one to four-byte sequences, but there are some additional constraints we need to consider. Valid codepoints are only between 0 and 0x10ffff, so the largest possible four-byte sequence is:
0x10ffff is 1 0000 1111 1111 1111 1111 in binary. Distributing those 21 bits over the four-byte pattern:

codepoint bits:       100    00 1111    11 1111    11 1111   <- 0x10ffff
4-byte UTF-8 pattern: 1111 0bbb  10bb bbbb  10bb bbbb  10bb bbbb
result:               1111 0100  1000 1111  1011 1111  1011 1111   <- f4 8f bf bf
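As a quick cross-check (a one-off sketch using Ruby, which also shows up later in this post), we can ask Ruby itself for the UTF-8 bytes of the highest codepoint:

```ruby
# Encode U+10FFFF as UTF-8 and show its bytes in hex.
bytes_hex = 0x10FFFF.chr(Encoding::UTF_8).bytes.map { |b| format("%02x", b) }
p bytes_hex  # => ["f4", "8f", "bf", "bf"]
```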
An additional constraint is that all codepoints must be encoded in the shortest possible byte sequence, so we can't turn a 1-byte codepoint into, say, a 2-byte codepoint by prepending a bunch of zeroes.
Finally, surrogate codepoints (0xd800 to 0xdfff) are invalid in UTF-8, so we need to be careful to exclude those as well.
All of this can be summarised in this nice table:
Btw, I used diagrams.net to make it, would recommend :)
In that table, I drew ASCII bytes in light grey, continuation bytes in blue, and valid starting bytes in green. Shaded green bytes (namely 0xe0, 0xed, 0xf0 and 0xf4) have restrictions around which continuation byte sequences are valid.
Armed with all this, we can start thinking about the actual probabilities!
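Before counting anything by hand, we can get rough empirical estimates. This is just a sanity-check sketch (the helper name `estimate_utf8_probability` is my own invention) that samples random byte strings and relies on Ruby's built-in `String#valid_encoding?`:

```ruby
# Monte Carlo estimate: what fraction of random byte strings of a
# given length happens to be valid UTF-8?
def estimate_utf8_probability(len, samples: 200_000, seed: 42)
  rng = Random.new(seed)
  valid = samples.times.count do
    # Random bytes come back as ASCII-8BIT; reinterpret them as UTF-8
    # and ask Ruby whether the result is well-formed.
    rng.bytes(len).force_encoding(Encoding::UTF_8).valid_encoding?
  end
  valid.to_f / samples
end

puts estimate_utf8_probability(1)  # should be close to 0.5
puts estimate_utf8_probability(2)  # should be close to 0.279
```

These estimates line up with the exact probabilities worked out below.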
Length        N   P(UTF-8)
-------------------------
     0        1       100%
     1      128        50%
In this table, `N` is the total number of valid UTF-8 strings, while `P(UTF-8)` is the percentage/probability of valid byte sequences (i.e. the probability that a random byte sequence is valid UTF-8). Mathematically, `P(UTF-8) = N/256^n`, where `n` is the length.
These two cases are pretty straightforward, since we don't need to consider multibyte sequences yet.
For 2-byte codepoints, there are 30 possible starting bytes (0xc2 to 0xdf), each of which can have 64 continuation bytes. We can multiply those numbers to get 1920 valid two-byte sequences. We could also have two ASCII bytes, with 128 * 128 = 16384 possible cases. The overall likelihood is thus (1920 + 16384) / 256^2, which is approximately 27.9%.
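Since there are only 65536 two-byte strings, we can also verify this count by brute force, again leaning on Ruby's built-in UTF-8 validator (a quick check, not part of the derivation):

```ruby
# Exhaustively test all 256^2 two-byte strings for UTF-8 validity.
count = 0
(0..255).each do |a|
  (0..255).each do |b|
    s = [a, b].pack("C2").force_encoding(Encoding::UTF_8)
    count += 1 if s.valid_encoding?
  end
end
puts count  # => 18304
```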
So our overall table is:
Length        N   P(valid UTF-8)
-------------------------------
     0        1   100%
     1      128   50%
     2    18304   approx. 27.9%
What about three-byte sequences? Let's start by working out the total number of them.
Looking again at the codepage layout table, we can see that there are 14 possible "uncontroversial" starting bytes, which can have any continuation bytes. This leads to 14 * 64^2 possible cases.
For the starting byte 0xe0, we run into potential overlong encodings. Codepoints with 11 bits or fewer can be represented as a two-byte sequence, so the smallest possible three-byte sequence is:
   e 0        a 0        8 0
1110 0000  1010 0000  1000 0000
The second byte must be at least 0xa0, while the final byte can be anything (well, any normal continuation byte between 0x80 and 0xbf). This leads to 1 * 32 * 64 possible cases.
Next, we need to look at the starting byte 0xed, where we need to avoid counting surrogate codepoints, which are between 0xd800 and 0xdfff. This corresponds to the binary range:
0xd800 to 0xdfff in binary:
  1101 1000 0000 0000  to  1101 1111 1111 1111

in UTF-8:
  1110 1101  1010 0000  1000 0000   (in hex: ed a0 80)
  1110 1101  1011 1111  1011 1111   (in hex: ed bf bf)
Coincidentally, the second byte has exactly the same 0xa0 to 0xbf range that we saw before! This results in another 1 * 32 * 64 possible cases.
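We can see the surrogate boundary directly with Ruby's validator (a quick illustration: ed 9f bf encodes U+D7FF, the last codepoint before the surrogates):

```ruby
# U+D7FF (just below the surrogate range) is fine...
p "\xed\x9f\xbf".force_encoding(Encoding::UTF_8).valid_encoding?  # => true
# ...but U+D800, the first surrogate, is rejected.
p "\xed\xa0\x80".force_encoding(Encoding::UTF_8).valid_encoding?  # => false
```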
Adding everything up, we get:
14 * 64^2 + 1 * 32 * 64 + 1 * 32 * 64 = 61440
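To double-check this sum, we can brute-force every candidate three-byte sequence (lead byte 0xe0 to 0xef, two continuation bytes) through Ruby's validator, which rejects the overlong and surrogate cases for us:

```ruby
# Count all valid three-byte UTF-8 sequences (single codepoints).
count = 0
(0xe0..0xef).each do |lead|
  (0x80..0xbf).each do |c1|
    (0x80..0xbf).each do |c2|
      s = [lead, c1, c2].pack("C3").force_encoding(Encoding::UTF_8)
      count += 1 if s.valid_encoding?
    end
  end
end
puts count  # => 61440
```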
This number is the total count of three-byte sequences, but if we want to count all valid three-byte UTF-8 strings, we need to include a few more possibilities:
  61440              (three-byte sequences)
+ 1920 * 128         (two-byte sequence, followed by ASCII)
+ 128 * 1920         (ASCII followed by a two-byte sequence)
+ 128 * 128 * 128    (three ASCII characters)
= 2650112
Since we're looking for a probability, we need to divide that result by 256^3, which is approximately 15.8%! Adding this to our table, we get:
Length        N   P(valid UTF-8)
-------------------------------
     0        1   100%
     1      128   50%
     2    18304   approx. 27.9%
     3  2650112   approx. 15.8%
Let's now look at four-byte sequences! Four-byte sequences in UTF-8 can start with 0xf0 to 0xf4. Of those, 0xf1, 0xf2 and 0xf3 support all continuation byte sequences, accounting for 3 * 64 * 64 * 64 = 786432 possibilities. Codepoints consisting of 16 or fewer bits are represented by three-byte sequences, so the first 17-bit codepoint (0x10000) is represented as:
   1 0000 0000 0000 0000
-> 1111 0000  1001 0000  1000 0000  1000 0000
in hex: f0 90 80 80
From this, we can see that the second byte (i.e. the first continuation byte) has to be between 0x90 and 0xbf, so there are 48 possibilities. Overall, this means there are 1 * 48 * 64 * 64 = 196608 UTF-8 sequences starting with 0xf0.
The final possible starting byte, 0xf4, is limited because the highest possible codepoint is at 0x10ffff:
0x10ffff
-> in binary: 1 0000 1111 1111 1111 1111
-> to UTF-8:  1111 0100  1000 1111  1011 1111  1011 1111
-> to hex:    f4 8f bf bf
Thus, the second byte can be in the range 0x80 to 0x8f, i.e. there are 16 possibilities. Multiplying everything, we get 1 * 16 * 64 * 64 = 65536 sequences starting with 0xf4.
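Both of these counts can be cross-checked against the codepoint ranges the lead bytes cover: 0xf0 encodes exactly U+10000 to U+3FFFF, and 0xf4 exactly U+100000 to U+10FFFF (a quick sketch, just confirming the arithmetic):

```ruby
# Each codepoint in these ranges corresponds to exactly one
# four-byte sequence with the given lead byte.
p (0x10000..0x3FFFF).size    # lead byte f0 => 196608
p (0x100000..0x10FFFF).size  # lead byte f4 => 65536
```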
Looking at all four-byte UTF-8 sequences, there are 786432 + 196608 + 65536 = 1048576 possibilities.
To check our work thus far, we can try and count all codepoints, to see if it adds up to the expected total of 1112064 (i.e. the total number of Unicode scalar values, or codepoints minus surrogates, or 0x10ffff - 0x800 + 1):
128 (ASCII bytes) + 1920 (two-byte sequences) + 61440 (three-byte sequences) + 1048576 (four-byte sequences) = 1112064 (total number of valid UTF-8 sequences)
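This whole breakdown can also be verified in one go by encoding every Unicode scalar value and tallying the resulting byte lengths (a sanity-check sketch; `Integer#chr` raises on surrogates, so we skip them explicitly):

```ruby
# Group every Unicode scalar value by the length of its UTF-8 encoding.
counts = Hash.new(0)
(0..0x10FFFF).each do |cp|
  next if (0xD800..0xDFFF).cover?(cp)  # surrogates aren't scalar values
  counts[cp.chr(Encoding::UTF_8).bytesize] += 1
end
p counts  # counts 128, 1920, 61440 and 1048576 for lengths 1 to 4
```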
Unicode Glossary: Unicode Scalar Value
To count all possible four-byte UTF-8 strings (that may consist of multiple codepoints as well as just one codepoint), we'll need to add up all the different cases:
  1048576                # four-byte codepoint
+ 61440 * 128 * 2        # three-byte codepoint followed by ASCII, or the other way around
+ 1920 * 1920            # two two-byte codepoints
+ 1920 * 128 * 128 * 3   # two-byte codepoint and two ASCII codepoints, 3 arrangements
+ 128^4                  # four ASCII codepoints
= 383270912
As a probability, 383270912 / 256^4 is approximately 8.9%.
Length          N   P(valid UTF-8)
----------------------------------
     0          1   100%
     1        128   50%
     2      18304   approx. 27.9%
     3    2650112   approx. 15.8%
     4  383270912   approx. 8.9%
UTF-8 doesn't allow individual codepoints longer than four bytes, but strings can of course keep going, so we can still count all possible valid UTF-8 strings. At this point, it makes sense to generalise the counting to any byte length:
f(n) = 128 * f(n-1) + 1920 * f(n-2) + 61440 * f(n-3) + 1048576 * f(n-4)
This recurrence relation expresses the fact that any valid UTF-8 string of length n must start with exactly one of:

- an ASCII byte (128 options), followed by a valid string of length n-1,
- a two-byte sequence (1920 options), followed by a valid string of length n-2,
- a three-byte sequence (61440 options), followed by a valid string of length n-3, or
- a four-byte sequence (1048576 options), followed by a valid string of length n-4.
Wikipedia: Recurrence relation
Wikipedia: Linear recurrence with constant coefficients
Wikipedia: Constant-recursive sequence
If we define f(n) to be 0 for negative numbers, we can even avoid hard-coding the special cases (n in range 1-4) that we've already worked out manually! Here's how this function could be implemented in JavaScript:
function f(n) {
  if (n < 0) { return 0n; }
  if (n == 0) { return 1n; }
  return 128n * f(n-1) + 1920n * f(n-2) + 61440n * f(n-3) + 1048576n * f(n-4);
}
The `n` suffix means that this is using BigInt, JavaScript's relatively recent arbitrary-precision integer type.
Arbitrary-precision integers in JavaScript
Here's another implementation in Ruby, which conveniently supports arbitrary-precision arithmetic by default:
def f(n)
  return 0 if n < 0
  return 1 if n == 0
  128 * f(n-1) + 1920 * f(n-2) + 61440 * f(n-3) + 1048576 * f(n-4)
end
And here are the first few values of this function:
f( 0) = 1
f( 1) = 128
f( 2) = 18304
f( 3) = 2650112
f( 4) = 383270912
f( 5) = 55405707264
f( 6) = 8009826697216
f( 7) = 1157963783864320
f( 8) = 167404246927409152
f( 9) = 24201254918904872960
f(10) = 3498720990639933095936
...
f(20) = 13951585587226318479029158620550762856448000
So we've found that this sequence grows very rapidly, but how quickly does it grow exactly? Let's start by testing some numbers:
f(1) / f(0) = 128
f(2) / f(1) = 143
f(3) / f(2) = 144.78321678321677
f(4) / f(3) = 144.6244204018547
f(5) / f(4) = 144.56016757149575
f(6) / f(5) = 144.56681617744468
f(7) / f(6) = 144.5678948667887
f(8) / f(7) = 144.5677742776661
f(9) / f(8) = 144.5677476115595
The growth rate seems to converge to approximately 144.57.
We can therefore say that, for large n, the probability of a random n-byte sequence being valid UTF-8 shrinks roughly like (144.57 / 256)^n ≈ 0.565^n; in other words, each additional byte multiplies the probability by about 56.5%.
To work out the growth rate more precisely, we can solve the recurrence relation by finding the characteristic polynomial. This kind of goes beyond what I ever learned in maths, so I'm mostly just following Wikipedia. Feel free to let me know if you spot any mistakes or omissions!
General solutions for linear recurrences
The characteristic polynomial is:
p(l) = l^n - 128 * l^(n-1) - 1920 * l^(n-2) - 61440 * l^(n-3) - 1048576 * l^(n-4)
Equivalently, dividing through by l^(n-4):

p(l) = l^4 - 128 * l^3 - 1920 * l^2 - 61440 * l - 1048576
Since this is a standard fourth-degree polynomial, we can find its roots in closed form. The root we're looking for is the polynomial's unique positive real root, which (via Wolfram Alpha) is:
l = 32
  + 4 sqrt(2 (42 + (2 sqrt(450831306) - 30609)^(1/3)/3^(2/3)
               - 661/(3 (2 sqrt(450831306) - 30609))^(1/3)))
  + 1/2 sqrt(10752
             - (128 (2 sqrt(450831306) - 30609)^(1/3))/3^(2/3)
             + 84608/(3 (2 sqrt(450831306) - 30609))^(1/3)
             + 55808 sqrt(2/(42 + (2 sqrt(450831306) - 30609)^(1/3)/3^(2/3)
                              - 661/(3 (2 sqrt(450831306) - 30609))^(1/3))))
This exactly describes the growth rate, and is approximately equal to:
l = approx. 144.56775126843762365502758894749124000555...
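We can also confirm this value numerically, without the closed form, by running a few steps of Newton's method on the quartic (a sketch; since the polynomial has exactly one positive real root and is convex above it, starting from a large value converges safely):

```ruby
# The characteristic polynomial and its derivative.
f  = ->(x) { x**4 - 128 * x**3 - 1920 * x**2 - 61440 * x - 1048576 }
df = ->(x) { 4 * x**3 - 384 * x**2 - 3840 * x - 61440 }

# Newton's method, starting above the root.
x = 200.0
50.times { x -= f.(x) / df.(x) }
puts x  # ≈ 144.56775126843762
```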
At this point I'm somewhat out of ideas, and a bit out of my depth, so I think I'll end it here. Learning that the per-byte probability converges to roughly 56% is definitely kind of interesting though. I wonder how other encodings compare: ASCII is obviously 50%, but I have no idea what it would be for, say, UTF-16 or Shift JIS. Maybe I'll write a part 2 someday 😅