💾 Archived View for thrig.me › blog › 2023 › 08 › 15 › is-it-utf8.gmi captured on 2024-07-09 at 01:08:40. Gemini links have been rewritten to link to archived content

-=-=-=-=-=-=-

Is it UTF-8?

Can random bytes be valid UTF-8?

Another method is to brute force the problem with a Monte Carlo simulation, which at best will waste lots of CPU. However, where Monte Carlo is practical this may help confirm that a better approach has not gone off the rails. Or, that both the Monte Carlo and the better approach are wrong in ways that produce the same results, which maybe isn't so likely? Richard Feynman said things about actually verifying results in his 1974 commencement address at Caltech.

    // istuf8 - what are the odds of random bytes being valid UTF-8?
    #include <assert.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    // doas pkg_add libutf8proc
    #include <libutf8proc/utf8proc.h>
    
    double
    byteaudit(size_t len, size_t trials)
    {
        size_t pass = 0;
        assert(len > 0);
        utf8proc_int32_t ref;
        utf8proc_uint8_t *in = malloc(len);
        if (!in) abort();
        for (size_t i = 0; i < trials; i++) {
            // PORTABILITY this may need libbsd or something else
            arc4random_buf(in, len);
            utf8proc_uint8_t *ip  = in;
            utf8proc_ssize_t left = (utf8proc_ssize_t) len;
            while (1) {
                utf8proc_ssize_t ret = utf8proc_iterate(ip, left, &ref);
                if (ret < 1) break;
                ip += ret;
                left -= ret;
                if (left == 0) {
                    pass++;
                    break;
                }
            }
        }
        free(in);
        return (double) pass / (double) trials * 100.0;
    }
    
    int
    main(void)
    {
        for (size_t i = 1; i <= 8; i++)
            printf("%lu %.2lf\n", i, byteaudit(i, 10000000U));
    }

isutf8.c

    $ CFLAGS=`pkg-config --cflags --libs libutf8proc` make isutf8
    cc -I/usr/local/include -DUTF8PROC_EXPORTS -L/usr/local/lib -lutf8proc   -o isutf8 isutf8.c
    $ ./isutf8
    1 49.97
    2 27.95
    3 15.79
    4 8.93
    5 5.05
    6 2.85
    7 1.61
    8 0.91

These results agree fairly well with those from the original article, even if they do chew up a lot more CPU getting there. Someone with actual stats chops could doubtless do better with distributions and p-values than my "fairly well" line.

Length   P(valid UTF-8)   Monte Carlo
-------------------------------------
     1            50  %      ~ 49.97% 
     2          ~ 27.9%      ~ 27.95%
     3          ~ 15.8%      ~ 15.79%
     4          ~  8.9%      ~  8.93%

My numbers were wildly wrong until I got all the obvious bugs out of the code. For instance code that answers "can valid UTF-8 be found at the beginning of a buffer?" will get rather different results. A program might stop reading at the first NUL or '\0' and thus ignore gibberish beyond that that another program would fail on having read past that NUL. So it will depend on the data, how it is parsed before reaching any Unicode library, and then how that library and the code that calls the library deals with what the initial parse did. In other words, it can be complicated.

Less Terrible Code

The above code works. However, it is hilariously inefficient, especially for shorter byte lengths where one could simply test each value. Also fewer trials could be used to obtain good enough results, especially if you have older and slower hardware, and enough statistical chops to be confident that the results are indeed good enough.

isutf82.c

    $ CFLAGS=`pkg-config --cflags --libs libutf8proc` make isutf82
    ...
    $ ./isutf82 > out.txt
    $ sed 3q out.txt
    50.00 27.93 15.72 8.96 5.29 2.79 1.58 0.94
    50.00 27.93 15.28 9.09 5.11 2.85 1.50 0.93
    50.00 27.93 16.46 9.51 5.18 2.81 1.75 0.82

out.txt

This version of the code checks each value when the length is 1 or 2. Otherwise, the number of trials is lower, but the check is done 30 times for each length. This allows the mean and standard deviation of the results to be calculated, here with R:

    $ r-fu colsumm out.txt
            V1    V2         V3         V4         V5         V6         V7
    Count   30 30.00 30.0000000 30.0000000 30.0000000 30.0000000 30.0000000
    Range    0  0.00  1.7600000  1.1400000  1.0100000  0.6200000  0.4600000
    Mean    50 27.93 15.7496667  8.9846667  5.0646667  2.8093333  1.5923333
    Sdev     0  0.00  0.3648097  0.3219952  0.2227219  0.1508124  0.1064047
    Min.    50 27.93 14.7000000  8.3900000  4.6100000  2.4400000  1.3500000
    1st Qu. 50 27.93 15.5600000  8.8100000  4.8700000  2.7200000  1.5100000
    Median  50 27.93 15.7350000  8.9400000  5.1300000  2.8100000  1.5850000
    3rd Qu. 50 27.93 15.9600000  9.2100000  5.1900000  2.9200000  1.6600000
    Max.    50 27.93 16.4600000  9.5300000  5.6200000  3.0600000  1.8100000
                     V8
    Count   30.00000000
    Range    0.34000000
    Mean     0.92933333
    Sdev     0.08780595
    Min.     0.74000000
    1st Qu.  0.85000000
    Median   0.93500000
    3rd Qu.  0.98000000
    Max.     1.08000000

R wrapper for CLI use

Much less CPU, and still pretty good results. Again, someone with better statistical chops could make better use of the numbers here.

Length   P(valid UTF-8)   Monte Carlo   MCII Mean
-------------------------------------------------
     1            50  %      ~ 49.97%      50   %
     2          ~ 27.9%      ~ 27.95%    ~ 27.93%
     3          ~ 15.8%      ~ 15.79%    ~ 15.75%
     4          ~  8.9%      ~  8.93%    ~  8.98%

tags #feynman #utf8