💾 Archived View for thrig.me › blog › 2023 › 11 › 05 › unicode-no-match.gmi captured on 2024-08-18 at 18:08:57. Gemini links have been rewritten to link to archived content

Unicode No Match

    $ cat badmatch.pl
    #!/usr/bin/env perl
    use 5.36.0;
    use utf8;
    say "శ్లొకముల" =~ /శ్లోక/ ? "match" : "nomatch";
    $ perl badmatch.pl
    nomatch

badmatch.pl

Whether you can see the problem may depend on the font or software used. A terminal may clip characters because the cell is too small, or the font might be too small for someone to see the difference. You should probably have a tool that tells you what the various characters are.

    $ perl -nle 'print for /([^\000-\177]+)/g' badmatch.pl
    శ్లొకముల
    శ్లోక
    $ perl -nle 'print for /([^\000-\177]+)/g' badmatch.pl |
    > while read l; do whatchar $l; echo; done
    [శ] Lo U+0C36 TELUGU LETTER SHA
    [్] Mn U+0C4D TELUGU SIGN VIRAMA
    [ల] Lo U+0C32 TELUGU LETTER LA
    [ొ] Mn U+0C4A TELUGU VOWEL SIGN O
    [క] Lo U+0C15 TELUGU LETTER KA
    [మ] Lo U+0C2E TELUGU LETTER MA
    [ు] Mc U+0C41 TELUGU VOWEL SIGN U

    [శ] Lo U+0C36 TELUGU LETTER SHA
    [్] Mn U+0C4D TELUGU SIGN VIRAMA
    [ల] Lo U+0C32 TELUGU LETTER LA
    [ో] Mn U+0C4B TELUGU VOWEL SIGN OO
    [క] Lo U+0C15 TELUGU LETTER KA

See the difference yet? With a big font and some practice reading the language in question this might have been a lot easier. A terminal with a big font is good to keep around, if you're a terminal dweller like I am. And with encoding issues, screenshots in addition to text can be a good thing.
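The whatchar tool used above is not shown here. A rough stand-in can be sketched with core Perl modules only; the describe function and the exact output layout are assumptions made for illustration, not the real whatchar. Unlike the output above, it prints every character, including repeats.

```perl
#!/usr/bin/env perl
# Hypothetical whatchar-like sketch; not the tool used in the session above.
use v5.36;
use Unicode::UCD qw(charinfo);

# Describe one character as "[x] Gc U+XXXX NAME".
sub describe ($char) {
    my $cp   = ord $char;
    my $info = charinfo($cp);    # undef when the code point is unassigned
    my $gc   = $info ? $info->{category} : 'Cn';
    my $name = $info ? $info->{name}     : '<unassigned>';
    return sprintf '[%s] %s U+%04X %s', $char, $gc, $cp, $name;
}

binmode STDOUT, ':encoding(UTF-8)';
for my $arg (@ARGV) {
    utf8::decode($arg);          # assumes the arguments are UTF-8 encoded
    say describe($_) for split //, $arg;
}
```

Saved under some name of your choosing, it could replace whatchar in the loop above.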

bigterm.png

One might still miss the little extra fiddly bit? The string has U+0C4A TELUGU VOWEL SIGN O where the pattern has U+0C4B TELUGU VOWEL SIGN OO.
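If eyeballing fails, the comparison can be left to the computer. The following is a minimal sketch, not anything from the original script: first_diff is a made-up helper, and the two strings are spelled out as escapes so that no editor or paste buffer can quietly change them.

```perl
#!/usr/bin/env perl
use v5.36;
use charnames ();

# Index of the first code point where two strings differ, or -1 if
# they are equal or one is a prefix of the other.
sub first_diff ($x, $y) {
    my @x = map { ord($_) } split //, $x;
    my @y = map { ord($_) } split //, $y;
    my $n = @x < @y ? @x : @y;
    for my $i (0 .. $n - 1) {
        return $i if $x[$i] != $y[$i];
    }
    return -1;
}

# The string and the pattern from badmatch.pl, as code points.
my $text    = "\x{0C36}\x{0C4D}\x{0C32}\x{0C4A}\x{0C15}\x{0C2E}\x{0C41}\x{0C32}";
my $pattern = "\x{0C36}\x{0C4D}\x{0C32}\x{0C4B}\x{0C15}";

my $i = first_diff($text, $pattern);
if ($i >= 0) {
    for my $s ($text, $pattern) {
        my $cp = ord substr $s, $i, 1;
        printf "index %d: U+%04X %s\n", $i, $cp, charnames::viacode($cp) // '?';
    }
}
```

This reports the difference at index 3, the vowel sign that follows LA.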

Do note that some copy and paste code may modify (normalize) strings, which can complicate debugging. Maybe urxvt does this? It is something to be aware of if you are copying and pasting the data into a debugging tool. When matching on Unicode you should probably normalize things first: the same string can be represented in several different ways, and code that expects only one of those forms will match less than it should.
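What normalizing first might look like, as a sketch using Unicode::Normalize: the nfc_contains helper and the "café" test strings are invented for the example, and NFC is only one of the available forms.

```perl
#!/usr/bin/env perl
use v5.36;
use Unicode::Normalize qw(NFC);

my $composed   = "caf\x{E9}";      # e-acute as a single code point
my $decomposed = "cafe\x{301}";    # e followed by U+0301 COMBINING ACUTE ACCENT

# Hypothetical helper: substring test after normalizing both sides to NFC.
sub nfc_contains ($haystack, $needle) {
    return index(NFC($haystack), NFC($needle)) >= 0;
}

# Same text to a human, different code points to the regex engine.
say $decomposed =~ /\Q$composed\E/          ? "match" : "nomatch";
say nfc_contains($decomposed, $composed)    ? "match" : "nomatch";
```

The first line prints nomatch and the second match. Normalization does nothing for the Telugu strings, though: VOWEL SIGN O and VOWEL SIGN OO are different characters, not two spellings of the same one.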

Another reminder is that filenames on unix are just bytes, any old bytes whatsoever, with some special rules for '\0' and '/' and the reserved names "." and "..". This means filenames could use several different encodings, or may even be random data with maybe only accidental encoding to them. I usually restrict filenames to a small subset of ASCII, because who wants that sort of excitement in their life?
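To see how much excitement a directory holds, something like the following sketch could help. The classify function, its category labels, and the particular "safe" subset of ASCII are all assumptions chosen for the example.

```perl
#!/usr/bin/env perl
use v5.36;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Classify a raw filename; readdir hands back bytes, not characters.
sub classify ($bytes) {
    return 'safe-ascii' if $bytes =~ /\A[A-Za-z0-9._-]+\z/;
    return 'ascii'      if $bytes !~ /[^\x00-\x7F]/;
    # Strict decode: dies on anything that is not well-formed UTF-8.
    my $ok = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC); 1 };
    return $ok ? 'utf-8' : 'other-bytes';
}

my $dir = shift // '.';
opendir my $dh, $dir or die "opendir $dir: $!\n";
for my $name (sort readdir $dh) {
    next if $name eq '.' or $name eq '..';
    # Escape anything outside printable ASCII so the terminal stays sane.
    (my $shown = $name) =~ s/([^\x20-\x7E])/sprintf '\\x%02X', ord $1/ge;
    printf "%-11s %s\n", classify($name), $shown;
}
closedir $dh;
```

A name that decodes as UTF-8 may still have been written in some other encoding by accident; the check only says the bytes are well-formed.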

/blog/2023/03/14/ascii.gmi

/blog/2023/10/01/funky-filenames.gmi

https://metacpan.org/pod/Unicode::Normalize