💾 Archived View for thrig.me › tech › encoding.gmi captured on 2023-07-22 at 17:36:52. Gemini links have been rewritten to link to archived content


Encoding

If the terminals and everything else have been set to UTF-8, it can be difficult to test other encodings. This document explores setting the terminal and other applications to use different encodings.

A glance at the manual for xterm(1) may show multiple encoding options.

    -en encoding
            This option determines the encoding on which xterm runs. It sets
            the locale resource. Encodings other than UTF-8 are supported by
            using luit. The -lc option should be used instead of -en for
            systems with locale support.
    ...
    -lc     Turn on support of various encodings according to the users'
            locale setting, i.e., LC_ALL, LC_CTYPE, or LANG environment
            variables. This is achieved by turning on UTF-8 mode and by
            invoking luit for conversion between locale encodings and UTF-8.
-- xterm(1) on OpenBSD 7.2

So -lc is recommended, but that performs a UTF-8 conversion. What if there's a bug in whatever luit is, especially if I'm feeding buggy inputs and do not really know what I'm doing? Let's try -en instead. A good practice is to supply some hopefully invalid value. If anything this will test how well the program screens for bad inputs.

    $ xterm -en lskadjflksjlfj &

This results in the message "Warning: couldn't find charset data for locale lskadjflksjlfj; using ISO 8859-1." which is good, as ISO 8859-1 is what we want to work with. When debugging encoding problems, screenshots are fairly important, in addition to showing the code and hex dumps of the output.

encoding/xterm-warning.png

    $ xterm -en iso-8859-1 &

This yields no warning, though "latin1" will. There are various names for various encodings that may or may not be supported. iconv(1) for instance does support something called LATIN1 but xterm warns on "latin1" or "LATIN1". Probably one could delve through the code for xterm to see what is going on, but that could be a fairly deep and irrelevant rabbit hole.

    $ iconv -l | grep -i latin1
    CP819 IBM819 ISO-8859-1 ISO-IR-100 ISO8859-1 ISO_8859-1 ISO_8859-1:1987 L1 LATIN1 CSISOLATIN1
    ISO-8859-16 ISO-IR-226 ISO8859-16 ISO_8859-16 ISO_8859-16:2001 L10 LATIN10
    RISCOS-LATIN1
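
Since names vary between tools, it may be quicker to ask a tool directly whether it accepts a name than to read through alias lists. A minimal sketch, assuming an iconv(1) that fails before reading any input when it does not know an encoding; iconv_knows is a made-up helper name, and the answers will depend on the local iconv.

```shell
# iconv_knows is a hypothetical helper, not part of any package.
# It reports whether the local iconv(1) accepts an encoding name.
iconv_knows() {
    # Convert empty input; only the exit status matters here.
    if iconv -f "$1" -t UTF-8 </dev/null >/dev/null 2>&1; then
        echo "$1: accepted"
    else
        echo "$1: rejected"
    fi
}

# Try the names from above, plus the deliberately bogus one.
for name in ISO-8859-1 LATIN1 lskadjflksjlfj; do
    iconv_knows "$name"
done
```

This only says what iconv thinks of a name; xterm keeps its own list, as the "latin1" warning shows.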

An alternative to xterm(1) would be to write software that displays text; this may have the advantage of containing much less code than xterm does, so in theory any problems should be easier to debug. Another advantage of multiple implementations is that bugs may be more apparent--if one program but not another misbehaves, then the problem is likely in that one program. With only one program, who knows if what it is doing is expected?

    $ doas pkg_add tk
    ...
    $ cat show-encoding
    #!/usr/bin/env wish8.6

    puts "default: [encoding system]"
    encoding system iso8859-1
    puts "new: [encoding system]"

    font create fff -family Times -size 32
    pack [label .xxx -font fff -textvariable display]
    bind . <q> exit

    set fh [open input]
    gets $fh display
    close $fh

    vwait godot
    $ cat input
    Bchar
    $ wish8.6 show-encoding
    default: utf-8
    new: iso8859-1

encoding/show-encoding.png

Various fonts may not be able to display various symbols, possibly replacing them with empty blocks. The "noto"--no tofu--fonts may be good to test with.

Note that the terminal above was configured for UTF-8; the input file was displayed as "Bchar", which disagrees with the actual content of the file. This is where a hex dumper can be of service to show the actual contents.

    $ file input
    input: ISO-8859 text
    $ od -t cx1 input
    0000000    B 374   c   h   a   r  \n
              42  fc  63  68  61  72  0a
    0000007
    $ iconv -f iso-8859-1 -t utf-8 input
    Büchar
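
file(1) makes a guess at the encoding. A stricter check is to ask iconv whether the bytes decode as UTF-8 at all. The following is a sketch, assuming an iconv(1) that exits non-zero on an invalid sequence and a mktemp(1) that works without arguments; is_utf8 is a made-up helper name.

```shell
# is_utf8 is a hypothetical helper: it succeeds when the file
# decodes cleanly as UTF-8 and fails otherwise.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

tmp=$(mktemp) || exit 1

# The ISO-8859-1 bytes shown by od above: a lone 0xfc is not UTF-8.
printf 'B\374char\n' > "$tmp"
if is_utf8 "$tmp"; then echo "valid UTF-8"; else echo "not valid UTF-8"; fi

# The same word with the u-umlaut encoded as 0xc3 0xbc.
printf 'B\303\274char\n' > "$tmp"
if is_utf8 "$tmp"; then echo "valid UTF-8"; else echo "not valid UTF-8"; fi

rm -f "$tmp"
```

A pass only means the bytes are well-formed UTF-8, not that UTF-8 is what the author of the file intended.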

This page is written in UTF-8, so the ü above should display correctly here; even so, with encoding problems screenshots will help verify what is going on.

encoding/some-commands.png

Suitable test data must be obtained; the B\xc3\xbcchar came from the test code for the Perl URI module. Look for data with bytes in the 128-255 range, as this so-called "extended ASCII" space is used for different things by different encodings. Other useful characters may be those in the four-byte range for UTF-8, which some software may have trouble with, e.g. older versions of MySQL, or where database tables still use utf8mb3.
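
Test data like this can also be produced by hand. A minimal sketch using octal escapes, since POSIX printf(1) does not promise \x escapes; the function names are made up for this example, and U+1F4BE is simply one character that needs four bytes in UTF-8.

```shell
# Hypothetical helpers that emit known byte sequences for testing.
latin1_sample()   { printf 'B\374char\n'; }          # 0xfc is u-umlaut in ISO-8859-1
utf8_sample()     { printf 'B\303\274char\n'; }      # 0xc3 0xbc is u-umlaut in UTF-8
fourbyte_sample() { printf '\360\237\222\276\n'; }   # U+1F4BE as f0 9f 92 be

# Show the raw bytes rather than trusting what the terminal draws.
latin1_sample   | od -An -t x1
utf8_sample     | od -An -t x1
fourbyte_sample | od -An -t x1
```

Redirect any of these to a file to get an input for the programs under test.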

ASCII

xterm started with -en iso8859-1 is more likely to display the contents of the input file correctly, though this in turn depends on the program used.

encoding/iso8859-1-xterm.png

    $ cat input
    Büchar
    $ less input
    B<FC>char
    $ locale
    LANG=en_US.UTF-8
    ...

The locale is set to UTF-8; this will need to be rectified to use ISO-8859-1, and there may be additional configuration of database tables and so forth to use a particular encoding. All the layers will need to be set correctly, which is probably why most folks set everything to UTF-8 and then forget about it.

encoding/perl-encoding.png

Also be sure to export LANG and other LC_* environment variables so that other programs can actually see those variables. This is usually done from a shell rc file.

    $ unset LANG
    $ LANG=en_US.UTF-8
    $ perl -E 'say $ENV{LANG}'

    $ LANG=foo perl -E 'say $ENV{LANG}'
    foo
    $ echo $LANG
    en_US.UTF-8
    $ export LANG
    $ perl -E 'say $ENV{LANG}'
    en_US.UTF-8
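
The same check can be done without Perl. A small sketch, assuming env(1) with no arguments prints the environment a child process would receive; child_sees and DEMO_LANG are made-up names, and the locale string is only a placeholder, as locale names vary by system.

```shell
# child_sees is a hypothetical helper: it reports whether a new
# process would inherit the named variable from this shell.
child_sees() {
    # env(1) is itself a child process, so it lists only exported variables.
    if env | grep -q "^$1="; then
        echo "$1 is exported"
    else
        echo "$1 is not exported"
    fi
}

unset DEMO_LANG
DEMO_LANG=en_US.ISO8859-1   # set in this shell only; placeholder value
child_sees DEMO_LANG
export DEMO_LANG            # now visible to child processes
child_sees DEMO_LANG
```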

Not a large roadblock, but many such small roadblocks along the way will not help.

all the files

Back to tech index

tags #debug #perl #tcl #encoding