Invisibles

gemini://ew.srht.site/en/2024/20241216-bad-hypen-breaks-man-page.gmi

I set ASCII output by default in man(1) to avoid fancy characters. Maybe the man pages are less pretty, but * for a bullet works just as well as •, which my vi helpfully shows as \xe2\x80\xa2. No, I don't do much with fancy characters in the terminal.

    $ grep -i asc ~/.kshrc
    alias man='/usr/bin/man -T ascii'

Ideally command line examples (and option flags and anything else one might reasonably copy and paste) would not be gussied up with Unicode or whatever, unless they actually do contain fancy characters. Another problem is various systems helpfully replacing quotes with smart quotes, which may not work too well in the shell and may slow down support, as you have to determine whether the person actually typed smart quotes at the terminal, or whether the quotes were only mangled by their far too clever email program when they emailed you for help. But hey, that suavity looks good on you, Bob!
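
One way to settle the "did they really type that?" question is to list every non-ASCII character in the text as received. A minimal sketch in Python (not the Perl used elsewhere on this page); the function name `suspicious_chars` and the sample command are made up for illustration:

```python
import unicodedata

def suspicious_chars(command: str) -> list[tuple[int, str, str]]:
    """Return (offset, character, Unicode name) for each non-ASCII character."""
    found = []
    for offset, char in enumerate(command):
        if ord(char) > 127:
            # Fall back to "UNKNOWN" for characters without a Unicode name.
            found.append((offset, char, unicodedata.name(char, "UNKNOWN")))
    return found

# Hypothetical command as it might arrive in an email, with smart quotes.
pasted = "grep \u201cfoo bar\u201d file.txt"
for offset, char, name in suspicious_chars(pasted):
    print(f"offset {offset}: {char!r} is {name}")
```

This only reports what is in the text you were sent; it cannot say whether the mangling happened at the terminal or in the mail program.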

Others call for more Unicode everywhere, which brings a fistful of complexity (the Spaghetti Terminal, complete with American acting, Italian directing, and West German funding) and a great potential for bugs and security issues, for maybe not much gain down at the system level. Someone even tried to support all-numeric usernames, which opened up cans of worms all across the codebase (“if it's all numbers, it must be a userid, until someone with the best of intentions broke that promise”), and there was a fun comment of “they put the fancy character in my name into /etc/passwd, and then nobody could log in because the login manager crashed when it encountered that character”. Luckily, ssh still worked. At the system level one may want punycode, and then higher level things can hopefully do the right thing with that? Or just stick to ASCII.

The space versus tab problem is well known, especially in sendmail.cf or Makefile; good thing we've learned that significant whitespace is a design error since those tools were written. Some editors do have means to show such invisible characters, or there's always od(1), process tracing, or other such debugging tools to show exactly what is going on. The humble carriage return can also be problematic, as it moves the cursor back to the start of the line, so whatever is printed next overwrites what was there. A slow terminal would better show what is going on, as a human could maybe track the reset. Or some terminals have the option to log everything, which may be handy if you're developing something down at the terminal or curses level.

    $ perl -E 'say "foo\rbar"'
    bar
    $ perl -E 'say "foo\rbar"' | od -bc
    0000000  146 157 157 015 142 141 162 012
               f   o   o  \r   b   a   r  \n
    0000010
    $ perl -E 'say "foo\e[0Gbar"'
    bar
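
If od(1) is not to hand, the same idea is easy to approximate: print every byte, spelling out the ones a terminal would act on instead of display. A rough Python sketch; `show_invisibles` and its escape notation are choices made here, not any standard tool's format:

```python
def show_invisibles(data: bytes) -> str:
    """Render bytes as text, spelling out control and non-ASCII bytes."""
    names = {0x09: "\\t", 0x0A: "\\n", 0x0D: "\\r", 0x1B: "\\e", 0x5C: "\\\\"}
    out = []
    for byte in data:
        if byte in names:
            out.append(names[byte])          # well-known escapes
        elif 0x20 <= byte < 0x7F:
            out.append(chr(byte))            # printable ASCII passes through
        else:
            out.append(f"\\x{byte:02x}")     # everything else as hex
    return "".join(out)

print(show_invisibles(b"foo\rbar\n"))       # foo\rbar\n
print(show_invisibles(b"foo\x1b[0Gbar\n"))  # foo\e[0Gbar\n
```

Both of the inputs that displayed as a bare "bar" now show the "foo" and the control characters that hid it.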

Lots of terminals use XTerm Control Sequences, or support at least some of them. Some terminals even had escape sequences that allowed arbitrary code to be executed (it might have been Eterm?). And did you remember to check what you copied off some CSS- and JavaScript-infested website before pasting it into a terminal? No? Well, it's probably okay. Most of the time.
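
For the paranoid, copied text can be scrubbed before it gets anywhere near a shell. The Python sketch below removes CSI sequences (ESC [ followed by parameter, intermediate, and final bytes) and then any remaining control characters other than tab and newline. It is an illustration under those assumptions, not a complete filter: other sequence types, such as OSC, only lose their control bytes and leave their payload behind as visible text. The name `sanitize` is made up.

```python
import re

# CSI: ESC [ + parameter bytes 0x30-0x3F + intermediate bytes 0x20-0x2F
# + one final byte 0x40-0x7E.
CSI = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]")
# Remaining C0 controls and DEL, keeping tab (0x09) and newline (0x0A).
CONTROL = re.compile(r"[\x00-\x08\x0b-\x1f\x7f]")

def sanitize(text: str) -> str:
    """Drop CSI escape sequences, then stray control characters."""
    text = CSI.sub("", text)
    return CONTROL.sub("", text)

print(sanitize("foo\x1b[0Gbar"))  # foobar
print(sanitize("foo\rbar"))       # foobar
```

Text that was hidden by a cursor movement ends up visible, which is the point: you get to read all of it before deciding to run it.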

A security system once normalized XML to prevent data leaks across a payments boundary, but neglected to handle whitespace. This would make it easy to exfiltrate data, as one could tab-or-space binary encode whatever, and send that along with the rest of the XML. Someone might notice if they knew about this trick, or if you encoded too much whitespace and thus triggered a “file is larger than expected?” warning.

    #!/usr/bin/env perl
    use 5.36.0;

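    # Encode each byte of $string as eight whitespace characters:
    # a space for a 0 bit and a tab for a 1 bit.
    sub encode($string) {
        # work on bytes rather than characters
        use bytes;
        my $enc = '';
        #printf "%vx\n", $string;
        for my $c ( split '', $string ) {
            # "%08b" gives the eight bits; tr///r returns the translated copy
            $enc .= ( sprintf "%08b", ord $c ) =~ tr/01/ \t/r;
        }
        return $enc;
    }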

    say encode("test");
    say encode("\N{BULLET}");
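
The receiving end is just as short. Here is a sketch of a matching decoder, written in Python rather than Perl, with an `encode` alongside that mirrors the one above so the round trip can be checked. It assumes the smuggled whitespace has already been separated from whatever whitespace the XML legitimately contains, which is the hard part in practice and is not attempted here.

```python
def encode(data: bytes) -> str:
    """Each byte becomes eight characters: space for 0, tab for 1."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return bits.translate(str.maketrans("01", " \t"))

def decode(whitespace: str) -> bytes:
    """Reverse of encode(); anything that is not a space or tab is ignored."""
    bits = "".join("1" if ch == "\t" else "0" for ch in whitespace if ch in " \t")
    if len(bits) % 8:
        raise ValueError("whitespace run is not a whole number of bytes")
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

hidden = encode("test \u2022".encode("utf-8"))
print(len(hidden))                     # 64: eight bytes, eight characters each
print(decode(hidden).decode("utf-8"))  # test •
```

The eightfold size increase is why encoding too much at once risks tripping a file size check.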

Alas, I am neither as motivated nor as money-oriented as computer programmer Gus Gorman.