💾 Archived View for thrig.me › blog › 2023 › 10 › 01 › funky-filenames.gmi captured on 2024-12-17 at 10:04:24. Gemini links have been rewritten to link to archived content

-=-=-=-=-=-=-

Funky Filenames

Filenames on unix can contain, well, almost anything. With an empty directory (so you can delete the entire directory if you get in over your head) create some filenames with a bunch of whatever in them. This may fail, e.g. when "/" is rolled. A forward slash is one of the two illegal characters; the other is the NUL byte, which chr(1 + rand 255) never produces.

    $ perl -e 'open $x, ">", join "", map chr(1 + rand 255), 1..8'
    $ perl -e 'open $x, ">", join "", map chr(1 + rand 255), 1..8'
    $ perl -e 'open $x, ">", join "", map chr(1 + rand 255), 1..8'
    $ perl -e 'open $x, ">", join "", map chr(1 + rand 255), 1..8'
    $ perl -e 'open $x, ">", join "", map chr(1 + rand 255), 1..8'
    $ ls
    EC????b? M???=}?? X?K?H&?? c???W??? ??x???B?
    $ find . -ls
    10863654    8 drwxr-xr-x    2 jmates   jmates        512 Sep 30 14:47 .
    10863655    0 -rw-r--r--    1 jmates   jmates          0 Sep 30 14:47 ./xB
    10865278    0 -rw-r--r--    1 jmates   jmates          0 Sep 30 14:47 ./cW
    10865347    0 -rw-r--r--    1 jmates   jmates          0 Sep 30 14:47 ./Eb
    10865582    0 -rw-r--r--    1 jmates   jmates          0 Sep 30 14:47 ./M=}
    10865583    0 -rw-r--r--    1 jmates   jmates          0 Sep 30 14:47 ./XK     H&

Note how find(1) shows the inode number. This lets you target that file without naming it. If this is a deep filesystem tree, restrict the search to the current directory (the second find below does so with -maxdepth 1) so that find does not waste time searching through subdirectories for the inode.

    $ find . -inum 10863655 -exec sh -c 'mv "$0" blah' {} \;
    $ ls
    EC????b? M???=}?? X?K?H&?? blah     c???W???
    $ find . -inum 10863655 -maxdepth 1
    ./blah

The same inode number, by the way, can appear under multiple names on the same filesystem. That's what a hard link is. If find turns up a match you did not expect, it may instead be an unrelated file on a different filesystem that happens to have the same inode number, in which case the -xdev flag restricts find to the filesystem it started on.
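
A minimal sketch of both points, run against a throwaway directory. The names inode_of, orig, and alias are made up for illustration, and mktemp -d is assumed to be available (it is not POSIX, though most systems have it).

```shell
#!/bin/sh
# Sketch: a hard link is a second name for one inode.
inode_of() {
    # The first field of ls -i output is the inode number.
    ls -di -- "$1" | awk 'NR == 1 { print $1 }'
}

dir=$(mktemp -d) || exit 1
: > "$dir/orig"                  # create an empty file
ln "$dir/orig" "$dir/alias"      # hard link, not a copy

inum=$(inode_of "$dir/orig")
# -xdev keeps find on the filesystem it started on, so the same number
# on some other mounted filesystem is never visited.
find "$dir" -xdev -inum "$inum"  # lists both orig and alias

rm -rf "$dir"
```

Both names show up because they share the inode; removing one leaves the other intact.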

One could also glob the files and let some other program rename them, to again avoid having to input tricky names on the command line:

    $ ls *
    EC????b? M???=}?? X?K?H&?? c???W???
    $ perl -e 'rename $_, ++$x for @ARGV' *
    $ ls
    1 2 3 4
    $ ls -il
    10865347 -rw-r--r--  1 jmates  jmates  0 Sep 30 14:47 1
    10865582 -rw-r--r--  1 jmates  jmates  0 Sep 30 14:47 2
    10865583 -rw-r--r--  1 jmates  jmates  0 Sep 30 14:47 3
    10865278 -rw-r--r--  1 jmates  jmates  0 Sep 30 14:47 4

Note how the rename(2) call does not change the inode. mv(1) uses rename(2) internally, if the source and destination files are on the same filesystem.

If you really do want the filenames, then you'll need a program that does not do anything funky with them. Some programming languages (Tcl comes to mind) or tools insist on treating filenames as if they have an encoding, which on unix they may not, because filenames on unix can contain most anything.

    $ rm *
    $ perl -e 'open $x, ">", join "", map chr(1 + rand 255), 1..8'
    $ perl -e 'open $x, ">", join "", map chr(1 + rand 255), 1..8'
    $ perl -e 'open $x, ">", join "", map chr(1 + rand 255), 1..8'
    $ perl -e 'open $x, ">", join "", map chr(1 + rand 255), 1..8'
    $ perl -e 'open $x, ">", join "", map chr(1 + rand 255), 1..8'
    $ ls -li
    total 0
    10865347 -rw-r--r--  1 jmates  jmates  0 Sep 30 15:05 W???`?*g
    10865582 -rw-r--r--  1 jmates  jmates  0 Sep 30 15:05 juA????4
    10865278 -rw-r--r--  1 jmates  jmates  0 Sep 30 15:05 l???V:3r
    10865583 -rw-r--r--  1 jmates  jmates  0 Sep 30 15:05 ???嗢?h
    10863655 -rw-r--r--  1 jmates  jmates  0 Sep 30 15:05 ??0>? ??
    $ whatchar 嗢
    [嗢] Lo U+55E2 CJK UNIFIED IDEOGRAPH-55E2

Accidental Chinese! Something. Assuming UTF-8. Other encodings may vary. So what do the filenames contain, exactly? Let's print their contents as hex. Notice how ls(1) sorts the filenames while the glob does not, if you were expecting the inode numbers below to line up with those above.

    $ perl -e 'printf "%d %vx\n", (stat($_))[1], $_ for glob "*"'
    10865583 8e.ea.d7.e5.97.a2.da.68
    10863655 ef.bc.30.3e.c2.20.c.8e
    10865582 6a.75.41.a6.f2.a8.7f.34
    10865278 6c.d.11.96.56.3a.33.72
    10865347 57.9c.ec.89.60.b3.2a.67

If there are too many files for glob to deal with you can also use opendir/readdir (or even fts_open(3) or other such recursive options) but that's more typing. Typing in such names manually might still be tricky, which is why I'd usually first reach for the "find the inode number and rename that" method, or "mass rename the files with glob or opendir/readdir". (Glob is slow, if you have like a bazillion files to deal with.) There are various programs that can assist with mass file renames.
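
For the opendir/readdir route, something like the following might do. This is a sketch, not a polished tool: the function name renumber_dir is made up, perl is assumed to be installed, and the code reads all the names before renaming anything so that the renames cannot confuse the directory scan.

```shell
#!/bin/sh
# Sketch: rename every regular file in a directory to 1, 2, 3, ...
# using opendir/readdir instead of glob. No name is ever typed in.
renumber_dir() {
    perl -e '
        my $dir = shift;
        opendir my $dh, $dir or die "opendir $dir: $!\n";
        # Slurp the names first; renaming mid-scan can confuse readdir.
        my @names = grep { -f "$dir/$_" } readdir $dh;
        closedir $dh;
        my $n = 0;
        for my $old (@names) {
            my $new;
            do { $new = "$dir/" . ++$n } while -e $new;   # never clobber
            rename "$dir/$old", $new or warn "rename failed: $!\n";
        }
    ' "$1"
}

# Usage, on a directory you can afford to lose:
#   renumber_dir /path/to/scratch-dir
```

Slurping the names costs memory with a bazillion files; a more careful version could rename into a second directory while scanning the first.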

Maybe your keyboard input layer has a means to punch in arbitrary hex codes; otherwise, control+v (or whatever "literal next" is bound to) might let you type in some characters outside the printable range. Probably you'll instead want to use code that doesn't need the filename manually typed in, as shown above.

    $ stty -a | fgrep lnext
            erase = ^?; intr = ^C; kill = ^U; lnext = ^V; min = 1; quit = ^\;
    $ file *
    W`*g: empty
    juA4: empty
    V:3r: empty
    h: empty
    0>
       : empty
    $ perl -e 'print join "", map {chr hex} split /[.]/,shift' \
      ef.bc.30.3e.c2.20.c.8e | hexdump -C
    00000000  ef bc 30 3e c2 20 0c 8e                           |..0>. ..|
    00000008
    $ file "`perl -e 'print join "", map {chr hex} split /[.]/,shift' ef.bc.30.3e.c2.20.c.8e`"
    0>
       : empty

At some point you may want a standalone script instead of a too-long line laced with awkward shell quoting. Also note how various unix programs botch the funky filenames.
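
As a sketch of what such a script could look like, here is the hex-to-bytes step as a shell function. The name hex_to_bytes is made up, and the code assumes a printf that accepts 0x-prefixed numbers and octal escapes in the format, which POSIX printf should.

```shell
#!/bin/sh
# Sketch: turn dotted hex (as printed by perl's %vx) back into raw bytes.
hex_to_bytes() {
    # Split "ef.bc.30" on dots. The unquoted $(...) is deliberate: the
    # pieces are plain hex digits with nothing for the shell to mangle.
    for h in $(printf '%s' "$1" | tr '.' ' '); do
        # 0xNN -> three octal digits -> one byte via a printf escape.
        printf "\\$(printf '%03o' "0x$h")"
    done
}

# Check the bytes before trusting them as a filename.
hex_to_bytes ef.bc.30.3e.c2.20.c.8e | od -An -tx1

# Command substitution strips trailing newlines, so a name that ends
# in 0a cannot be captured like this.
name=$(hex_to_bytes ef.bc.30.3e.c2.20.c.8e)
if [ -e "./$name" ]; then
    ls -li -- "./$name"
fi
```

The leading "./" guards against a name that happens to start with a dash.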

I'd recommend avoiding fancy characters in filenames; there was (is?) a longstanding bug in Subversion on Mac OS X related to Unicode: what happens when the same filename gets normalized differently by different Unicode libraries? Fancy here includes such characters as space, newline, CR, and many others in ASCII. These can cause problems for shell scripts. Usually I confine filenames to "A-Za-z0-9_.-" or maybe less as I'm not a fan of having to press the shift key too much.
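
A quick way to spot names that stray outside that set, as a rough sketch. The function name is_plain_name is made up, and the pattern is only the "A-Za-z0-9_.-" set from above, not a complete safety check.

```shell
#!/bin/sh
# Sketch: succeed only for names made purely of A-Za-z0-9_.-
is_plain_name() (
    # Subshell body, so the locale change does not leak out.
    # LC_ALL=C makes A-Z and friends mean exactly the ASCII ranges.
    LC_ALL=C
    case $1 in
        '' | *[!A-Za-z0-9_.-]*) exit 1 ;;
    esac
    exit 0
)

# List the offenders in the current directory. ls -q prints "?" for
# unprintable bytes so the terminal is not messed up.
for f in ./*; do
    [ -e "$f" ] || continue
    is_plain_name "${f#./}" || ls -dq -- "$f"
done
```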

There are uses for filenames with whatever in them, mostly as a way to practice and to see how your tools and code deal with them.

A fancy shell may be able to tab complete the funky filenames, in which case you may not need other tooling to deal with problematic filenames. Or you could click on things in a GUI, if there is one?

Formal Testing

Probably you will want files that exercise every possible character; random filenames are good but may miss specific characters problematic to some bit of code.

    $ perl -e 'open $f, ">", join "", map chr, 1..46, 48..255'
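
One giant filename makes it hard to tell which character broke things, so a set of files with one interesting byte each might also help. A sketch, with the function name make_byte_files and the a...b naming scheme both made up for illustration; some filesystems reject names that are not valid UTF-8, and case-insensitive ones will merge a few names.

```shell
#!/bin/sh
# Sketch: one empty file per byte value, each named a<byte>b.
make_byte_files() {
    target=$1
    i=1                                # 0 (NUL) cannot appear in a name
    while [ "$i" -le 255 ]; do
        if [ "$i" -ne 47 ]; then       # 47 is "/", the other illegal byte
            # The a...b wrapper keeps a lone "." out of trouble and stops
            # $(...) from stripping a trailing newline (byte 10).
            name=$(printf "a\\$(printf '%03o' "$i")b")
            : > "$target/$name" || echo "byte $i rejected" >&2
        fi
        i=$((i + 1))
    done
}

dir=$(mktemp -d) || exit 1
make_byte_files "$dir"                 # attempts 254 names
ls -q "$dir" | wc -l                   # -q keeps the newline name on one line
rm -rf "$dir"
```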

Mojibake in filenames might also be good to have, as would mixed encodings, etc. The problem here is that you either need to avoid the shell entirely (e.g. use execve(2) instead of system(3)) or quote things properly for the shell.

    $ perl -E 'say quotemeta for glob("*")'
    \\\\\\\\        \
    \
     \
    \\\\\\\\\\\\\\\\\\ \!\"\#\$\%\&\'\(\)\*\+\,\-\.0123456789\:\;\<\=\>\?\@ABCDEFGHIJKLMNOPQRSTUVWXYZ\[\\\]\^_\`abcdefghijklmnopqrstuvwxyz\{\|\}\~\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
    $ perl -E 'say quotemeta for glob("*")' | od -bc | sed 4q
    0000000  134 001 134 002 134 003 134 004 134 005 134 006 134 007 134 010
               \ 001   \ 002   \ 003   \ 004   \ 005   \ 006   \ 007   \  \b
    0000020  134 011 134 012 134 013 134 014 134 015 134 016 134 017 134 020
               \  \t   \  \n   \ 013   \  \f   \  \r   \ 016   \ 017   \ 020
    $ perl -E 'say +(stat($_))[1] for glob "*"'
    10865278
    $ find . -inum 10865278 -exec sh -c 'mv "$0" blah' {} \;
    $ ls
    blah
    $ rm blah
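
The quoting problem mentioned above is easy to see with nothing fancier than a space. A small sketch; count_args is a made-up helper that reports how many arguments it received.

```shell
#!/bin/sh
# Sketch: the same name handed to a command in different ways.
count_args() { echo "$#"; }

name='Macintosh HD'          # one name, one embedded space

count_args $name             # unquoted: the shell splits it, prints 2
count_args "$name"           # quoted: stays whole, prints 1

# Pasting the name into a command string, as system(3) with a single
# string does, makes the shell parse it all over again.
sh -c "set -- $name; echo \$#"      # prints 2

# Passing the name as an argument keeps it out of the parsed text.
sh -c 'echo "$#"' sh "$name"        # prints 1
```

The find commands above use the last form: the filename arrives in "$0" and is never part of the text the shell parses.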

tags #unix

P.S. I recall maybe it was an iTunes installer that deleted lots of files because of the old Apple habit of naming things "Macintosh HD". Whoops! Failure of programming, failure of QA, failure of education…

P.P.S. If you want to get at the inodes without going through the filesystem tree this usually requires a very low-level tool that knows how to find the inodes on the underlying media. Look for debugging and filesystem recovery tools.