💾 Archived View for thrig.me › blog › 2024 › 03 › 01 › nul-in-filename.gmi captured on 2024-12-17 at 10:09:01. Gemini links have been rewritten to link to archived content

Non-Terminal '\0' in Filenames

Generally '\0', the NUL byte, cannot appear in the middle of a filename on unix. If it does, there is probably filesystem damage, or an error in the code. A filename with a non-terminal '\0' in it will be tricky to get through various functions, as NUL is the C string terminator.

Note that NUL or '\0' isn't exactly printable, so we have had to invent many other means of expressing it; "\x00" and "\000" and "NUL" and "'\0'" and doubtless you can find more. If you did have a filesystem that did support NUL inside filenames, you would need to ensure that '\0' and other such characters would be encoded in some form, could survive round-tripping through URLification or other serializations, and so forth. Probably it's easier to limit what characters can appear in a filename?
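One way to picture the "encoded in some form" requirement is percent-encoding, as used in URLs. What follows is only a sketch: the function name encode_name and the choice of which bytes to escape are assumptions made for illustration, not anything a real filesystem does. The length is passed in explicitly, since strlen() would stop at the first NUL.

```c
#include <stddef.h>

/*
 * Hypothetical helper: percent-encode len bytes from src into dst.
 * NUL, '%', '/', space, and anything outside printable ASCII become %XX.
 * Returns the number of bytes written, not counting the terminating NUL,
 * or -1 if dst is too small.
 */
int encode_name(const unsigned char *src, size_t len, char *dst, size_t dstsize)
{
    static const char hex[] = "0123456789ABCDEF";
    size_t out = 0;

    for (size_t i = 0; i < len; i++) {
        unsigned char c = src[i];
        int plain = (c > 0x20 && c < 0x7f && c != '%' && c != '/');
        size_t need = plain ? 1 : 3;

        /* Leave room for this byte and the terminating NUL. */
        if (out + need + 1 > dstsize)
            return -1;

        if (plain) {
            dst[out++] = (char) c;
        } else {
            dst[out++] = '%';
            dst[out++] = hex[c >> 4];
            dst[out++] = hex[c & 0x0f];
        }
    }
    if (out + 1 > dstsize)
        return -1;
    dst[out] = '\0';
    return (int) out;
}
```

With this scheme the 11 bytes 'H' 'e' 'l' 'l' '\0' ' ' 'W' 'o' 'r' 'l' 'd' would come out as the ordinary C string "Hell%00%20World", which can then pass through C string functions intact. A matching decoder would be needed to get the original bytes back.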

Theoretically, a filename could have '\0' in the middle of it, as, at least on OpenBSD, the dirent structure in dirent.h contains a "d_namlen" field that holds the length of the filename:

    $ grep d_nam /usr/include/sys/dirent.h | sed 3q
            __uint8_t  d_namlen;            /* length of string in d_name */
            __uint8_t  __d_padding[4];      /* suppress padding after d_name */
            char    d_name[MAXNAMLEN + 1];  /* name must be no longer than this */
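As a rough sketch of how that length field could be put to use, the following compares d_namlen against strlen(d_name) for each entry in a directory. The function name count_namlen_mismatches is made up for this example. It assumes d_namlen exists on the BSDs and macOS; on other systems there is nothing to compare against, so the check trivially passes. Whether a damaged filesystem would even hand such an entry back through readdir(3) is another question, so treat this as an illustration and not a repair tool.

```c
#include <dirent.h>
#include <string.h>

/*
 * Hypothetical helper: count the entries in the directory at path whose
 * recorded name length disagrees with strlen(d_name), which would suggest
 * a NUL inside the name. Returns -1 if the directory cannot be opened.
 */
long count_namlen_mismatches(const char *path)
{
    DIR *dir = opendir(path);
    if (dir == NULL)
        return -1;

    long mismatches = 0;
    struct dirent *ent;
    while ((ent = readdir(dir)) != NULL) {
#if defined(__OpenBSD__) || defined(__FreeBSD__) || defined(__NetBSD__) || defined(__APPLE__)
        size_t recorded = ent->d_namlen;
#else
        /* Assumption: no d_namlen field here, so fall back to strlen(). */
        size_t recorded = strlen(ent->d_name);
#endif
        if (recorded != strlen(ent->d_name))
            mismatches++;
    }
    closedir(dir);
    return mismatches;
}
```

On a healthy filesystem a call such as count_namlen_mismatches(".") should return 0.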

However, if you try to print something with a NUL in the middle, or use it in some function that requires a C string, the filename is likely to be truncated:

    #include <stdio.h>

    char msg[] = {'H', 'e', 'l', 'l', '\0', ' ', 'W', 'o', 'r', 'l', 'd'};

    int main(void) {
        printf("%s\n", msg);
        int len = (int) (sizeof(msg) / sizeof(char));
        printf("%.*s (%d)\n", len, msg, len);
    }

hasnul.c

    $ CC=egcc CFLAGS=-g3 make hasnul && ./hasnul
    egcc -g3   -o hasnul hasnul.c
    Hell
    Hell (11)
    $ bt
    Reading symbols from hasnul...
    (gdb) l
    1       #include <stdio.h>
    2
    3       char msg[] = {'H', 'e', 'l', 'l', '\0', ' ', 'W', 'o', 'r', 'l', 'd'};
    4
    5       int
    6       main(void)
    7       {
    8               printf("%s\n", msg);
    9               int len = (int) (sizeof(msg) / sizeof(char));
    10              printf("%.*s (%d)\n", len, msg, len);
    (gdb) b 9
    Breakpoint 1 at 0x1a1b: file hasnul.c, line 9.
    (gdb) r
    Starting program: ./hasnul
    Hell

    Breakpoint 1, main () at hasnul.c:9
    9               int len = (int) (sizeof(msg) / sizeof(char));
    (gdb) p msg
    $1 = "Hell\000 World"

(The "bt" command is a custom utility that looks for the most recent core file or executable in a directory and loads that up in a debugger, as I often have a test directory full of executables and core files, for some reason or another.)

The point here is that the "msg" variable contains a '\0', and GDB, which knows the size of the array, shows every byte, while functions that operate on C strings will probably stop when they encounter a NUL character (or crash or who knows what, depending on the unix). Even the printf call with an explicit precision of 11 printed only "Hell", as "%.*s" sets a maximum length and does not carry printf past a NUL. So if there were a filename with NUL in it, you would need a tool that understands the structures involved, or you might end up using a hex viewer on the raw disk contents. As for how a unix misbehaves if there is filesystem damage, who knows. It would probably manifest as "weird issues". A way to test this would be to corrupt filenames on a test system and then see what happens, assuming you have the time for that; usually in production I've instead been trying to restore services rather than to debug in detail why the NetApp or Linux fileserver was exhibiting wacky behavior.

(The NetApp way back when had a special file on it that if you did a "ls" or "find" or whatever in the directory containing that file, the NetApp would crash. Whoops?)
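Back in hasnul.c, the bytes after the NUL can be printed if the code uses a function that takes a length instead of looking for a terminator. A minimal sketch with fwrite(3) follows; the helper name put_bytes is made up for this example.

```c
#include <stdio.h>

/*
 * Hypothetical helper: write all len bytes of buf to fp, then a newline.
 * Unlike printf("%s") or printf("%.*s"), fwrite() does not stop at a NUL.
 * Returns 0 on success, -1 on a short write or other output error.
 */
int put_bytes(FILE *fp, const char *buf, size_t len)
{
    if (fwrite(buf, 1, len, fp) != len)
        return -1;
    if (fputc('\n', fp) == EOF)
        return -1;
    return 0;
}
```

Calling put_bytes(stdout, msg, sizeof(msg)) in hasnul.c should emit all 11 bytes plus the newline. The NUL will not show up on most terminals, so pipe the output through od(1) to see it.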

What code knows the length of a buffer besides a debugger? Higher level languages do, as does code where the length is supplied by, say, a database or network protocol that includes a length field. Perl variables for example can contain "\0":

    $ perl -MDevel::Peek -e '$msg = "Hell\0 World"; Dump $msg'
    SV = PV(0xf376d2a4280) at 0xf380bf02a48
      REFCNT = 1
      FLAGS = (POK,IsCOW,pPOK)
      PV = 0xf376d2a8d20 "Hell\x00 World"\0
      CUR = 11
      LEN = 13
      COW_REFCNT = 1
    $ perl -E 'say "Hell\0 World"'
    Hell World
    $ perl -E 'say "Hell\0 World"' | od -bc
    0000000  110 145 154 154 000 040 127 157 162 154 144 012
               H   e   l   l  \0       W   o   r   l   d  \n
    0000014

The usual fix is to audit the input for a non-terminal '\0' and to throw an error instead of passing the bytes on blindly. Whether every interface built on top of libc calls does this correctly is another matter. How exactly is your SSL library handling X.509 records with NUL in them?
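In C such an audit can be a memchr(3) call over the full length of the input, made before the bytes go anywhere near a C string function. This is a sketch only; the name check_name and the use of EINVAL are choices made for this example.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: check a length-delimited name before it is handed
 * to interfaces that expect a C string. len is the number of data bytes,
 * not counting any terminator. Returns 0 if those bytes contain no NUL,
 * otherwise -1 with errno set to EINVAL.
 */
int check_name(const char *buf, size_t len)
{
    if (memchr(buf, '\0', len) != NULL) {
        errno = EINVAL;
        return -1;
    }
    return 0;
}
```

The caller has to know the real length from somewhere, such as a length field in a protocol or database record. If the length came from strlen() the check could never fail, as strlen() stops at the first NUL.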

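I recall a Forth coder who wasn't so good at bash or unix trying to figure out why '\0' in his strings was not getting through to the program that was being executed. Arguments are handed to a new program by execve(2) as NUL-terminated C strings, so anything after the first NUL has no way to get there.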