Shell while Loop Considered Harmful

    $ printf 'one\ntwo' | while read l; do echo "$l"; done
    one
    $

Whoops, silent data loss. POSIX requires that text files end in an ultimate newline to be considered text files, but in practice that ultimate newline may be absent, and then read reports end of input on the unterminated last line and the loop exits without printing it. How many shell scripts are out there and how many ultimate lines have been lost this way?

A less terrible version is, well, verbose.

    $ printf 'one\ntwo' | while IFS= read -r l || [ -n "$l" ]; do printf '%s\n' "$l"; done
    one
    two
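
The longer form works because read still stores an unterminated final line in the variable before returning a non-zero status, so the extra [ -n "$l" ] test notices the leftover text. A minimal sketch for any POSIX-like sh; the helper name last_read is made up for illustration:

```sh
#!/bin/sh
# last_read (hypothetical helper): read one line from stdin, then report
# the exit status of read and what ended up in the variable.
last_read() {
    IFS= read -r l
    status=$?
    printf 'status=%s value=[%s]\n' "$status" "$l"
}

# Final line lacks a newline: read fails, yet the text is in $l.
printf 'two' | last_read

# A complete line: read succeeds with status 0.
printf 'two\n' | last_read

# Without IFS= and -r, read trims surrounding blanks and eats backslashes.
printf '  a\\b  \n' | { read l; printf '[%s]\n' "$l"; }          # [ab]
printf '  a\\b  \n' | { IFS= read -r l; printf '[%s]\n' "$l"; }  # [  a\b  ]
```

The first call reports a non-zero status together with value=[two], which is exactly the case the plain while read loop throws away.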

And, slow. Do you want under a second (C, Perl), or seven seconds (ksh)?

    $ perl -E 'say for 1..1_000_000' | time ./shellwhile >/dev/null
        0m07.34s real     0m03.48s user     0m03.80s system
    $ perl -E 'say for 1..1_000_000' | time ./perlwhile >/dev/null
        0m00.71s real     0m00.45s user     0m00.25s system
    $ perl -E 'say for 1..1_000_000' | time ./cwhile >/dev/null
        0m00.58s real     0m00.03s user     0m00.11s system

Typically the shell will be an order, or orders, of magnitude slower, especially when it forks external tools; the above uses the shell internal echo to make the numbers for the shell less bad. echo is neither portable nor safe for arbitrary input, but in a shell that lacks a printf builtin the portable printf(1) involves a fork for every line, which would make the shell performance even worse.
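
To see why echo is unsafe for arbitrary input, feed it lines that look like options or that contain backslashes. A small sketch; what echo prints here varies between shells, which is the problem, so no output is shown for it. The helper name emit is made up for illustration:

```sh
#!/bin/sh
# emit (hypothetical helper): print one argument verbatim plus a newline.
emit() {
    printf '%s\n' "$1"
}

for s in '-n' '-e' 'a\nb'; do
    echo "$s"   # may print nothing, or expand \n, depending on the shell
    emit "$s"   # prints the string exactly as given
done
```

The '%s\n' format keeps the data out of the format string, so neither a leading dash nor a backslash in the input is interpreted.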

Okay for fast prototyping, terrible for most anything else.

Therefore, if shell code will have something tricky and non-performant like a while loop in it, I generally write that code in some other language.

The Benchmark Code

    $ cat shellwhile
    #!/bin/ksh
    while IFS= read -r l || [ -n "$l" ]; do echo "$l"; done

    $ cat perlwhile
    #!/usr/bin/perl
    print while readline;

    $ cat cwhile.c
    #include <stdio.h>
    int
    main(void)
    {
            char *line      = NULL;
            size_t linesize = 0;
            ssize_t linelen;
            while ((linelen = getline(&line, &linesize, stdin)) != -1)
                    fwrite(line, linelen, 1, stdout);
            return 0;
    }

    $ cat lispwhile.lisp
    (defun main ()
      (loop for line = (read-line *standard-input* nil nil) while line do
        (write-line line)))
    (sb-ext:save-lisp-and-die "lispwhile" :executable t :toplevel 'main)

Pattern Recognition

Commands or groups of commands that are run often should probably be rewritten to be more efficient; consider counting how often each input line occurs, where the input might realistically be some portion of a logfile:

    $ printf 'a\nb\na\na\nb\nc\n' | sort | uniq -c | sort -nr
       3 a
       2 b
       1 c

This gets the job done, but is slow: every line is sorted before anything is counted, and three processes are involved. Somewhat faster is to place all the lines into a hash of line => count pairs, and then to sort by the count, all within a single process. More efficient, but you have to actually notice the pattern, worry about the CPU waste, and then write a specific tool for it.

            Rate shell  perl tally
    shell 88.0/s    --  -41%  -61%
    perl   149/s   70%    --  -34%
    tally  227/s  158%   52%    --
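
The hash approach can be sketched from the shell with awk, which keeps the line => count pairs in one process and leaves only the distinct lines for sort to order. This is an illustration under assumptions, not the tally tool benchmarked above: the function name is made up, awk stands in for a dedicated program, and a second process (sort) is still needed because awk's array traversal order is unspecified.

```sh
#!/bin/sh
# tally (hypothetical name): count identical input lines in one awk process,
# then order the distinct lines by descending count.
tally() {
    awk '
        { count[$0]++ }                 # hash of line => count
        END {
            for (line in count)
                printf "%4d %s\n", count[line], line
        }
    ' | sort -nr
}

printf 'a\nb\na\na\nb\nc\n' | tally
```

For the sample input this prints the same three lines as the sort | uniq -c | sort -nr pipeline, apart from the width of the count column. Lines with equal counts come out in whatever order sort -nr picks for ties.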

https://thrig.me/src/scripts.git

tags #ksh #c #sh #perl #lisp