💾 Archived View for thrig.me › blog › 2023 › 10 › 30 › random-line-from-file.gmi captured on 2024-09-29 at 00:39:42. Gemini links have been rewritten to link to archived content

View Raw

More Information

⬅️ Previous capture (2023-11-14)

-=-=-=-=-=-=-

Random Line From File

The problem here, assuming a unix file, is that one does not know how many lines there are, nor where the lines are. So one might think to get the number of lines, roll a random number in that range, then go back through the file to find and print that line number.

    awk -v line_idx=$(( RANDOM % $(wc -l "$1" | awk '{print $1}') + 1 )) 'NR == line_idx {print;exit}' "$1"

The original code lacked the ";exit" so would continue to loop over the remaining lines. Not very efficient.

Another option is to reservoir-sample the lines. See "The Art of Computer Programming, Volume 2, Section 3.4.2" (Donald E. Knuth) for details.

    perl -e 'rand($.) < 1 && ($p = $_) while <>; print $p'

An advantage here is that one need not know how many lines there are, so this code works with a stream. This algorithm can be ported to C or such if you want more speed. (The shell code could also work with a stream, but would need to save the stream to a temporary file, which would be even less efficient.)

Benchmark

Not very scientific; this would probably would need longer durations and more runs and on several different systems for better results. Higher rates are better.

    $ jot 10 > ten
    $ perl bench.pl ten
         Rate   sh   pl    C
    sh 70.7/s   -- -48% -80%
    pl  136/s  93%   -- -61%
    C   350/s 395% 157%   --

    $ jot 100 > hun
    $ perl bench.pl hun
         Rate   sh   pl    C
    sh 82.7/s   -- -42% -76%
    pl  141/s  71%   -- -59%
    C   348/s 321% 146%   --

    $ jot 1000 > thou
    $ perl bench.pl thou
         Rate   sh   pl    C
    sh 83.0/s   -- -33% -75%
    pl  124/s  50%   -- -63%
    C   332/s 300% 168%   --

    $ jot 10000 > tenk
    $ perl bench.pl tenk
         Rate   sh   pl    C
    sh 75.6/s   --  -1% -73%
    pl 76.7/s   1%   -- -72%
    C   276/s 264% 259%   --

It is somewhat interesting that the Perl gets as bad as the shell code as the number of input lines increases.

gshuf of the coreutils package is another option, but I have no idea what algorithm gshuf uses.

Source

bench.pl

randline.pl

randline.sh

tags #perl #unix