💾 Archived View for thrig.me › blog › 2023 › 10 › 30 › random-line-from-file.gmi captured on 2023-11-04 at 11:22:37. Gemini links have been rewritten to link to archived content
-=-=-=-=-=-=-
The problem here, assuming a unix file, is that one does not know how many lines there are, nor where the lines are. So one might think to get the number of lines, roll a random number in that range, then go back through the file to find and print that line number.
awk -v line_idx=$(( RANDOM % $(wc -l "$1" | awk '{print $1}') + 1 )) 'NR == line_idx {print;exit}' "$1"
The original code lacked the ";exit" and so kept scanning the remaining lines after printing the match. Not very efficient. Note too that $RANDOM in most shells only ranges up to 32767, so on a file with more lines than that the later lines can never be chosen, and the modulus adds a slight bias toward earlier lines.
Another option is to reservoir-sample the lines. See "The Art of Computer Programming, Volume 2, Section 3.4.2" (Donald E. Knuth) for details.
perl -e 'rand($.) < 1 && ($p = $_) while <>; print $p'
An advantage here is that one need not know how many lines there are, so this code works with a stream. This algorithm can be ported to C or such if you want more speed. (The shell code could also work with a stream, but would need to save the stream to a temporary file, which would be even less efficient.)
Not very scientific; better results would probably need longer durations, more runs, and several different systems. Higher rates are better.
$ jot 10 > ten
$ perl bench.pl ten
     Rate   sh   pl    C
sh 70.7/s   -- -48% -80%
pl  136/s  93%   -- -61%
C   350/s 395% 157%   --
$ jot 100 > hun
$ perl bench.pl hun
     Rate   sh   pl    C
sh 82.7/s   -- -42% -76%
pl  141/s  71%   -- -59%
C   348/s 321% 146%   --
$ jot 1000 > thou
$ perl bench.pl thou
     Rate   sh   pl    C
sh 83.0/s   -- -33% -75%
pl  124/s  50%   -- -63%
C   332/s 300% 168%   --
$ jot 10000 > tenk
$ perl bench.pl tenk
     Rate   sh   pl    C
sh 75.6/s   --  -1% -73%
pl 76.7/s   1%   -- -72%
C   276/s 264% 259%   --
It is somewhat interesting that the Perl code slows to shell-code speed as the number of input lines increases.
shuf from GNU coreutils (installed as gshuf on some non-GNU systems) is another option, as in "shuf -n 1 file", but I have no idea what algorithm it uses.
tags #perl #unix