💾 Archived View for ser1.net › post › parallel-shell.gmi captured on 2023-07-22 at 16:26:54. Gemini links have been rewritten to link to archived content


I'm only posting this because (A) it's so ugly, (B) I'm certain it could be done more easily, and (C) writing a program to do it would have taken me no longer.

I wanted to get some data from one application into another. The source data was a tsv file, and I needed only the first and third columns. To import it into the second program, the data needed to be passed as command-line arguments; the target database is just sqlite, so I'm sure I could have transformed and imported it directly, but it seemed straightforward to just run the command.

And it was. The trivial bash loop was:

<input.tsv while read -r line; do
  sparkle -a "$(echo "$line" | cut -f1)" $(echo "$line" | cut -f3)
done

However, there are 9,000 lines in the input, and `sparkle` takes about a second to process each command. I thought, "I should be able to parallelize this easily, shouldn't I?"

Hah!

Ok, so the parallelization isn't hard; just put an `&` at the end of the import line. But forking off 9,000 jobs in shell would make my laptop unhappy, so what I needed was a job pool. It turns out there are several ways to do this, and I looked at three of them.
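For reference, a crude "batch of N" job pool can be built in portable shell with nothing but `&` and `wait`. This is my own sketch, not the post's code; `process_line` is a hypothetical stand-in for the real per-line command (sparkle, buku, etc.):

```shell
# A crude batch-of-N job pool: fork up to max_jobs background jobs,
# then drain the whole batch before forking more.
max_jobs=4
results=$(mktemp)

process_line() {
  echo "processed: $1" >>"$results"   # the real per-line work would go here
}

i=0
printf '%s\n' a b c d e f g | {
  while read -r line; do
    process_line "$line" &            # fork the job...
    i=$((i + 1))
    if [ "$i" -ge "$max_jobs" ]; then
      wait                            # ...but drain every max_jobs forks
      i=0
    fi
  done
  wait                                # drain the final partial batch
}

wc -l <"$results"
```

It's less efficient than a true pool (a slow job stalls its whole batch), but it needs no tools beyond POSIX sh.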

The solution ended up being far more difficult than I expected, mainly because shell expansion happens *before* the solutions perform argument replacement. So if you, e.g., try something like:

echo 1 2 3 | parallel -j 3 printf "%d %d\n" $(({} * {})) $(({} * 5))

you'll get an error like:

zsh: bad math expression: illegal character {

If you need to mangle your input before executing a parallelized command on it, things start to get tricky.
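One way around the ordering problem is to single-quote the whole command: the calling shell then never sees the arithmetic, the placeholder gets substituted into the string first, and a child shell evaluates it afterwards. Here's the idea shown with `xargs -I` rather than parallel (my illustration, not from the post):

```shell
# Single quotes keep *this* shell from evaluating $((...));
# xargs substitutes {} into the string, then the child sh does the math.
printf '1\n2\n3\n' | xargs -I{} sh -c 'echo $(({} * {})) $(({} * 5))'
# → 1 5
# → 4 10
# → 9 15
```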

BLUF: parallel

The solution I ended up with used parallel, but it was way more difficult than it should have been.

parallel -j10 --link buku --nostdin -a ::: $(<.surf/bookmarks cut -f1) ::: $(<.surf/bookmarks cut -f3 | tr ' \n' ',\0' | xargs -0 -i echo x,{})

What I'm doing here is using parallel's argument mixing. The `::: $(<.surf/bookmarks ...)` is the first argument list; the second `::: $(...)` makes the second list. The `--link` argument tells parallel to take one item from the first, and one item from the second, and run the command with them. Without `--link`, parallel will run the command with every combination of the two lists, which is pretty cool, but not what I was looking for.

I had to make sure the second list never had any empty lines, though; frequently, the third column was missing in the input, so the `echo x,{}` ensures there's always *something* for parallel to consume from the second list.
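The placeholder trick can be seen in isolation. With NUL-delimited input, empty fields survive as empty arguments, and prefixing each one guarantees the output line is never empty (my illustration of the idea, not the original pipeline):

```shell
# Three NUL-terminated fields, the middle one empty; the x, prefix
# guarantees each output line is non-empty even when the field is.
printf 'a\0\0b\0' | xargs -0 -I{} echo 'x,{}'
```

This should print `x,a`, `x,`, and `x,b`: three lines, none of them blank.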

It took me obscenely long to figure this out. I might have been able to do something with `bash -c` and split the line up inside the command; this *may* have worked:

parallel -j10 bash -c 'buku --nostdin -a "$(echo "$0" | cut -f1)" "$(echo "$0" | cut -f3)"' :::: .surf/bookmarks
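One pitfall with the `bash -c` trick: the first argument after the script string becomes `$0`, not `$1`, which is a classic source of silently empty arguments. A quick demonstration (mine, not from the post):

```shell
# bash -c assigns the first trailing argument to $0, not $1:
bash -c 'echo "0=$0 1=$1"' first second   # → 0=first 1=second
# A throwaway placeholder for $0 puts the payload in $1:
bash -c 'echo "1=$1"' _ payload           # → 1=payload
```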

moreutils-parallel

I didn't play with this too much; I got stuck pretty quickly on how to break the arguments apart, and focused on `parallel` instead. In retrospect, though, the `bash -c` trick might work with this too.

xargs

This was the most disappointing. First, I was surprised that xargs even *has* a `-P` option, but xargs behaved in odd ways, handling input differently depending on whether I used the `-P` flag. It was xargs that led me to the `bash -c` trick, but then I failed to get it working: arguments and lines would get swallowed, or multiple lines would inexplicably get joined together. I did spend some time on it, because the fewer tools I have to remember the better, but I never did get it working.

The point I'd start with if I come back to this is (again, I emphasize that it *doesn't* work):

<.surf/bookmarks cut -f1,3 --output-delimiter ' ' | tr '\n' '\0' | xargs -0 -L1 -P10 bash -c 'buku --nostdin -a $(echo $1 | cut -f1) $(echo $1 | cut -f3)'
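If I do revisit it, the column splitting itself doesn't need `cut` subshells at all: `read` with a tab `IFS` can split a line into fields in one step. A sketch under the same TSV assumption, not the post's code:

```shell
# Split a tab-separated line with read + IFS instead of two cut
# subshells (portable shell; printf produces the literal tab).
tab=$(printf '\t')
result=$(printf 'one\ttwo\tthree\n' | {
  IFS=$tab read -r first second third
  echo "$first $third"
})
echo "$result"   # → one three
```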

Summary

The most important thing I learned from this is that I spent way, **way** too long trying to figure this out. I could have hacked a solution together in Go in under a half-hour, and while there's value to learning CLI tricks, I have other ways I'd rather spend my time.