--------------------------------------------------------------------------------
title: "Taco Bell Programming"
date: 2010-11-25T08:56:00Z
--------------------------------------------------------------------------------
Ted Dziuba has an interesting article[1] on what he calls Taco Bell Programming; it's worth reading -- there is a lot of value in the concept he's promoting. I had some concerns about the practicality of the approach, so I ran some tests.
1: http://widgetsandshit.com/teddziuba/2010/10/taco-bell-programming.html
I produced a 297MB file containing 20 million lines of more-or-less random data:
ruby -e '20000000.times { |x| puts "#{x}\03#{rand(1000000)}" }' > bigfile.txt
Then I ran:
cat bigfile.txt | gawk -F '\03' '{print $1, $0}' | xargs -n2 -P7 printf "%s -- %s\n" > newfile.txt
This is a baseline: a real data producer, such as a database query, is going to produce data more slowly than `cat`, and `printf` is going to write lines more quickly than whatever real processing we would do on the data. Here are the results for several values of xargs' `-P` (the maximum number of parallel processes), with times in minutes:
┌───┬──────┐
│ P │ Time │
╞═══╪══════╡
│ 7 │ 68m  │
├───┼──────┤
│ 3 │ 88m  │
├───┼──────┤
│ 1 │ 241m │
└───┴──────┘
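To put the table in per-line terms, here is a small sketch that converts each time into lines per second and a speedup relative to P=1. The line count and minutes come straight from the test above; the variable names are only illustrative.

```shell
#!/bin/sh
# Throughput implied by the results table: 20 million lines, times in minutes.
lines=20000000

# Each input row is "<P> <minutes>"; base is the P=1 time used for the speedup.
throughput=$(printf '7 68\n3 88\n1 241\n' | awk -v n="$lines" -v base=241 '
  { printf "P=%d  %.0f lines/s  %.1fx vs P=1\n", $1, n / ($2 * 60), base / $2 }')

printf '%s\n' "$throughput"
```

Even at P=7 this works out to only a few thousand lines per second, which gives a sense of how much per-line overhead the pipeline carries.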
As far as `top` could tell, xargs was doing all of the work in this test, though running several processes in parallel did help. This confirmed my suspicions: xargs has to spawn a new Linux process for every line, which is not cheap.
For comparison, a similar program written in Erlang (not known for being the fastest language in the world) processed the same amount of data, on a machine with half as many cores, in 20 minutes. That machine was also (1) running a process that pulled the data from a SQL Server database and itself consumed 90% of a core, (2) inserting the results into a MongoDB database, and (3) running the MongoDB server being inserted into. So: half the resources, doing much more work, and it still runs over 3x as fast.
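The per-line process cost described above can be avoided without leaving the shell. The sketch below was not part of the original test and makes no claim about timings; it only checks that three variants produce the same output on a small sample. It assumes the field separator is the `\003` (ETX) byte, and the 200-line sample and variable names are made up for illustration.

```shell
#!/bin/sh
# Sketch: three ways to turn "<n>\003<value>" lines into "<n> -- <line>".
# The 200-line sample and all variable names are illustrative assumptions.
sample=$(mktemp)
awk 'BEGIN { srand(1); for (i = 0; i < 200; i++) printf "%d\003%d\n", i, int(rand() * 1000000) }' > "$sample"

# 1. Shape of the original test: xargs starts one printf process per line.
per_line=$(awk -F '\003' '{ print $1, $0 }' "$sample" | xargs -n 2 printf '%s -- %s\n')

# 2. Batched: printf reuses its format string for extra arguments, so one
#    process can handle 50 lines (100 arguments) at a time.
batched=$(awk -F '\003' '{ print $1, $0 }' "$sample" | xargs -n 100 printf '%s -- %s\n')

# 3. No extra processes at all: awk does the formatting itself.
in_awk=$(awk -F '\003' '{ printf "%s -- %s\n", $1, $0 }' "$sample")

rm -f "$sample"

# All three variants should produce identical output.
[ "$per_line" = "$batched" ] && [ "$batched" = "$in_awk" ] && echo "outputs match"
```

The second and third variants only apply when the per-line work is as trivial as `printf`. If each line needs a genuinely separate program run, the spawn cost stays.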
Ted's main point -- that code is a liability -- is still valid, and it's always worth starting a project by asking yourself how you could solve your problem with such tools. However, take such approaches with a grain of salt: if you can afford lackluster performance, they are probably a worthwhile solution. If performance is any sort of consideration, you may need to look elsewhere.