--------------------------------------------------------------------------------

title: "Taco Bell Programming" date: 2010-11-25T08:56:00Z

--------------------------------------------------------------------------------

Ted Dziuba has an interesting article[1] on what he calls Taco Bell Programming; it's worth reading -- there is a lot of value in the concept he's promoting.  I had some concerns about the practicality of the approach, so I ran some tests.

1: http://widgetsandshit.com/teddziuba/2010/10/taco-bell-programming.html

I produced a 297MB file containing 20 million lines of more-or-less random data:

ruby -e '20000000.times { |x|
  puts "#{x}\03#{rand(1000000)}"
}' > bigfile.txt

Then I ran:

cat bigfile.txt |
  gawk -F '\03' '{print $1, $0}' |
  xargs -n2 -P7 printf "%s -- %s\n" > newfile.txt

This is a baseline: a real data producer, such as a database query, is going to emit data more slowly than `cat`, and `printf` is going to write lines more quickly than whatever actual processing we'd be doing on the data.  Here are the results:

┌──────────┬───────────┐
│ xargs -P │ Wall time │
╞══════════╪═══════════╡
│ 7        │ 68m       │
├──────────┼───────────┤
│ 3        │ 88m       │
├──────────┼───────────┤
│ 1        │ 241m      │
└──────────┴───────────┘
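
To reproduce the table, a minimal harness along these lines should work, assuming bash and GNU gawk; this is a sketch rather than the exact commands used to collect the numbers above:

# Hypothetical timing harness (not the original run): rerun the pipeline
# once per -P value and let bash's `time` keyword report the wall time.
for p in 1 3 7; do
  echo "xargs -P $p:"
  time ( cat bigfile.txt |
    gawk -F '\03' '{print $1, $0}' |
    xargs -n2 -P$p printf "%s -- %s\n" > newfile.txt )
done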

As far as `top` could tell, xargs was doing all of the work in this test, but the parallelism did help.  This confirmed my suspicion: xargs has to spawn a new Linux process for every line, which is not cheap.  For comparison, a similar program written in Erlang (not known for being the fastest language in the world) was able to process the same amount of data, on a machine with half as many cores, in 20 minutes.  On top of that, the machine it ran on was also (1) running the process that pulled the data from a SQL Server database, which itself consumed 90% of a core, (2) inserting the results into a MongoDB database, and (3) running the MongoDB server being inserted into.  So: half the resources, doing much more work, and it still ran over 3x as fast.
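
One way to see how much of the cost is process spawning is to compare against a variant that forks nothing per line.  The one-liner below is a sketch, not part of the original test: gawk does the formatting itself, so the only long-running process is gawk.

# Single-process variant (sketch, not from the original test): no xargs,
# no fork per line -- gawk formats and writes each record directly.
gawk -F '\03' '{printf "%s -- %s\n", $1, $0}' bigfile.txt > newfile.txt

Whatever this version saves relative to the pipeline above is, roughly, the price of spawning 20 million printf processes.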

Ted's main point -- that code is a liability -- is still valid, and it's always worth asking yourself at the start of a project whether you could solve the problem with tools like these.  However, take the approach with a grain of salt: if you can afford lackluster performance, it's probably a worthwhile solution.  If performance is any sort of consideration, you may need to look elsewhere.