Today T, a fellow cow-order here at The Ft. Lauderdale Office of The Corporation, found a particularly nasty bug in “Project: Sippy-Cup [1]”—two identical SIP (Session Initiation Protocol) packets caused the request to fail. It uses UDP (User Datagram Protocol) for the network transport, so duplicate packets are a possibility (so are dropped packets, but that's already being handled properly), something that the code didn't handle properly.
Now, “Project: Sippy-Cup” is written in Lua [2]. I only say this because of how I handle requests. The main loop of the program waits for network packets using poll() [3] (or epoll_wait() [4] if running under Linux; they're like calls to select() [5] (which is all you ever see in Unix networking tutorials because everybody is basing the code off examples written in the early 90s but I digress) only less CPU (Central Processing Unit) intensive). When poll() returns, I either create a coroutine or resume a coroutine (depending on where we are in the request) to handle the request.
I did it this way, even though I have to manually switch coroutines, to make the processing of requests easier to follow:
>
```
function handle(request)
if not validate(request)
socket:send(request.sender,"400 Bad client! No cookie for you!")
return
end
socket:send(request.sender,"200 Good boy! You're such a good boy!")
local info = lookup_details(request) -- blocking call
local contact = dns.lookup(request.contact) -- blocking call
socket:send(contact,info)
local reply = socket:recv() -- blocking call
if not reply.okay then
log("I messed up")
end
end
```
rather than having all this logic spread across multiple functions where each function represents a step in the processing:
>
```
-- not shown, the code that receives packets and calls the appropriate
-- routine.
function handle_incoming(state)
if not validate(state.request)
socket:send(state.request.sender,"400 Bad client! No cookie for you!")
return
end
socket:send(state.request.sender,"200 Good boy! You're such a good boy!")
return lookup_details -- next function to call
end
function lookup_details(state)
state.info = -- code to send request
return contact_dns
end
function contact_dns(state)
state.contact = state.request.contact
-- code to send a DNS request
return got_dns_response
end
function got_dns_response
socket:send(state.contact,state.info)
return process_reply
end
function process_reply(state)
if not state.reply.okay then
log("I messed up")
end
end
```
It's fairly straightforward—once I get a SIP packet, I check a list to see if I've already created a coroutine based on the unique ID included in the packet. If one does exist, I then resume that coroutine; otherwise, I create one. The coroutine will run until it either finishes, or a function will block (say, a DNS (Domain Name Service) query), and the code that could block will add the coroutine to a list it maintains and yield back to the main code.
The main coroutine, at 50,000 feet:
Any routine that might block (and they're all network related) will put the coroutine on a list, and yield. So, when call dns.lookup(), that function does:
So, step 3 of the main coroutine contains these sub-steps:
The lookup_detail() function (the handling of which would be step 4, which isn't shown) is similar to the DNS lookup. This all works and while the main driver code is a bit complex (and the complexity has to live somewhere), the actual processing code is nice and linear, making it easier to change the “business logic” in one place rather than have it scattered throughout the codebase.
Now we get to the bug—mishandling of duplicate packets. The first SIP packet comes in, has a unique ID, so a coroutine is created, we get to the call to lookup_details() and we yield, waiting for a reponse from our database engine (not a real database engine, but close enough). The duplicate SIP packet comes in. We already have a coroutine associated with this SIP ID, so re blindly resume it. The coroutine, freshly resumed, assumes the reply from the database has arrived and things go down hill from there (but only for that request; the program is still running and can handle new requests).
The solution is to realize we have a duplicate request and not resume the coroutine in that case.
And like most bugs, the fix is trivial—it's the hunt that's difficult.
[3] http://pubs.opengroup.org/onlinepubs/009695399/functions/poll.html
[4] http://linux.die.net/man/4/epoll
[5] http://pubs.opengroup.org/onlinepubs/007908799/xsh/select.html