Sixteen hours to change three lines of code

The bug report was simple enough: “Project: Sippy-Cup [1]” would, when it had to retransmit the data (remember, we're running over UDP), it would send an additional packet.

Table: Data flow for “Project: Sippy-Cup”
Sippy Cup	direction	Client
------------------------------
DATA	→	(packet dropped)
DATA	→	(packet received)
(packet recieved)	←	ACK
DATA	→	 

So far, the bugs found in “Project: Sippy-Cup” have been easy to reproduce (thankfully!) and easy to fix once identified. This one was easy to reproduce … and took two days of intense effort to fix. The bug is not a show-stopper, and the client should (there's that word) handle the duplicate packet. My manager, S, was subtly hinting that I should put this on the backburner, but this is one of those problems that just irks me.

I added logging statements. I instrumented the code. I traced the execution of the program line-by-line (easy enough to do within Lua [2] as there's a pre-defined method for hooking into the execution engine, and even though this didn't reveal the root cause of the bug, it did reveal a few places where some simple changes optimized the code (but that's for another post [3]). For example, here's all the code that's executed [4] when you run the luarocks [5] command, but fair warning—it's nearly 7,000 lines long) and it became clear that something was wrong with the timeout handling (what I didn't mention the other day about scheduling coroutines [6] is how I handle timeouts, but alas, that's for another post [7] as well). The code is basically:

>
```
request = receive_packet_from_client()
send_ack()
send_request_to_db(request)
-- ------------------------------------------------------------------
-- This sets things up to force the coroutine to resume in one second,
-- returning a result of nil.
-- ------------------------------------------------------------------
timeout(1)
-- ----------------------------------------------------------------
-- the code that resumes this coroutine will pass in the result in
-- coroutine.resume(), which is assigned to dbresult
-- ----------------------------------------------------------------
dbresult = coroutine.yield()
-- ------------------------------------------------------------------------
-- cancel a pending timeout, if any. If we did timeout, this is a NOP. If
-- we haven't timed out, this prevents the timeout from triggering.
-- ------------------------------------------------------------------------
timeout(0)
if not dbresult then
info = failed()
else
info = extract(dbresult)
end
for retries = 1 , 3 do
send_information_to_client()
timeout(1)
ack = coroutine.yield()
timeout(0)
if ack then
return
end
end
```

I was able to determine that the initial “timeout” was the request to the database engine timeout trigger (after hacking in a way to track the timeout requests) which meant the timeout code wasn't working properly. Once that was clear, I was able to locate the problem code:

>
```
-- ------------------------------------------------------------------------
-- TQUEUE contains references to coroutines that want a timeout. If we set
-- the timeout to 0, this just sets the awake field to 0, which informs the
-- coroutine scheduler not to run the coroutine when it "times out".
-- ------------------------------------------------------------------------
function timeout(when)
local f = coroutine.running()
if when == 0 then
if TQUEUE[co] then
TQUEUE[co].awake = 0 -- oops, bug. See this post [8] for details
end
else
TQUEUE:insert(when,f)
end
end
```

Here we have an instance where a static compiler would have flagged this immediately [9]. I either need to rename f to co, or change co to f. A simple two line fix (and I really need to find a code checker for Lua).

The other issue, now quickly found, was this bit of code:

>
```
local function handler(info,err,remote,message,startime)
...
local ninfo,err,nremote,nmessage = coroutine.yield()
timeout(0)
if err == 0 then
if not message.request.status then
-- error
else
return nmessage
end
else
...
```

Even a static compiler would not have found this issue—I was checking the wrong variable! The code should read:

>
```
if not nmessage.request.status then
```

Bad me for using a bad name for a variable. No cookie for me!

Okay, this is a simple one line change. Total lines changed: 3.

[1] /boston/2014/03/05.1

[2] http://www.lua.org/

[3] /boston/2015/02/13.1

[4] /boston/2015/02/12/luarocks-run.txt

[5] https://rocks.moonscript.org/

[6] /boston/2015/02/05.1

[7] /boston/2015/02/16.1

[8] /boston/2015/02/16.1

[9] /boston/2015/02/09.1

Gemini Mention this post

Contact the author