A few days to change what amounts to about three lines of code

I spent a few days tracking down a tough bug in “Project: Sippy-Cup [1]”. The bug wasn't in the happy path [2] (explained here [3]) but the “unhappy path” that happens when a timeout occurs [4] [memo to self: check the blog before spending a few hours rewriting what you previously wrote].

We were able to reproduce the error rather quickly. What took the time was trying to isolate the root cause, and what I should have done [memo to self: reread my previous memo to self] was enable the statement logging [5] I did a few years ago. Doing so allowed me to find the root rather quickly.

Each request is handled in a coroutine (think “very cheap thread”). The coroutine will acknowledge the incoming SIP (Session Initiation Protocol) message, extract the data we need from said message and send it to the “Business Logic Unit.” But because we have a limited amount of time to do that voodoo that we do (else we exceed our SLA (Service Level Agreement) and have to pay large amounts of money to The Monopolistic Phone Company), we will only wait for a limited period of time for the reply from the “Business Logic Unit.” In either case, we then send out a SIP message and wait for an acknowledgement that it was received.

The happy path—nothing times out and everybody is happy. The unhappy path is when the reply from the “Business Logic Unit” is late, and that late reply was somehow showing up as the “acknowledgement,” which my code didn't handle gracefully at all!

So normally the program does:

And the unhappy path was supposed to go like:

So really, not much difference. But in reality things were different:

The fix, as always, was easy—if a coroutine is scheduled to run, don't reschedule it. What's not so easy is finding the root cause.

[1] /boston/2014/03/05.1

[2] https://en.wikipedia.org/wiki/Happy_path

[3] /boston/2015/02/05.1

[4] /boston/2015/02/16.1

[5] /boston/2015/02/12.1

Gemini Mention this post

Contact the author