2020-11-29T20:17
A few days ago, I wrote about how I found Bash's UDP redirections a little flaky and replaced them with OpenBSD Netcat. You may want to read that first if you missed it.
weirdness redirecting to /dev/udp
So after a few days in prod, I noticed a lot of restarts on those containers. The exit code was 137 indicating that bash was exiting due to a pipe failure.
Since all that changed was the use of netcat instead of a Bash UDP redirect, I was pretty sure I knew where the problem was.
The first thing I did was start a netcat server on localhost and smash it with a stream of data.
I quickly noticed that when I killed the server mid-stream, the client knew and exited. This broke my internal model of the workings of UDP. Because UDP is connectionless and offers no guarantee of packet delivery, I expected the client to continue on its merry way, sending packets to a destination that was gone.
I first checked if the last packet showed anything different from the others… nope.
I then checked whether any packets came back to the source port… nope.
Then I asked someone I worked with whether they had any ideas and they were equally stumped.
Then something twigged. I remembered ICMP. I must have learnt it a long time ago but it was buried deep in my brain under all sorts of shit, yet surprisingly it came to me just when I needed it.
A quick search led me to RFC 792¹. It explains that there are provisions in ICMP to signal when a destination is unreachable. I ran tcpdump again, this time capturing ICMP as well, and there it was: a destination unreachable message.
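To make the effect concrete, here is a minimal sketch of how that ICMP message surfaces in a client. It is not what netcat does internally (I haven't checked its source); it only assumes a connected UDP socket, which is how the kernel knows where to deliver the error. The helper names free_udp_port and send_until_refused are made up for illustration, and the behaviour depends on the operating system and on ICMP not being filtered.

```python
import socket
import time
from typing import Optional


def free_udp_port() -> int:
    """Bind to an ephemeral port, then release it, so nothing listens there."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


def send_until_refused(host: str, port: int, attempts: int = 50) -> Optional[int]:
    """Send datagrams on a connected UDP socket.

    Returns how many sends succeeded before the kernel reported
    "connection refused", or None if no error showed up at all.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as client:
        # connect() on UDP sends nothing; it only fixes the peer address,
        # which lets the kernel match a returning ICMP error to this socket.
        client.connect((host, port))
        for sent in range(attempts):
            try:
                client.send(b"example.counter:1|c\n")
            except ConnectionRefusedError:
                return sent
            time.sleep(0.01)  # give the ICMP reply a moment to arrive
    return None


result = send_until_refused("127.0.0.1", free_udp_port())
print("sends before the error:", result)
```

The first send always goes out without complaint, because the error can only be reported after a datagram has bounced. That matches what I saw with netcat: the client carries on until a later write, then dies.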
So what was happening was that our StatsD service was sometimes becoming unavailable, and an ICMP destination unreachable message was coming back to my service. Netcat exited. This was caught by Bash's pipefail option. The script exited and Kubernetes restarted the container.
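For anyone who hasn't met pipefail, this rough sketch shows the difference it makes. It uses `false | cat` as a stand-in pipeline, since my actual script isn't shown here, and the helper name pipeline_exit_status is invented for the example. It assumes bash is on the PATH.

```python
import shutil
import subprocess

# Path to bash, or None if it is not installed (assumption: it usually is).
BASH = shutil.which("bash")


def pipeline_exit_status(pipeline: str, pipefail: bool) -> int:
    """Run a pipeline under bash and return the exit status bash reports."""
    prefix = "set -o pipefail; " if pipefail else ""
    result = subprocess.run([BASH or "bash", "-c", prefix + pipeline])
    return result.returncode


if BASH is not None:
    # `false` fails, `cat` succeeds. Without pipefail only the last command
    # counts; with pipefail the failure of any stage fails the pipeline.
    print("without pipefail:", pipeline_exit_status("false | cat", pipefail=False))
    print("with pipefail:   ", pipeline_exit_status("false | cat", pipefail=True))
```

Without the option, a failure in an earlier stage is silently swallowed as long as the last stage succeeds. With it, the pipeline fails, and a script that also exits on errors goes down with it.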
I modified the script to capture the error internally, restart netcat and log when this happens. If it happens too frequently, we might need to sort out why StatsD keeps disappearing. This is a new problem we didn't know we had.
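The real fix lives in the Bash script, which isn't reproduced here. The general shape of "catch the failure, log it, restart" looks something like the sketch below. The function name run_with_restarts, the restart limit and the delay are all assumptions made for illustration.

```python
import logging
import subprocess
import time
from typing import Sequence, Tuple

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("udp-sender")


def run_with_restarts(
    cmd: Sequence[str], max_restarts: int = 5, delay: float = 1.0
) -> Tuple[int, int]:
    """Run cmd, restarting it whenever it exits non-zero.

    Returns (number of restarts, last exit code). Gives up once
    max_restarts restarts have been used.
    """
    restarts = 0
    while True:
        code = subprocess.run(list(cmd)).returncode
        if code == 0 or restarts >= max_restarts:
            return restarts, code
        restarts += 1
        # Logging each restart is what makes the underlying problem visible.
        log.warning(
            "%s exited with status %d; restart %d of %d",
            cmd[0], code, restarts, max_restarts,
        )
        time.sleep(delay)
```

Calling it with something like run_with_restarts(["nc", "-u", "statsd.example", "8125"]) would keep a sender alive, where the host and port are placeholders rather than our real ones. Counting the warnings in the logs then tells you how often the destination is going away.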
The joys of measuring shit, eh.
The content for this site is CC-BY-SA-4.0.