💾 Archived View for log.pfad.fr › 2023 › debugging-sibgus captured on 2024-06-16 at 12:12:39. Gemini links have been rewritten to link to archived content
⬅️ Previous capture (2023-11-04)
-=-=-=-=-=-=-
I have been happily using Gokrazy on an old Raspberry Pi.
Go(ing)krazy on a Raspberry Pi (Model B)
One of the package that I use is a time series database called VictoriaMetrics (similar to Prometheus).
This setup runs pretty well and it gathers every minute:
However, when I repeatedly looked at some metrics (using Grafana), VictoriaMetrics sometimes encountered a fatal error:
fatal error: fault [signal SIGBUS: bus error code=0x1 addr=0x0 pc=0x6b2168] goroutine 111 [running]: runtime.throw({0xa8fb4f, 0x5}) /usr/lib/go/src/runtime/panic.go:1047 +0x4c fp=0x2d8ec78 sp=0x2d8ec64 pc=0x4dcc0 runtime.sigpanic() /usr/lib/go/src/runtime/signal_unix.go:834 +0xa8 fp=0x2d8ec94 sp=0x2d8ec78 pc=0x67ccc math.IsNaN(...) /usr/lib/go/src/math/bits.go:39 github.com/VictoriaMetrics/VictoriaMetrics/app/vmselect/promql.removeEmptySeries(...) /home/olivier/go/pkg/mod/github.com/!victoria!metrics/!victoria!metrics@v1.88.1/app/vmselect/promql/exec.go:180 github.com/VictoriaMetrics/VictoriaMetrics/app/vmselect/promql.newBinaryOpFunc.func1(0x237af20) /home/olivier/go/pkg/mod/github.com/!victoria!metrics/!victoria!metrics@v1.88.1/app/vmselect/promql/binary_op.go:95 +0x7e8 fp=0x2d8ed9c sp=0x2d8ec98 pc=0x6b2168 github.com/VictoriaMetrics/VictoriaMetrics/app/vmselect/promql.evalBinaryOp(0x0, 0x2998280, 0x206f5e0) /home/olivier/go/pkg/mod/github.com/!victoria!metrics/!victoria!metrics@v1.88.1/app/vmselect/promql/eval.go:410 +0x344 fp=0x2d8ee1c sp=0x2d8ed9c pc=0x6b8234 github.com/VictoriaMetrics/VictoriaMetrics/app/vmselect/promql.evalExprInternal(0x0, 0x2998280, {0xd74640, 0x206f5e0}) /home/olivier/go/pkg/mod/github.com/!victoria!metrics/!victoria!metrics@v1.88.1/app/vmselect/promql/eval.go:290 +0x314 fp=0x2d8eefc sp=0x2d8ee1c pc=0x6b6b5c github.com/VictoriaMetrics/VictoriaMetrics/app/vmselect/promql.evalExpr(0x0, 0x2998280, {0xd74640, 0x206f5e0}) ...
The program restarted automatically and the next metric retrieval was fine. So there was a bug, but I could live with it (and I did for a time).
The panicking part of the code looks at all elements of a slice of []float64 and checks if all of them are Not a Number (NaN). The panic happens in the math.IsNaN function, which is a one-liner: return f != f
Reading a bit about SIGBUS, it means
that a process is trying to access memory that the CPU cannot physically address
Bus error on Wikipedia (CC BY-SA 4.0)
Looking up on the Internet, I saw no reported bug that looked like mine and decided to dig further. I made a clone of the VictoriaMetrics repo and tried "The most effective debugging tool" from Brian Kernighan, "Unix for Beginners" (1979).
The most effective debugging tool is still careful thought, coupled with judiciously placed print statements.
Adding a defer statement did not work, trying to print the value of the float64 also crashed before having the chance to have the value printed.
I also printed the address of the float64: it worked, but I didn't see anything suspect (the address looked legit).
Since the bug appeared to be low-level, I thought of generating a core dump. It wasn't trivial, because, by default, Gokrazy does not provide shell access out of the box. After installing breakglass and busybox, I was able to get a core dump, by running:
ulimit -c unlimited GOTRACEBACK=crash ./victoria-metrics
The hard part was analyzing the core dump. I couldn't easily install a debugger on my Raspberry Pi, so I transferred it to my main machine.
Looking at the stack traces and register values, I wasn't able to find anything interesting (it was even disturbing that the stack trace was not similar to the one printed during the panic).
The next day, it occurred to me that the Linux kernel might have more information regarding the crash. I ran dmesg and got a message looking like this (reconstructed from various Internet reports):
[1145.918454] Alignment trap: not handling instruction 6b2168 at [<b6fa4b40>] [1145.918455] 8<--- cut here --- [1145.918466] Unhandled fault: alignment exception (0x001) at 0x2eb4dbb [1145.918479] pgd = da1a2e09 [1145.918486] [2eb4dbb] *pgd=46c78835, *pte=00000000, *ppte=00000000
I then understood what was going on: the alignment exception was complaining about the address 0x2eb4dbb, which is 48975291 in decimal.
This number looks odd (pun intended), because a float64 is 8 bytes large, and its address should be a multiple of 8 (which 48975291 is obviously not).
I checked again using printf() debugging and sure enough the panic only happened when the float64 address was not a multiple of 8 (i.e., was misaligned).
Now that the bug was identified, I had to understand how it could be that a float64 was misaligned. The Go Specification is very clear, that float64 are guaranteed to be aligned by 8. So I thought that I had found a bug in the Go compiler, how exciting!
Size and alignment guarantees (The Go Programming Language Specification)
I prepared a detailed bug report and looked around for similar issues. I quickly found another issue for a sigbus fault on ARM.
cmd/link: sigbus/segfault on ARM using AzCopy · Issue #38331
At the end of my bug report, I added a link to this issue and wanted to add a reason why my report was different. I inspected closely the issue and saw that they were using unsafe pointers. Upon writing something like "related issue #38331: unsafe pointers are used", I thought that I should better check that my program wasn't using unsafe pointers...
I then saw that the []float64 slice was being constructed using unsafe pointers! It was quite sad that I didn't find a bug in the Go compiler, but I was quite happy that I could solve the issue myself. I adjusted the code to perform alignment adjustments in case of misalignment, and everything has been running smoothly since then!
Pull request to fix the alignment panic in VictoriaMetrics
Next time that I debug a low-level issue, I will remember to look at the kernel messages first (dmesg). It would have spared me a lot of trouble with generating and inspecting core dumps (which didn't help me in the resolution).
And remember to be extra careful when using the unsafe package (as the name helpfully reminds everyone).
📅 2023-03-09