I recently spent far too long fighting a pernicious, randomly occurring bug in the scheduler/context switcher code for my AVR operating system, AVRoxide[1]. The bug was caused by two things - a good old-fashioned mistake, and my not properly understanding how differently the ATmega4809[2] processor handles interrupts compared to its predecessors.
When I was trying to work out what was going wrong, Google found remarkably little helpful information - so I'll write up the explanation here in the hope it may help someone else when they Google "why are interrupts weird on the ATmega" :-).
So, essentially the bug in my code was in the context-restoring assembler code. For the uninitiated, a context save is where we save the entire state of the processor - all the registers etc. - some place safe, and a context restore is where we load them back again, restoring the processor state to exactly what it was when it was saved. A context *switch*, where we switch from one thread to another, is essentially just "context save the current thread, then context restore the one we are switching to".
The bug in my code was that my context restore function was obediently restoring the `SREG` status register (as it should.) Unfortunately, I was overlooking the fact that on the AVR, the global "interrupts enabled" flag is part of `SREG`, meaning that as soon as I restored it I was re-enabling interrupts - in the middle of the context restore routine, which is something you most definitely *do not* want interrupted.
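To make the ordering problem concrete, here is a minimal host-side C sketch. It is a toy model, not the actual AVRoxide code (which is AVR assembler), and the names `SavedContext`, `ToyCpu`, `restore_naive` and `restore_deferred` are made up for illustration. It models `SREG` as a byte whose bit 7 is the global interrupt enable flag, and compares a restore that writes `SREG` back first with one that holds the I bit back until the very last step. Deferring the I bit is one possible fix, shown here as an assumption rather than as the fix I shipped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SREG_I_BIT 0x80u  /* global interrupt enable is bit 7 of SREG on AVR */
#define NUM_REGS   32     /* r0..r31 */

/* Hypothetical saved thread context: status register plus r0..r31. */
typedef struct {
    uint8_t sreg;
    uint8_t regs[NUM_REGS];
} SavedContext;

/* Toy CPU state, just enough to show the ordering problem. */
typedef struct {
    uint8_t sreg;
    uint8_t regs[NUM_REGS];
    bool    exposed;  /* true if interrupts were enabled before the restore finished */
} ToyCpu;

/* Reload the general-purpose registers, noting whether an interrupt
 * could have landed while we were only part-way through. */
static void load_registers(ToyCpu *cpu, const SavedContext *ctx) {
    for (int i = 0; i < NUM_REGS; i++) {
        if (cpu->sreg & SREG_I_BIT) {
            cpu->exposed = true;
        }
        cpu->regs[i] = ctx->regs[i];
    }
}

/* The buggy ordering: SREG (including the I bit) goes back first. */
void restore_naive(ToyCpu *cpu, const SavedContext *ctx) {
    assert(cpu != NULL && ctx != NULL);
    cpu->sreg = ctx->sreg;
    load_registers(cpu, ctx);
}

/* One possible fix: restore SREG with I masked off, then set I last. */
void restore_deferred(ToyCpu *cpu, const SavedContext *ctx) {
    assert(cpu != NULL && ctx != NULL);
    cpu->sreg = ctx->sreg & (uint8_t)~SREG_I_BIT;
    load_registers(cpu, ctx);
    cpu->sreg |= ctx->sreg & SREG_I_BIT;
}
```

With a saved `SREG` that has the I bit set, `restore_naive` leaves `exposed` true, while `restore_deferred` finishes with `exposed` false and the same final `SREG` value. The model only looks at the I bit in `SREG`, which is the whole story when the restore runs outside an interrupt.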
Dumb, of course. But you'd think I would have found it pretty quickly... Except I didn't, because *it actually worked fine* on the ATmega4809. In most of my operating system, context switches occur inside interrupt service routines anyway - and I never noticed this bug. It also worked fine on the ATmega328P. Only when I introduced "context switching in userland" (i.e. threads yielding, *outside* of an interrupt), did weird things start to happen.
This code should have broken from day one. But it didn't... Why not?
Essentially, what's different is this: the ATmega4809 (and I guess other "zero series" AVRs) *actually doesn't pay any attention to the global interrupt enable bit while it's inside an Interrupt Service Routine*. Compare that with what came before:
On the older AVRs, when you enter an ISR, the chip clears the global interrupt
enable flag in `SREG`. This is what stops interrupts interrupting themselves.
Then, when you exit the ISR, the `reti` instruction sets the interrupt enable
flag again.
This is why the bad code worked fine on the '328P. OK, it was restoring `SREG` badly (including the interrupt enable flag), but since interrupts were always disabled when the context was *saved* (because every context was saved from within an interrupt), they were never re-enabled by accident when the context was loaded.
But on the newer "zero series" AVRs:
The ATmega4809 *does not* clear the interrupt enable bit when it enters
an interrupt service routine!
OK, but in that case, how does the processor know not to interrupt an interrupt? There is a *separate* flag, `CPUINT.STATUS`, that indicates whether or not the processor is in an ISR, and *this* is what blocks interrupts from interrupting themselves.
So, on the '4809 this code worked for a *different* reason: In fact, I was incorrectly restoring the interrupt-enabled flag, but it didn't matter because as long as I was inside an interrupt the processor was ignoring it anyway.
There is a corollary note to this:
On the ATmega4809, the `reti` instruction also *does not* set the interrupt
enable bit. In fact, it just clears the relevant `CPUINT.STATUS` bit.
This is a pretty subtle but important change in the way interrupts are handled. We are used to assuming that `reti` can be effectively used as an atomic "enable interrupts and return" instruction - but on the zero-series devices, this is no longer true.
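The two behaviours can be put side by side in a small host-side C model. This is a sketch under loud assumptions: `ToyCore`, `enter_isr`, `return_from_isr` and `can_take_interrupt` are hypothetical names, the model ignores priority levels and non-maskable interrupts entirely, and the "interrupt executing" flag is placed at bit 0 of `CPUINT.STATUS` as I read the datasheet (check your own copy before relying on it).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SREG_I_BIT 0x80u  /* global interrupt enable, bit 7 of SREG */
#define LVL0EX_BIT 0x01u  /* assumed position of the "ISR executing" flag */

typedef enum { CORE_CLASSIC, CORE_ZERO_SERIES } CoreKind;

typedef struct {
    CoreKind kind;
    uint8_t  sreg;
    uint8_t  cpuint_status;  /* not used by the classic model */
} ToyCore;

/* Would a normal interrupt be accepted right now? */
bool can_take_interrupt(const ToyCore *c) {
    assert(c != NULL);
    if (!(c->sreg & SREG_I_BIT)) {
        return false;  /* globally disabled on both kinds of core */
    }
    if (c->kind == CORE_ZERO_SERIES && (c->cpuint_status & LVL0EX_BIT)) {
        return false;  /* already in an ISR; the I bit does not matter */
    }
    return true;
}

/* Hardware behaviour on ISR entry. */
void enter_isr(ToyCore *c) {
    assert(c != NULL);
    if (c->kind == CORE_CLASSIC) {
        c->sreg &= (uint8_t)~SREG_I_BIT;  /* classic: clear I */
    } else {
        c->cpuint_status |= LVL0EX_BIT;   /* zero-series: I is left alone */
    }
}

/* Behaviour of the reti instruction. */
void return_from_isr(ToyCore *c) {
    assert(c != NULL);
    if (c->kind == CORE_CLASSIC) {
        c->sreg |= SREG_I_BIT;                     /* classic: set I */
    } else {
        c->cpuint_status &= (uint8_t)~LVL0EX_BIT;  /* zero-series: I untouched */
    }
}
```

Two things fall out of the model. Writing a saved `SREG` with the I bit set while inside an ISR makes `can_take_interrupt` true on the classic core but leaves it false on the zero-series one. And calling `return_from_isr` with the I bit clear turns interrupts back on for the classic core only, which is exactly why `reti` stops being a safe "enable interrupts and return".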
This stuff *is* documented in the datasheets. But it's not exactly "called out" in them. It's a very subtle change in behaviour, and it may be that you never notice... But if you do, you could be banging your head against a wall for a couple of days.
In fact, it means that the behaviour of the ATmega4809 - and I guess other zero-series chips - explicitly contradicts the AVR Instruction Set Manual, which states that:
*RETI - Return from Interrupt*
Returns from the interrupt. The return address is loaded from the STACK, and the Global Interrupt Enable bit is set.
For clarity - I say again... This is *not true* on the ATmega4809. `reti` has no effect on the Global Interrupt Enable bit on these chips, and actually operates on a different register, `CPUINT.STATUS`.
This is the sort of thing which Microchip could really do a better job of calling out in the datasheets - but anyway, consider it a lesson learned. Now I know, and so do you.
Links for anyone wanting to dig further:
2: https://www.microchip.com/en-us/product/ATMEGA4809
4: https://ww1.microchip.com/downloads/en/DeviceDoc/ATmega4808-4809-Data-Sheet-DS40002173A.pdf
5: https://ww1.microchip.com/downloads/en/devicedoc/atmel-0856-avr-instruction-set-manual.pdf