Surprisingly, I am really fond of this device -- even though it is pretty much an undocumented hodgepodge of weird crap. However, there is _a lot_ of weird crap in there, and the CPU is interesting.
This Chinese module built by Espressif Systems uses a Tensilica Xtensa LX-6 (LX-7 in some newer modules) CPU. I was disappointed but not surprised to learn that Tensilica is a California IP company that collects royalties on its CPU design. As such, there is not much documentation.
A few years ago I picked up a dozen of these $5.00 ESP32 boards, but gave up in disgust when I could not get any documentation -- I refuse to program a computer with an undocumented CPU (unless it is to reverse engineer it, and I was not in the mood).
Today some documentation is available, though it is somewhat dated: a large manual documenting the LX-6 instructions, a hardware reference manual documenting the mass of registers, and a few related manuals. Just enough to write assembly code.
Original ESP32 modules are actually tiny - maybe 15mm square boards complete with a 3-core CPU, around 520 KB of RAM and a few MB of flash. There is a ton of peripherals: analog and digital pins, timers, touch sensors, caches, memory manager, DMA, and even a Wi-Fi module. The hardware reference manual is scary -- hundreds of pages with peripherals that keep coming.
There are two 32-bit general-purpose cores and a third, low-power, supervisory core. The main cores have 16 visible registers and RISC instructions that execute from a separate instruction memory; instructions are either 2 or 3 bytes long.
Espressif has a free gcc toolchain, and there is an Arduino toolchain available. Both work, but suck.
Arduino, well, is Arduino. The build is a bit slow, and I hate it for hiding everything from the user, but I guess it's for 'artists' who want to build wearable LED-flashey things without understanding anything.
The main toolchain is a mess: it is _really_ slow, hides all kinds of headers in several hidden directories, and generally sucks. At the end of my investigation I made the mistake of upgrading to the bleeding-edge version, which is even more opaque, uses a ton of Python build routines (yuck) and has even more unreadable makefiles. It seems to focus on being a 'universal' build system, pulling components hidden god-knows-where with hidden build code.
When faced with this monstrosity, the obvious choice is to just start over, or at least start with a stripped-down system. Luckily, there are a couple of people on GitHub who paved the way by developing minimal, 'bare-metal' systems.
Programming the beast in assembly is surprisingly easy, and even more surprisingly pleasant. It is very feasible to build a Forth-like system on this architecture.
Tensilica implements two types of calls: one is a normal RISCy call which leaves the return address in a0, and the other is a bizarre windowed call which rotates registers around, like some kind of SPARC processor on fentanyl. The CPU actually has 64 registers, and the normal C ABI has a 16-register window that moves by 8 registers during a call; the caller's high 8 registers become the callee's low 8 (with parameters), and 8 more registers become available as a8-a15. On return, the operation is reversed: the caller's low registers come back, the callee's low registers (with return values) become the caller's high registers again, and execution jumps via a0.
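To make the rotation concrete, here is a toy model of the window in Python. It is a rough sketch under loud assumptions: the class and method names are made up, the overflow check is a simplification (real hardware tracks live frames in special registers and raises exceptions that a handler services), and only the 8-register rotation described above is modeled.

```python
NUM_PHYS = 64   # physical registers, per the description above
WINDOW = 16     # registers visible as a0..a15
ROTATE = 8      # window movement for an 8-register windowed call


class WindowedRegs:
    """Toy model of a rotating register window (not cycle- or ISA-accurate)."""

    def __init__(self):
        self.phys = [0] * NUM_PHYS
        self.base = 0        # physical index of the current a0
        self.resident = 1    # frames currently held in the register file
        self.saved = []      # frames spilled to a simulated stack
        self.overflows = 0
        self.underflows = 0

    def _index(self, n):
        if not 0 <= n < WINDOW:
            raise IndexError("only a0..a15 are visible")
        return (self.base + n) % NUM_PHYS

    def get(self, n):
        return self.phys[self._index(n)]

    def set(self, n, value):
        self.phys[self._index(n)] = value & 0xFFFFFFFF

    def call8(self):
        # Each caller keeps 8 registers; the running frame needs all 16.
        # If the new window would run into the oldest resident frame,
        # spill that frame first (the simplified "overflow").
        if (self.resident + 1) * ROTATE + ROTATE > NUM_PHYS:
            oldest = (self.base - (self.resident - 1) * ROTATE) % NUM_PHYS
            self.saved.append(
                [self.phys[(oldest + i) % NUM_PHYS] for i in range(ROTATE)]
            )
            self.resident -= 1
            self.overflows += 1
        self.base = (self.base + ROTATE) % NUM_PHYS
        self.resident += 1

    def ret(self):
        self.base = (self.base - ROTATE) % NUM_PHYS
        self.resident -= 1
        if self.resident == 0:
            # The caller's low registers were spilled earlier: reload them.
            frame = self.saved.pop()
            for i, value in enumerate(frame):
                self.phys[(self.base + i) % NUM_PHYS] = value
            self.resident = 1
            self.underflows += 1


regs = WindowedRegs()
regs.set(10, 5)          # caller places an argument in a10
regs.call8()
argument = regs.get(2)   # callee sees the same register as a2
regs.set(2, argument * 2)
regs.ret()
result = regs.get(10)    # return value shows up in the caller's a10
```

Nothing is copied on a call or return; only `base` moves. The cost shows up in the spill path, which is exactly the part that hurts in deep or recursive code.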
This turns to complete crap on overflow, which will happen in general-purpose code, especially if an operating system or recursive calls are involved. An interrupt is triggered and registers must be spilled and restored, making this really suck.
RISC processors have a problem loading addresses due to the limited instruction size. LX-6 is no exception - to load an arbitrary address, the address needs to be stored elsewhere (but nearby) as a 32-bit value, and the code has to fetch it with a PC-relative load (L32R) using a small offset from PC (or another known location).
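The arithmetic behind such a literal load can be sketched in a few lines. This assumes my reading of the L32R encoding is right: a 16-bit word offset that always points backwards, applied to the instruction address rounded up to a 4-byte boundary. Treat the constants as assumptions and check them against the ISA manual; the function names are made up.

```python
L32R_REACH = 1 << 18  # 262144 bytes: 16-bit word offset, backwards only (assumed)


def l32r_base(pc):
    """Address the offset is applied to: PC + 3, rounded down to a word boundary."""
    return (pc + 3) & ~3 & 0xFFFFFFFF


def l32r_imm16(pc, literal_addr):
    """Return the 16-bit immediate needed to reach literal_addr from pc."""
    if literal_addr & 3:
        raise ValueError("literal must be 4-byte aligned")
    offset = literal_addr - l32r_base(pc)
    if not -L32R_REACH <= offset <= -4:
        raise ValueError("literal must sit within 256 KB before the instruction")
    return (offset >> 2) & 0xFFFF


def l32r_target(pc, imm16):
    """Inverse of l32r_imm16: the address a given immediate would load from."""
    return (l32r_base(pc) + ((imm16 << 2) - L32R_REACH)) & 0xFFFFFFFF


# Hypothetical addresses: a literal placed 16 bytes before the load instruction.
imm = l32r_imm16(0x40080010, 0x40080000)
```

The practical consequence is that literal pools have to be emitted before the code that uses them, which is why the assembler wants them placed ahead of each function.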
This means that general-purpose code is already doing a form of indirect threading, so theoretically a well-designed Forth should be almost as fast.
There are two Forth implementations that I found, but both seemed to be rather terrible, a factor of 10 slower than assembly/C. Probably because they are C-based and use the monstrous register-window scheme. I am pretty sure I can make my byte-token OctoForth run at least as fast while taking up a fraction of the normal Forth memory footprint.
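For readers who have not seen token threading, here is a minimal sketch of the general idea in Python: code is a string of bytes, and each byte indexes a table of primitives. This is not OctoForth's actual design; the opcode set, names, and encoding are invented purely for illustration.

```python
class VM:
    """State for a tiny byte-token interpreter (illustrative only)."""

    def __init__(self, code):
        self.code = bytes(code)
        self.ip = 0          # index of the next token
        self.stack = []      # data stack
        self.running = True


def op_halt(vm):
    vm.running = False


def op_lit(vm):
    # The byte following LIT is an inline literal.
    vm.stack.append(vm.code[vm.ip])
    vm.ip += 1


def op_add(vm):
    b = vm.stack.pop()
    a = vm.stack.pop()
    vm.stack.append((a + b) & 0xFFFFFFFF)


def op_mul(vm):
    b = vm.stack.pop()
    a = vm.stack.pop()
    vm.stack.append((a * b) & 0xFFFFFFFF)


def op_dup(vm):
    vm.stack.append(vm.stack[-1])


# Token value == index into this table.
PRIMITIVES = [op_halt, op_lit, op_add, op_mul, op_dup]
HALT, LIT, ADD, MUL, DUP = range(len(PRIMITIVES))


def run(code):
    """Inner interpreter: fetch a byte, dispatch through the table, repeat."""
    vm = VM(code)
    while vm.running:
        token = vm.code[vm.ip]
        vm.ip += 1
        PRIMITIVES[token](vm)
    return vm.stack


# 3 DUP * 4 +
program = [LIT, 3, DUP, MUL, LIT, 4, ADD, HALT]
final_stack = run(program)
```

Each operation costs one byte of program memory, which is where the footprint savings come from; the price is one table lookup per token in the dispatch loop.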
And so I went on a sleepless jag building Octo in about 3 days. Normally I use fasm which is not available for the platform, so I had to figure out how to make do with as and a couple of utilities that suck in symbol tables as part of the build...
While I got the interpreter running and managed to build a very basic (but not self-sufficient) system, several issues slowed and eventually stopped me.
During initial familiarization, I found that I was unable to read data from the chip using the tools provided - some bytes were random. This happened immediately with a stock system/board, and I checked it with a few different boards with identical results.
Later, while implementing a minimal 'bare metal' C environment, I started experimenting with linker map files that included a part of a segment of data memory SRAM1, a 64K range of addresses $FFFE0000-$FFFEFFFF. It overlaps with some unused instruction memory, so it should work.
However, my code failed _on load_. Further experiments showed that writing to address $FFFE01B5 immediately crashes the system! So I started the segment at $FFFE0200 and everything worked again...
Later on, this segment started giving me more problems -- not crashing, but a couple of addresses had incorrect values on load. Since by then I had a multi-stage loader that moved memory around, the problem was difficult to diagnose, and after staying up for more than 50 hours I gave up, for now.
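One low-tech way to narrow this kind of problem down is to diff a readback dump against the image that was supposed to be loaded and list the addresses that disagree. A minimal host-side sketch, assuming both are available as byte strings of equal length; the function names and the sample data are made up, and the base address is simply the segment start mentioned above.

```python
def find_mismatches(expected, readback, base=0):
    """Return (address, expected_byte, actual_byte) for every differing byte."""
    if len(expected) != len(readback):
        raise ValueError("images must be the same length")
    return [
        (base + offset, want, got)
        for offset, (want, got) in enumerate(zip(expected, readback))
        if want != got
    ]


def group_runs(addresses):
    """Collapse a list of addresses into inclusive (start, end) ranges."""
    runs = []
    for addr in sorted(addresses):
        if runs and addr == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], addr)
        else:
            runs.append((addr, addr))
    return runs


# Made-up sample: a 16-byte image with three corrupted bytes in the readback.
SEGMENT_BASE = 0xFFFE0000
image = bytes(range(16))
dump = bytearray(image)
dump[3] ^= 0xFF
dump[4] ^= 0x01
dump[9] = 0xAA

bad = find_mismatches(image, bytes(dump), base=SEGMENT_BASE)
for start, end in group_runs([addr for addr, _, _ in bad]):
    print(f"mismatch ${start:08X}-${end:08X}")
```

Running the same diff after each loader stage would show which stage first produces the bad bytes, and whether the bad addresses stay put across boards and runs.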
I am not sure how to proceed, and after sleeping it off and dealing with the usual serotonin crap, I haven't touched the system -- I keep thinking that I will have a flash of brilliance. Not yet.
It is possible that my extremely abridged initialization fails to initialize some important part of the system - maybe the 2nd core is running, for instance, or some DMA is running wild.
As a first test I think I will try to see what memory the normal linker maps use in the Arduino and Espressif toolchains, and see if I can isolate the memory in question using normal initialization.
It is possible that the problem is with the flash system - perhaps the way I initialize it makes some locations screwy... Although no, writing the crashy RAM address has nothing to do with flash.
If everything else fails I can perhaps write my own RAM loader. This makes the system a little less usable as it would require an outside computer to boot it, making it not good as a controller. However I don't know how to update flash from RAM, or whether it's possible at all, making in-situ development questionable anyway.
I don't know why I like doing these stupid things. There is no way to prove that such a system will work -- even if mine works... And Espressif is probably phasing out the Tensilica-based ESP32s, replacing them with RISC-V, royalty-free CPUs anyway.
But it's kind of fun. Maybe.