ESP32

Surprisingly, I am really fond of this device -- even though it is pretty much an undocumented hodgepodge of weird crap. However, it is _a lot_ of weird crap in there, and the CPU is interesting.

May you live in interesting times.

This Chinese module built by Espressif Systems uses a Tensilica XTensa LX-6 (LX-7 in some newer modules) CPU. I was disappointed but not surprised that Tensilica is a California IP company that collects royalties on its CPU design. As such, there is not much documentation.

A few years ago I picked up a dozen of these $5.00 ESP32 boards, but gave up in disgust when I could not get any documentation -- I refuse to program a computer with an undocumented CPU (unless it is to reverse engineer it, and I was not in the mood).

Today some outdated documentation is available: a large manual documenting the instructions of LX-6 (somewhat dated), a hardware reference manual documenting the mass of registers, and a few related manuals. Just enough to write assembly code.

ESP32

Original Esp32 modules are actually tiny - maybe 15mm square boards complete with a 3-core CPU, around 260Kb of RAM and a few MB or flash. There is a ton of peripherals: analog and digital pins, timers, touch sensors, caches, memory manager, DMA, and even a WIFI module. The hardware reference manual is scary -- hundreds of pages with peripherals that keep coming.

There are two 32-bit general-purpose cores and a third, low-power, supervisory core. The main cores have 16 registers and RISC instructions that execute from a separate instruction memory; instructions are either 2- or 3- bytes long.

Compiler support

Espressif has a free gcc toolchain, and there is an Arduino toolchain available. Both work, but suck.

Arduino, well, is arduino. The build is a bit slow, and I hate it for hiding everything from the user, but I guess it's for 'artists' who want to build wearable LED-flashey things without understanding anything.

The main toolchain is a mess, and is _really_ slow, hides all kinds of headers in several hidden directories, and generally sucks. At the end of my investigation I made the mistake of upgrading to the bleeding edge version, which is even more opaque, uses a ton of python build routines (yuck) and has even more unreadable makefiles. It seems to focus on being a 'universal' build system, pulling components hidden god-knows-where with hidden build code.

Burn it down

When faced with this monstrosity, the obvious choice is to just start over, or at least start with a stripped-down system. Luckily, there are a couple of people on github who paved the way by developing minimal, 'bare-metal' systems.

Programming the beast in assembly is surprisingly easy, and even more surprisingly pleasant. It is very feasible to build a Forth-like system on this architecture.

Idiotic Parameter Passing

Tensilica implements two types of calls: one is a normal RISCy call which leaves the return address in A0, and a bizzare windowed call which rotates registersaround, like some kind of MIPS processor on fentynil. The CPU actually has 64 registers, and the normal C abi has a 16-register window that moves by 8 registers during a call; the high 8 registers become the low 8 (with parameters), 8 more registers become available as a8-a15. On return, the operation is reversed, restoring low registers, shifting low regs up (with return values), and jumping via a0.

This turns to complete crap on overflow, which will happen in general-purpose code, especially if an operating system or recursive calls are involved. An interrupt is triggered and registers must be spilled and restored, making this really suck.

Forths

RISC processors have a problem loading addresses due to the fixed instruction size. LX-6 is no exception - to load a random address, the address needs to be stored elsewhere (but nearby) as a 32-bit value, and a jump instruction has to load the address as a PC-relative load using a small offset from PC (or another known location).

This means that general-purpose code is already doing a form of indirect threading, so theoretically a well-designed Forth should be almost as fast.

There are two Forth implementations that I found, but both seemed to be rather terrible, a factor of 10 slower than assembly/C. Probably because they are C-based and use the monstrous register-window scheme. I am pretty sure I can make my byte-token OctoForth to run at least as fast while taking up a fraction of the normal Forth memory footprint.

Octo on ESP32

And so I went on a sleepless jag building Octo in about 3 days. Normally I use fasm which is not available for the platform, so I had to figure out how to make do with as and a couple of utilities that suck in symbol tables as part of the build...

Flakiness and disaster

While I got the interpreter running and managed to build a very basic (but not self-sufficient system), several issues slowed and eventually stopped me.

During initial familiarization, I found that I was unable to read data from the chip using the tools provided - some bytes were random. This happened immediately with a stock system/board, and I checked it with a few different boards with identical results.

Later, while implementing a minimal 'bare metal' C environment, I started experimenting with linker map files that included a part of a segment of data memory SRAM1, a 64K range of addresses $FFFE0000-$FFFEFFFF. It overlaps with some unused instruction memory, so it should work.

However my code failed _on load_. Further experiments showed that writing to address $FFFE01B5 immediately crashes the system! So I started the segment and $FFFE0200 and everything worked again...

Later on, this segment started giving me more problems -- not crashing, but a couple of addresses had incorrect values on load. Since by then I had a multi-stage loader that moved memory around, the problem was difficult to diagnose, and after staying up for more than 50 hours I gave up, for now.

Plans

I am not sure how to procede, and after sleeping it off and dealing with the usual serotonin crap, haven't touched the system -- I keep thinking that I will have a flash of brilliance. Not yet.

It is possible that my extremely abridged initialization fails to initialize some important part of the system - maybe the 2nd core is running, for instance, or some DMA is running wild.

As a first test I think I will try to see what memory normal linker maps use in Arduino and expressif toolchains, and see if I can isolate the memory in question using normal initialization.

It is possible that the problem is with the flash system - as I initialize it perhaps, making some locations screwy... Although no, writing the crashy RAM address has nothing to do with flash.

If everything else fails I can perhaps write my own RAM loader. This makes the system a little less usable as it would require an outside computer to boot it, making it not good as a controller. However I don't know how to update flash from RAM, or whether it's possible at all, making in-situ development questionable anyway.

Takeaway

I don't know why I like doing these stupid things. There is no way to prove that such a system will work -- even if mine works... And Espressif is probably phasing out the Tensilica-based ESP32s, replacing them with RISC-V, royalty-free CPUs anyway.

But it's kind of fun. Maybe.

index

home