OctoForth is a Forth-inspired threaded interpreter with a unique feature: 8-bit tokens that are magically not limited to a fixed set of 256 meanings.
Forth code consists of lists of tokens which are executed in sequence. Normally tokens are addresses -- of other lists of tokens. To interpret this tokenized code, Forth grabs a token, and goes wherever it points, to that other list, grabs a token, and so on. Eventually a token points to a CODE word containing machine code to execute, and the interpreter goes back up for the next token.
The interpreter spends most of its time threading up and down lists of tokens, occasionally doing some real work. True, this is wasteful in both time and space, but hey, it works, and it is surprisingly compact and fast.
There is one extra detail I glossed over. Every time Forth enters a new list of tokens, it uses the first datum to somehow figure out what that list is -- is it a CODE word to execute? Another list to traverse? A variable?
So every step actually involves an indirection to get the proper interpreter, which then threads the list of tokens that follow.
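In rough C, that classic indirect-threaded scheme looks something like the following sketch. This is illustrative only, not OctoForth and not any particular Forth's real code; the names cell, next and do_colon are made up.

```
/* a minimal sketch of classic indirect threading: every token is the
   cell-sized address of a word, and the word's first cell points at the
   machine-code routine that knows how to interpret the rest of it */
#include <stdint.h>

typedef uintptr_t cell;

cell *ip;                 /* instruction pointer: walks a list of tokens   */
cell  rstack[64];         /* return stack, for nesting into token lists    */
cell *rsp = rstack;

void next(void) {
    cell *w = (cell *)*ip++;                       /* fetch a token         */
    void (*interp)(cell *) = (void (*)(cell *))*w; /* get its interpreter   */
    interp(w);                                     /* CODE runs; colon nests */
}

/* the interpreter for a colon definition: save IP, descend into the list */
void do_colon(cell *w) {
    *rsp++ = (cell)ip;
    ip = w + 1;           /* the token list follows the first cell          */
}
```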
When Forth was discovered, tokens were usually 16 bits, to match the address space of PDP-11s, 6502s and Z80s. Later on, tokens were expanded to 32 bits for 68000s and i386s in protected mode.
There were some attempts, early on, to use smaller, 8-bit tokens. However, this involves yet another indirection to transform the 8-bit token to the proper address, and limits the system to a maximum of 256 possible addresses to execute. In a tiny Forth, that is almost enough to do something useful, but not enough for general-purpose work. Some thought of it as a VM, and compiled code to use only the tokens, as if the 8-bit system was an actual processor.
OctoForth also uses 8-bit tokens, but it suffers from none of the above limitations. It cleverly avoids the extra indirection by assuming that most words are in fact token lists, and it encodes any address in a large system into 8 bits by using a 256-slot window into a larger table of addresses.
Memory is cheap, you say. Why bother with such contortions?
The answer is, oddly enough, efficiency. Modern processors rely on caches to maintain reasonable speed -- accessing actual memory is perhaps a hundred times slower than reading from a cache. Organizing code and data in a cache-efficient manner is a black art.
By using 8-bit tokens, OctoForth words are absurdly compact and dense. Many words, sometimes dozens, fit into a single cache line. This is a clear win.
On the other hand, the table of addresses packs 16 such addresses into a single line, which is a good start. But it gets better: OctoForth, automatically and without any extra effort, maximizes the locality of access.
As words are defined in Forth, they are generally used shortly thereafter. The table reflects this arrangement, and it is very likely that the addresses that tokens refer to are near each other in the table, thus falling inside already-cached lines.
And so, even with today's memories of tens of gigabytes, it is still a good thing to have subroutines measured in tens of bytes!
Like many interesting things in life, the OctoForth interpreter is amazingly simple, consisting of a handful of instructions, and simultaneously maddeningly complicated. I've spent almost 10 years messing with it, and I am still discovering new and weird things about it.
The basic outline of what the interpreter does is very simple: fetch the next token, recover the indirection table from IP, fetch the address stored in the slot that the token names, and jump to it, nesting into the new list.
Wait, how does it return? Simple: a <0> token is reserved to mean RETURN, when it is not the very first token in a list. When it is the very first token, it means that what follows is machine code to be executed.
That's all there is to the inner interpreter.
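Rendered as a rough C sketch, the loop looks something like this. It is a reading of the outline above, not the OctoForth source (which is a handful of machine instructions); names like step, rstack and recover_tabe are made up, and how recover_tabe works is exactly the secret sauce discussed below.

```
#include <stdint.h>

uint8_t  *ip;                     /* walks a list of 8-bit tokens           */
uint8_t  *rstack[64];             /* return stack of saved IPs              */
uint8_t **rsp = rstack;

/* recover the indirection table from IP - explained below */
uint32_t *recover_tabe(uint8_t *ip);

void step(void) {
    uint8_t token = *ip++;                    /* fetch the next token       */
    if (token == 0) {                         /* <0> inside a list: RETURN  */
        ip = *--rsp;
        return;
    }
    uint32_t *tabe = recover_tabe(ip);        /* window for the advanced IP */
    uint8_t *target = (uint8_t *)(uintptr_t)tabe[token];
    if (*target == 0) {                       /* leading <0>: a CODE word   */
        /* execute the machine code that follows the marker (not shown)     */
    } else {                                  /* otherwise another list:    */
        *rsp++ = ip;                          /* nest into it               */
        ip = target;
    }
}
```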
The term `tabe` refers to the base of the currently-appropriate table area. It is a tribute to a dear friend who recently died.
Obviously, the secret sauce is in how we 'recover the indirection table from IP'!
There are many ways to do so, but OctoForth uses a simple and generous scheme: it keeps a parallel memory space of the same size as the dictionary, containing 32-bit slots. For any given IP, there is a corresponding (but overlapping) 256-slot area in table-space.
As IP is incremented or changed, so is the corresponding 256-slot area in this other table memory. To transform the IP to tabe, we simply align it to a 4-byte boundary, and somehow coerce it into this other area.
OctoForth keeps the dictionary at address $30000000 and the tables start at $20000000. The transformation function from IP to table base is a single assembly instruction:
```
and IP,$2FFFFFFC
```
This one instruction simultaneously moves the pointer from its $3xxxxxxx to the $2xxxxxxx space, and clears the two low bits to guarantee slot alignment.
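Here is a tiny, compilable demonstration of that mapping and of how the window slides; tabe_base is a made-up name, and the addresses are the ones given above.

```
#include <stdio.h>
#include <inttypes.h>

static uint32_t tabe_base(uint32_t ip) {
    return ip & 0x2FFFFFFCu;      /* same effect as:  and IP,$2FFFFFFC      */
}

int main(void) {
    for (uint32_t ip = 0x30001000u; ip <= 0x30001009u; ip++)
        printf("IP %08" PRIX32 " -> tabe %08" PRIX32 "\n", ip, tabe_base(ip));
    return 0;
}
/* IPs $30001000..$30001003 all see tabe $20001000; $30001004..$30001007
   see $20001004 - one old slot retired, one new slot gained, 255 shared.  */
```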
Observe that nearby code areas share large portions of their tables! Or to look at it another way: for every 4 tokens, the tabe 'slides' up one slot, retiring the oldest slot and providing a new one, while sharing the rest and hopefully profiting from this arrangement.
Now that it is obvious how the interpreter gets values out of the table-space, the question is: how do values get there in the first place?
During compilation, of course. Normally, Forth compiles words with its `,` word, which simply stores a 32-bit value at HERE and increments HERE by 4. OctoForth adds another compiling word, `ref,`, meaning 'compile a reference'. It compiles an 8-bit token which refers, via the slot it names, to a position in the table that holds the desired address.
The process goes like this: we get the tabe for the IP+1 (because we haven't yet compiled the token at IP!). We search the tabe's 256 slots for the desired address. If it's there, we compile the index/token.
If the address is not in the table, we need to put it there. Ideally, we should place it close to the top of the table, so that subsequent code can see it for as long as possible (since it's not there, we know that code preceding us has no use for it). So, starting at slot $FF we scan down, looking for the first empty spot. We fill it with the address, and compile the index/token.
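A sketch of this compile-time side, in C rather than Forth: `here`, comma8 and ref_comma are hypothetical stand-ins for the real words, and the details (empty slots read as zero, slot 0 reserved for the <0> token, the dictionary and table space actually mapped at the addresses above) are assumptions.

```
#include <stdint.h>

static uint8_t *here = (uint8_t *)0x30000000u;   /* Forth's HERE            */

static uint32_t *tabe_base(const uint8_t *p) {
    return (uint32_t *)((uintptr_t)p & 0x2FFFFFFCu);
}

static void comma8(uint8_t b) { *here++ = b; }   /* compile one 8-bit token */

/* compile an 8-bit reference to addr; returns 0 on tabe overflow */
int ref_comma(uint32_t addr) {
    /* the window the token will see at run time: the base for HERE+1,
       since the token itself has not been compiled yet */
    uint32_t *tabe = tabe_base(here + 1);

    for (int slot = 1; slot < 256; slot++)       /* already in the window?  */
        if (tabe[slot] == addr) { comma8((uint8_t)slot); return 1; }

    for (int slot = 0xFF; slot > 0; slot--)      /* no: take the highest    */
        if (tabe[slot] == 0) {                   /* empty slot we can find  */
            tabe[slot] = addr;
            comma8((uint8_t)slot);
            return 1;
        }
    return 0;                                    /* tabe overflow           */
}
```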
Remember, this is done only while compiling, and efficiency is not an issue here - even when we compile from files, the drive can't keep up with our compilation speed, table searching and all.
OctoForth's indirection scheme is an example of a probabilistic data structure (and perhaps my favorite data structure). It relies on statistics to accomplish what otherwise would look like magic - stuffing 10 pounds of **it into a five-pound bag, as the saying goes.
Here are some things to consider:
If you still doubt how it's possible to represent 32 bits with only 8, consider that it's not really 8. We average about 16 bits per token if we count the table: each 32-bit slot is shared by the 4 tokens that slide the window past it, adding 8 amortized bits to the 8-bit token. And we also use the address of the token itself as part of the encoding, getting even more entropy.
What if there is no empty slot, you ask? We have a problem, a tabe overflow. This is very unlikely, because every 4 tokens we get a new, empty slot at the top of the tabe. And our tabe already contains a lot of useful targets.
But what happens if we do overflow? Well, it is a compile-time error condition, but one that we can recover from simply by restarting the compile slightly higher up in memory. How much higher? For every 4 bytes up we get a new slot, so we could try 4 bytes higher, or maybe 16 just to be safe.
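One possible shape for that recovery, very much simplified; compile_word and here are hypothetical, and a real system would have more compile state to rewind.

```
#include <stdint.h>

extern uint8_t *here;                       /* Forth's dictionary pointer   */
extern int compile_word(const char *src);   /* 0 means a tabe overflow      */

void compile_with_retry(const char *src) {
    uint8_t *start = here;
    while (!compile_word(src)) {            /* overflow: the bytes we skip  */
        start += 16;                        /* are wasted, but every 4 of   */
        here = start;                       /* them exposes one fresh slot  */
    }
}
```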
The opposite is underflow, and that is actually much more likely. It is not an error, but it is a waste of table space - slots may go unused. I just shrug it off... Early on I used a denser scheme which allowed 8 tokens for every slot, and even 16. This cut the size of the table by a factor of 2 or 4, yielding a per-token average of 12 or even 10 bits, but I did occasionally encounter overflow.
With OctoForth, I went for a 4:1 ratio, which is generous. I haven't hit overflow yet. If I find that statistics call for a denser packing, I will take care of it later, as it will improve cache utilization.