💾 Archived View for elektito.com › gemlog › 17-starforth-2 captured on 2024-06-16 at 12:15:33. Gemini links have been rewritten to link to archived content
⬅️ Previous capture (2023-07-10)
-=-=-=-=-=-=-
This article is part of an ongoing series on writing a self-hosting compiler. You can find the first installment here:
Starforth: A Minimal Self-Hosting Compiler
This post goes over how I shaped up the first version of the compiler. The main criteria for version 1 were, the compiler should be able to compile its own source code, and also be bootstrappable by the compiler written in assembly. Each version of the compiler is kept in its own branch, so you can find the latest update for version 1 in the v1 branch, though I am going to mainly reference how the source code looked like at tag "v1.0":
I try to explain Forth concepts as we encounter them, but I'm not going to go into details. You can find many good references. I personally started by reading the online version of the book "Starting Forth". You can find the online version of the book at:
https://www.forth.com/starting-forth/
There was a problem when I started working on the Starforth compiler: I didn't know the source language. Not only I wasn't much of an expert on Forth itself, but I didn't know the particulars of my own variant of Forth, since there are a lot of Forths out there. I immediately dove in writing assembly code for the bootstrap compiler, but I soon realized that not knowing what primitives I'm going to need is a big issue. It was so frustrating that I soon abandoned that route. That effort is not even in the git history.
I then came up with another idea. What if I start writing the Forth code first? I didn't have a compiler to compile and run it, but it was still going to give me a more concrete idea about what I needed to implement.
Many tries later, things started getting more concrete in my head. The main problem that needed to be solved were immediates. An immediate is a word (remember that's a functions in Forth) that are run at compile-time. Lisp programmers would call it a macro. With an interpreter that should be relatively easy to implement, but with an ahead-of-time compiler, I needed to compile the code to memory and run it from there. That sounded like a lot of work. Trouble was, control flow structures like IF...THEN and DO...LOOP are traditionally implemented using immediates, so it looked like that I really needed to have them.
I found the solution some time later when I was reading about preForth/seedForth:
https://github.com/uho/preforth
This is a cool project that aims to bootstrap itself from a small kernel and then grow from there. Looking at what they had done I realized that they don't actually have immediates in their bootstrapper. The only control flow they have is "?exit". That's a word that reads a boolean from the stack and returns if it's true. That sounded like what I needed. It was going to be hard not having any loops or even "if", but that could only add to the fun of the project!
Another issue was variables. Variables are traditionally implemented using Forth's dictionary which maps words to their code. This dictionary most definitely does not exist in the code produced by my compiler, so what should I do? I decided the program will have a block of memory it can use however it wants. There would be a primitive named "mem" that returns the start address to this block of memory. So declaring variables would become something like this:
: foo mem ; : bar mem 4 + ;
So "foo" returns the address of the start of user memory. And "bar" returns the address to 4 bytes after that. As I mentioned in the previous article, variables in Forth return their addresses, so these work exactly like normal Forth variables. We just need to manually assign them addresses in our block of memory.
We also need some string operations. I decided a single string variable is going to suffice for everything. We're going to compile the input one word at a time. We'll implement strings as a buffer that starts with the string length (in the first 4 bytes), followed by the contents of the string. Our single string buffer is only 36 bytes, so our compiler can handle names no longer than 32 bytes. I implemented a couple of helper words to make things slightly easier to read. "str-length" returns the length of the current string, "str-start" returns the start address of the string contents, and "str-ch" receives an index n on the stack, and returns the nth character of the string. "clear-string" sets the string buffer to an empty string, and finally "append-char" appends one character to the end of the string. That should be enough for our compiler.
Let's start with some code. A Starforth program starts with the "main" word. This is what I imagined the compiler's own "main" word would look like:
: main init compile-all epilogue ;
The final version of "main" had a little more to it, but those are the main steps. "compile-all" is the more interesting word here, so let's try implementing that:
: compile-all read eof? ?exit compile tail compile-all ;
This starts by reading a single token. This is read into the compiler's string buffer. We then check if we've hit the end of file. If not we "compile" the token we just read and finally "compile-all" calls itself again to repeat everything. In order to avoid exhausting the stack, we use the "tail" word to eliminate the tail call. I'll explain how that works later. For now, we know this code just keeps reading a word, compiling it, and repeating that until we reach the end of the input file.
Let's see how "compile" is supposed to work:
: compile \ try user-defined words compile-call? ?exit \ try built-ins compile-colon? ?exit compile-semicolon? ?exit compile-mem? ?exit \ ... more built-ins \ try numbers compile-char-literal? ?exit compile-number? ?exit \ unknown word error 1 halt ;
In the absence of any control flow besides "?exit", we repeatedly try calling different "compile-*" words. Each of these returns true if it manages to compile the word in the string buffer. So at first "compile-call" tries looking up the token as a user-defined word and if it is, emits a call instruction for it. If that didn't succeed, we try compiling any of the built-ins we have. If that didn't work either, we trying compiling the token as a number or character literal. If that doesn't work, the compiler exits with a "1" exit code, meaning it encountered an unknown word.
Now let's see how we compile one of the simpler built-ins:
: compile-mem? false str-length 3 <> ?exit 0 str-ch 'm' <> ?exit 1 str-ch 'e' <> ?exit 2 str-ch 'm' <> ?exit drop 57h emit1 \ push edi ; start of user memory area is in edi true ;
Since we don't have any sophisticated string operators, we have to check the string one character at a time. We first check the string length of course. If the token is three characters long and consists of the characters "m", "e", and "m", we go ahead and compile it into a single instruction: PUSH EDI
We expect an initialization code allocate a block of memory and store it's start address in the EDI register, so executing the "mem" primitive simply puts the value of EDI on the stack.
57h (which is a hex literal), is the x86 op code for "push edi". "emit1" emits a one byte value to the output code. We also have emit2, emit3 and emit4.
Compiling all other built-ins follows the same pattern: we put false on the stack and check if the string matches what we want. If not, we return, leaving "false" on the stack. If everything does match, we drop "false" from the stack, emit machine code, push "true" on the stack and return.
Now let's see how user-defined words are implemented. The colon (:) built-in word does that:
: compile-colon? false str-length 1 <> ?exit 0 str-ch ':' <> ?exit drop read dict-add cur-offset @ , \ write the word start offset in the dictionary entry emit-prologue-if-main emit-word-preamble true ;
Let's unpack this one step at a time.
First, we check if current token is indeed a colon. If not, we return false. Otherwise we read another token. This will be the name of the user-defined word.
dict-add is the important part here. It adds the token in the string buffer (the name after colon) to the dictionary, which is a simple key-value map. After that we store the current code offset in the dictionary entry, that is, we associate the name with its code address.
The dictionary is a linked list. We keep a pointer to its tail (the last word added), and each entry has a link to the previous one. A zero in the "previous" field marks the beginning of the dictionary. So dict-add allocates a dictionary entry, links it to the last entry, marks it as the new last entry, and copies the name.
How does memory allocation work? This is pretty much how things are done in normal Forths too. We have a block of memory, and we keep a pointer to where we haven't written anything yet. We call this pointer "here". The comma word (,) reads a value from the stack and writes it to where "here" points to, and then advances "here" to the next free location.
Back to the code above, "dict-add" creates a dictionary entry. After it's done, "here" points to where we can write the "value" of the dictionary entry. A dictionary entry has a "key", which is the word name, and a "value" which is the start code offset of the word. So after "cur-offset @ ," the name after colon becomes associated with current code offset.
emit-prologue-if-main and emit-word-preamble are not terribly important here. The first emits some startup code if the "main" word is being defined. The other, emits some "preamble" code at the beginning of each definition. I'll explain this preamble later.
This is the code that compiles a call to a user-defined word:
: compile-call? dict-find dup 0= ?exit \ we now have a pointer to the value of the dictionary entry (after the \ header). the cell contains an offset to call (which should be converted to a \ relative offset first). @ \ read the actual offset from the dictionary entry 0ec87h emit2 \ xchg ebp, esp cur-offset @ 5 + - \ convert offset to relative value (5 is the length of the call instruction) 0e8h emit1 \ call ... emit4 \ relative address to call 0ec87h emit2 \ xchg ebp, esp true ;
This looks up the name in the dictionary. dict-find performs the lookup. If it returns zero, the name does not exist in the dictionary. If it is found, it returns a pointer to the "value" of the relevant dictionary entry, that is, where we stored the code offset.
Let's ignore the "xchg ebp, esp" instruction for now. I'm going to explain that later. We now need to emit a "call" instruction. The operand of the call instruction is an address relative to the instruction immediately after call. So we add 5 to the current offset, because a call instruction is five bytes long, and then subtract that from the offset dict-find returned.
A tail call is a call that happens as the last action in a subroutine. There's no code after the tail call, we simply return whatever the call returned. These calls can usually be eliminated depending on the programming language. In Forth, there's no "stack frame" as other programming languages, so a tail call can always be converted to a jump instruction without any adjustments to the stack.
This is useful for us, since we don't have any loops in our language yet, and we rely heavily on recursion. The "tail" word is going to help with that. If we eliminate tail calls, we can use recursion instead of loops without having to worry about exhausting the stack. At the earliest versions of Starforth though, I relied on a trick I learned from an IOCCC entry that implemented a simple Forth. This is the definition of the "tail" word:
: tail r> r> drop >r ;
Forth has two stacks, the parameter stack (which is the normal stack used for most operations), and the return stack, which keeps the return address when a word is called.
The r> word moves an item from the return stack to the parameter stack. The >r word does the reverse and moves an item from the parameter stack to the return stack.
When "tail" is called, the top of the return stack contains the address of the instruction right after "tail", which should be a tail recursive call. We move this to the parameter stack. We then move the item right behind it, which is the address of the caller of the word that called "tail". We drop this second address and move the "tail" return address back. So at the end of "tail" we return to where we should have, but one address before that is gone. This way, "tail" prevents the stack from growing without bounds. There's a small problem though. The return address of the word that called the caller of "tail" is removed. For recursive calls that's fine. But we're also eliminating the address of the initial caller. So if "main" calls "compile-all" and "compile-all" recursively calls itself using "tail", we actually never return to main, we try to return to one level above that when "compile-all" is done. This won't work obviously. In order to fix that, we add an extra caller to all words that recursively tail-call themselves. So this will be the actual definition of "compile-all":
: _compile-all read eof? ?exit compile tail _compile-all ; : compile-all _compile-all ;
When "_compile-all" is done, we never return to "compile-all", but return to its caller, which is just fine.
We need to interface with the operating system for certain tasks, like reading and writing files, or exiting the program. Since we only target Linux, we need to call Linux "system calls". Linux system calls on x86 are called using the "int 0x80" instruction. Each system call has a number that is stored in the EAX register. If the system call has any parameters, these are stored in EBX, ECX, EDX, ESI, EDI, and EBP (so a maximum of six arguments) in that order. As an example, one of the simplest system calls is EXIT, which has the number 1 (in 32-bit x86 at least). The following code compiles the "halt" built-in that exits the program using this system call:
: compile-halt? false str-length 4 <> ?exit 0 str-ch 'h' <> ?exit 1 str-ch 'a' <> ?exit 2 str-ch 'l' <> ?exit 3 str-ch 't' <> ?exit drop 0xb8 emit1 \ mov eax, ... 0x00000001 emit4 \ ...1 (EXIT system call) 0x5b emit1 \ pop ebx 0x80cd emit2 \ int 0x80 true ;
The generated code will read a number from the stack and uses that as the first and only argument to the system call (which is exit code). It also sets 1 in EAX and performs "int 0x80".
I initially had built-ins like "halt", "read", "write", etc to allow a program to interface with the operating system. But in the end, I decided to remove those and add a number of built-ins that allow a program to directly call any system call. These built-ins are called "syscall0", "syscall1", "syscall2", all the way through "syscall6". The number at the end shows the number of arguments the system call receives.
Exiting the program with exit code 42 then becomes:
42 1 syscall1
As we said before, Forth has two stacks: a parameter stack and a return stack. An x86 CPU however only supports one stack. This is setup for us by Linux and its address is stored in the ESP register. The code generated by Starforth then allocates a block of memory for the return stack and stores its address in EBP.
Whenever we want to perform an operation that needs to work on the return stack, we perform an "xchg ebp, esp", which swaps the values of EBP and ESP. Afterwards, we emit another "xchg ebp, esp" to switch back to the normal stack. This is a little inefficient of course. We'll make it work slightly better in future versions.
Here are the places where we issue an xchg instruction:
- Before and after each call instruction
- Before a "ret" instruction
- Right at the beginning of a word definition.
So when we want to execute a user-defined word, an xchg instruction is run first to switch to the return stack, we then perform the call which stores the return address on the return stack as we want. Right at the beginning of the callee code, there will be another xchg that switches back to the parameter stack. Before returning, we do another xchg to switch to the return stack, allowing "ret" to read the return address from the correct stack, and then we get to the point after the call instruction that does yet another xchg to switch back to the normal stack.
In order to keep the bootstrap compiler as compact as possible, these are the only built-in words that are supported by the compiler at version 1:
- ":" start a word definition
- ";" end a word definition
- "ps@" get the current value of the parameter stack pointer
- "ps!" write a new address to the parameter stack pointer
- "rs@" get the current value of the return stack pointer
- "rs!" write a new address to the parameter stack pointer
- "@" read the value of a cell (4 bytes) from memory
- "!" write a value to a cell in memory
- "and" bitwise AND two integers
- "-" subtract two integers
- "*" multiply two integers
- "lsr" logical right shift an integer
- "mem" give the pointer to the start of user memory
- "?exit" exit current word if the top of stack is true (non zero)
- "0<" put true (-1) on the stack if current top of the stack is less than zero, otherwise put false.
- "syscall0"-"syscall6": perform a Linux system call with 0 to 6 arguments.
This is very minimal in order to keep the bootstrap compiler small. So initially even "+" is implemented in terms of "-".
Tests are awfully important for a compiler. You keep changing the internals of the compiler, including how language constructs are translated into machine code. We need to make sure things behave the way they should, and they keep behaving like that when we make changes. That's why I had to write some sort of test suite early.
In addition to the language constructs, the test suite also tests some of the compiler internals, so it needs to be able to call those. In addition, since the core language is very minimal, we need access to the utility words written in the compiler. Trouble is, we don't have any library support, or even programs consisting of multiple files. This shell script is what I came up with:
#!/bin/sh # remove the main word in startforth.f, add the test file (which includes its # own main word) and compile the result. { sed 's/: main .* ;//' starforth.f && cat test.f; } | ./sf exit_code="$?" if [ "$exit_code" -ne 0 ]; then echo "Compilation failed with code $exit_code." exit 1 fi # Run the tests ./a.out exit_code="$?" if [ "$exit_code" -ne 0 ]; then echo echo "Some tests failed. Exit code=$exit_code" exit 2 fi
This combines the the compiler source code (starforth.f) with the test suite (test.f), but removes the compiler's "main" definition so we won't have two "main" words. It them compiles the combined source code and runs it.
The test suite itself is just a number of test words, with a "main" that simply calls them, and makes sure each has returned true.
It's time to confess that I did a bit of cheating to make things a little easier for myself. I wrote a couple of other tools to help me with development and debugging. These are not used for bootstrapping, and were actually removed in future versions, but they were still really useful to get me to version 1.
I first wrote a "reference compiler" in Python. This is a small script named "refc.py" and it can compile the Starforth source code into an ELF executable. It basically does the same thing as the compiler itself (starforth.f) and also as the bootstrap compiler (bootstrap.s), but since it's in Python, I could write it very quickly and use it for debugging.
I also wrote a "reference interpreter" in C (refi.c). This allowed me to step through the source code, examine memory and registers and track bugs much more easily.
Both of these are actually capable of bootstrapping the version 1 compiler, but of course they are not used in actual bootstrapping (even though the verify.sh script calls refc too, and compares the result with what the bootstrap compiler generates).
One interesting use of the refc script was to help with debugging in gdb. Let's say there's a crash in the compiler. I go to gdb, but gdb can only show me assembly code and raw addresses. How can we get a stack trace? Enter refc! One interesting thing that refc does when compiling the source code is logging each definition that it compiles along with its address. We can write these into a text file as a sort of poor man's debug info. Then in GDB we examine the contents of where ESP and EBP registers point to. It's usually pretty easy to spot which one contains the return stack and which the parameter stack, since the return stack almost always contains only code addresses.
So by examining memory, we can see the call stack in raw form. By cross referencing these values with what refc has logged, we can map each address to a word name. I also wrote a script (findaddr.py) to help me with this cross referencing. Debugging is still rather difficult, but not impossible.
In this first version, I tried to make things as minimal as possible, even going so far as to not have a "+" word and implementing that in terms of "-". I could have gone even more minimal than this by removing comments, character literals and hex literals. Technically none of those are really necessary, but without them the code would become much harder to read. So I drew a line at that. I can work without built-in addition, but not without comments!
Here are links to some of the materials that helped me.
"Jones Forth" is a thoroughly commented implementation of a small Forth written in assembly:
https://github.com/nornagon/jonesforth
Here are the design notes for the IOCCC entry I mentioned in the article:
https://www.ioccc.org/1992/buzzard.2.design
And finally here's the code repository for preForth:
https://github.com/uho/preforth
We now have a compiler that can compile its own source code. I plan to write more blog entries describing how the language and the compiler evolved after that. Stay tuned!