This is a new attempt at defining a "can't fail" scripting language. The general idea is to design a language that has no error state and can successfully parse and execute any string of binary data (subject to caveats like resource limits, not guaranteeing that any particular program can finish executing, and so on).
My first attempt, with some more background
I've been thinking about this again recently, and on re-reading my earlier writing, I realized that my current thoughts are somewhat different. I ended up writing another modest article, and I think it's closer to my original concept than my first attempt was.
I think I've gone as far as I want to for now. Maybe in a year or two I'll sit down and actually try to build an implementation.
If I understand my terminology correctly, this is a stack-based, concatenative programming language. It uses post-fix notation. Please understand that I haven't really studing programming language design, or any stack-based languages, in any detail, although I read bits and pieces on the internet from time to time. A lot of my explanations of general programming concepts will probably be excessivly simple, or wrong, or both.
The interpreter will successfully parse any binary file, however it is designed with UTF-8 input in mind.
The syntax consists of sequences of non-whitespace bytes separated by sequences of whitespace bytes. Each sequence of non-whitespace bytes is called a "word".
There are two kinds of whitespace bytes: word separators and line separators.
There are two word separator bytes:
There are two line seprator bytes:
Each sequence of bytes between word or line separators is called a word.
Each value on the stack is a sequence of bytes of arbitrary length.
Each value is stored as a sequence of 8-bit bytes. The byte sequence \x71\x65\x64 could be interpreted as the character string "qed" or the decimal number 7,431,524, or as the boolean value "true", or in many other ways depending on how each function is designed to parse it.
Realistically, I don't think I would enjoy working with a language that doesn't have data types. However, the goal of this language could be summarized as "always do something", so it's more fitting for every value to be interchangeable with any other. A more sensible language would have typed values, and operations on unexpected values would produce a "error-type" value (but not crash the program!). But that's too sensible and not interesting to me at this time.
Here is some discussion about some common data types and how they could be represented in this language by byte strings of arbitrary length.
A value may have zero length. This is called `null` or an `empty value`. In this documentation, an empty value is represented as [].
In an unsigned integer, each bit indicates a value of 2^(n-1), where n is the bit position, starting from the right-most bit in the right-most byte.
Unlike most modern programming languages, the parser doesn't attempt to detect "number-like" strings. The programmer must manually identify them using the backslash "\" numeric byte string syntax. That is, the input "9 10 plus." would give the result [1i], which would be surprising to someone expecting an answer in decimal numbers.
Like every value, an unsigned integer may be of any arbitrary length. Some operations will overflow based on the length of the input values. Other operations will increase the length of the value instead of overflowing.
For functions that work with unsigned integers, the overflowing technique is usually the "standard" style, to ensure that operations are fully reversible. That is, an underflow followed by an equivalent overflow will result in the orginal number, but an underflow followed by an "increase length" operation will result in a very different number.
In a signed integer, the left-most bit in the left-most byte is used to indicate positive or negative. If the bit is 0, the number is positive, and the remaining bits work the same as an unsigned integer. If the bit is 1, the number is negative, and the remaining bits are equal to the unsigned underflow of zero minus the absolute value of the negative number.
For functions that work with signed integers, the "increase length" technique is usually the "standard" style, instead of over/underflows. If an operation would cause the number to exceed the size of the value, the length is increased to fit the new value instead. In this way, the minimum and maximum values of an unsigned integer are limited only by the resources available to the interpreter.
A floating point number is defined with two parts: the mantissa and the exponent. Both values are base-2 numbers (binary floating point).
The exponent is stored in the left-most bytes and the mantissa is stored in the right-most bytes. Both values may be of arbitrary length. Both values may be either positive or negative.
Starting from the left-most byte, if the left-most bit of that byte is 1, then the byte belongs to the exponent. Continuing right-ward, if the left-most bit of that byte is 1, then it also belongs to the exponent. Continue until the left-most bit of a byte is 0. That byte, and all bytes to the right of it, belong to the mantissa. The two left-most bits of the left-most byte of the mantissa are swapped, and then the entire mantissa is treated as a signed integer. (This is another way of saying that the second-left-most bit in the left-most byte of the mantissa is the sign bit.)
If the left-most bit of the left-most byte is 0, then the exponent is zero and all bytes belong to the mantissa. If the left-most bit of every byte is 1, then the mantissa is zero. For this reason (and others), the same number can be represented in more than one way.
When combining two floating point numbers, the length of the resulting number may be as long as the combined length of both inputs.
(Maybe the mantissa sign should be the left-most bit, before the exponent section? I don't know that it matters really.)
Probably there should be some functions for properly handling byte strings as Unicode.
If the interpreter runs out of allocated memory, let's say that it just drops the oldest items from the stack until it recovers enough memory to continue. No problems with that approach right???
If the interpreter attempts to make a single value that is longer than the total memory available to it, let's say that it truncates the value to the available memory space. (That hardly seems worth it, since it will just drop the value as soon as another one is pushed onto the stack...)
In this way, we can fool-hardily commit to the goal of eliminating (or at least minimizing) error states at the cost of difficult-to-predict behavior that will vary based on available resources.
The stack is a list of values in a sequence. Note that each value may be of any length, so the number of values in the stack and the size of the stack in bytes are not predicatably related.
To add a value to the stack, we say that we "push" it. To remove the most recently pushed value from the stack, we say that we "pop" it.
There are always an infinite supply of null values on the bottom of the stack. A stack that only contains null values is called an empty stack. If a null value is popped onto an empty stack, the effect is the same as if it had been discarded.
In this plain-text documentation, values on the stack are written inside square brackets separated by spaces, like these four example values:
[apple] [\x01] [] [day]
In addition to the stack, the language maintains a dictionary of defined words. This is how the language handles concepts like functions and variables. The interpreter can replace a word with its definition from the dictionary.
The parser reads the input one word at a time. Each word is evaluated before moving on to the next word.
If the word matches a dictionary key (either a built-in function, or a user-defined function), then the entry at that key is prepended to the input and parsing continues.
If the word doesn't match any function, then the word is placed onto the stack as a new value.
In this plain-text documentation, input to the parser is represented inside of quotation marks, like this example input sequence:
"apple :day \x01 doctor"
All input is parsed as a byte, so the input "1" in a UTF-8 file has a decimal value of 49.
The back-slash character "\" (\x5c) instructs the parser to interpret the next bytes as numeric digits instead of raw bytes.
As soon as the parser encounters a character outside the expected range, any encountered digits are converted to one or more appropriate bytes and parsing continues as normal from the following character. A backslash followed by any unexpected character is equivalent to "\0", a zero-value byte.
Numeric byte sequences can be used to specify spaces, line breaks, or other special characters inside values. For example, "one two" evaluates to [one] [two], but "one\x20two" evaluates to [one two].
Maybe numeric parsing also supports underscores "_" (\x5f) as a digit separator?
Maybe we also need a way to explicitly end a numeric byte sequence, in case you want to, for example, follow a numeric string directly with a number character? e.g. if you want "\x20" followed by characters "12", it would look like "\x2012" which would be parsed as a "2012" hex value, which is not desired in this case. As a work-around, you could either specify the byte sequence as a separate word and then join them, or else specify the following numbers as an appropriate byte sequence also. Neither approach is ideal, but both are valid. Maybe borrow the equal sign (=) which is used in base64 at the end of a sequence?
A numeric byte indicator may appear anywhere in a word, and will always be parsed before evaluating the word. "\x3atest" and ":test" will evaluate in the same way.
When it appears as the first byte in a word, the colon character ":" (\x3a) indicates that the remainder of the word should be added to the stack as a value, not evaluated as a function. A colon and the word after it are called a quoted text sequence.
For example, "one dup." evaluates to [one] [one], but "one :dup." evaluates to [one] [dup.].
A colon appearing anywhere after the first byte in a word has no special behavior. It is parsed as a normal byte.
A colon followed by a word or line separator produces the nil [] value (a value of zero length).
The special word consisting of only the tilde character "~" (\x7e) indicates that the remainder of the line must be added to the stack as a single value, not parsed as individual words and not evaluated.
Any word separators immediately after the tilde and any line separator that terminates the quoted text sequence are not included in the value that is pushed to the stack.
A tilde followed by only word and line separators produces the nil [] value (a value of zero length.
I've tried very hard to avoid any syntax with both an opening and closing tag. For example, the most common way to represent a text string in other programming languages is to use the quotation mark (") as both an opening and closing tag for the string. Some languages support multi-line text strings that begin and end with something like a triple-apostrophe (''').
For stylistic reasons, I didn't want any closing tags, but I don't think that my solutions are very satisfying. They make it difficult to quote and un-quote code, or to clearly define functions. Oh well. Maybe I'll come up with something else later on.
The special word consisting of only the number sign "#" (\x23) indicates that the remainder of the line is a comment, and should not be parsed at all.
Sorry, I don't know how to write this formally, so this is a very loose summary.
Starting from the first byte in the input sequence...
1. Discard any word separators and line separators until a non-separator value is encountered.
2. When encountering a backslash "\", parse the next characters as numeric values instead of raw bytes, as described earlier. Continue until any unexpected characters are found, then resume parsing the word as normal.
3. When encountering a word separator, attempt to parse the word:
3.1. If the word is the special keyword "~", then parse the remainder of the line as a single value.
3.2. If the word is the special keyword "#", then discard the remainder of the line without parsing it.
3.3. If the first character in the word is a colon ":", then add the remainder of the word as a new value on the stack.
3.4. If the word matches a key in the dictionary, then prepend the matching dictionary value to the input sequence and continue parsing.
3.5. Otherwise, add the word as a new value on the stack.
4. Return to step 1 to resume parsing the next word.
A function word is a word that is defined in the dictionary. When a word is defined in the dictionary, evaluating that word can change the input sequence, which can cause the interpreter to add or remove values on the stack.
By convention, the names of function words are usually lowercase alphabetical characters ending in a punctuation mark. By convention, function words that read or modify the stack end in a punctuation mark, while function words that add to the stack without reading it (basically variables) do not. By convention, a function word ending in a question mark "?" (\x3f) will push either [\0] or [\1] onto the stack. By convention, most other functions end in a period "." (\x2e).
The rest of this document starts to list out the kind of functions that I would expect to see in this language's standard library. This is maybe the least interesting part of the language, and if I was implementing this for real I would probably be a lot more thorough about learning what functions are defined in other stack languages.
Each function definition below also indicates the effect on the stack using this common stack effect notation. The stack effect line begins with a number sign "#" to indicate that this is a comment. The items before the double-dash "--" represent values on the stack before the function, and the items after the double-dash represent what will be on the stack in place of those values after the function is evaluated.
For example, this indicates that the positions of the top two values will be reversed:
# a b -- b a
Literal stack values are enclosed in square brackets.
For convenience, some characters with special meaning to the parser are named as function words.
space # -- [\x20] tab # -- [\x09] new-line # -- [\x10] return # -- [\x13] backslash # -- [\x5c]
Pushes the corrersponding character's byte value onto the stack.
null # -- []
Pushes a null value onto the stack.
a drop. # a --
Discard the top value on the stack.
a dup. # a -- a a
Push a copy of value a onto the stack.
a b swap. # a b -- b a
Swap the order of the top two values on the stack.
a b c rotate. # a b c -- b c a
Bring the third item of the stack to the top, pushing the two above it down in the stack.
In these functions, the length of the result will be the same as the length of the longest input value. The result will overflow if it would exceed the length of the longest value, or underflow if it would go below zero.
For example, "\x03 \x0008 subtract." will produce [\xfffa], and "\x000003 - \x0008 subtract." will produce [\xfffffa].
(I also want to define some "unbounded" math functions that will increase the length of the output instead of over/underflowing, but that didn't seem appropriate for the standard operations.
a b add. # a b -- c
Pop a and b, and push the result of a plus b. If either value is null, the result is null.
a b subtract. # a b -- c
Pop a and b, and push the result of a minus b. If either value is null, the result is null.
a b multiply. # a b -- c
Pop a and b, and push the result of a times b. If either value is null, the result is null.
a b divide. # a b -- c d
Pop a and b. Push the integer result of the division as c and the integer remainder as d.
("divide. drop." would get the integer division result only. "divide. swap. drop." would get only the remainder. Maybe these deserve to be their own named functions?)
Assume there are some functions for working with signed integer data. I'm not going to write them all out.
Again, assume that there are the usual functions for working with floating point numbers. See the section on floating points in the Data Types section above for some pointers.
By convention, the null value and values of any length that only contain \0 bytes are considered "falsey", and any other value is considered "truthy". Boolean functions will return [\1] for true results and [\0] for false results.
true # -- [\x01] false # -- [\x00]
Add the true or false value to the stack.
a true? # a -- b
Pop a, and push [\x01] if the value is truthy or [\x00] if the value is falsey.
a false? # a -- b
Pop a, and push [\x01] if the value is falsey, or [\x00] if the value is truthy.
a b or? # a b -- c
Pop a and b, and push [\x01] if either value is truthy, or [\x00] if both values are falsey.
a b and? # a b -- c
Pop a and b, and push [\x01] if both values are truthy, or [\x00] if either value is falsey.
a b xor? # a b -- c
Pop a and b, and push [\x01] if exactly one value is truthy, or [\x00] if both values are truthy or both values are falsey.
These perform exact comparisons between values.
a null? # a -- b
Pop a, and push [\x01] if a is null, or [\x00] otherwise.
a b equal? # a b -- c
Pop a and b, and push [\x01] if both values are equivalent (differing only in leading zero values) or [\x00] if they are not. If either value is null, the result is null (even if they both are; use "exact?" or "null?" to test for null values).
a b exact? # a b -- c
Pop a and b, and push [\x01] if both values are the exactly the same (same length and same bytes), or [\x00] if they are not. If exactly one of the two values is null, the result is null.
a b greater. # a b -- c a b greater-or-equal. # a b -- c
Pop a and b, and push [\x01] if a is greater than b (>) or greater than or equal to b (>=), or [\x00] if not. If either value is null, the result is null.
a b less. # a b -- c a b less-or-equal. # a b -- c
Pop a and b, and push [\x01] if a is less than b (<) or less than or equal to b (<=), or [\x00] if not. If either value is null, the result is null.
These function manipulate strings of bytes. These are not designed to handle Unicode strings. I should define a separate set of functions for handling proper Unicode strings.
Note that an empty byte string is the null value.
a print. # a --
Pop a. Write the value to standard output. (Probably handle non-UTF-8 characters or sequences somehow.)
a length. # a -- b
Pop a. Push the length of a (in bytes) onto the stack.
a b cat. # a b -- c
Pop a and b, and push the concatenation of a and b onto the stack.
a b c replace. # a b c -- d
Pop a and b, and push a modification of string a, where each pattern matching string b has been replaced by string c. If a is null, the result will be null. If b or c is null, the result will be a.
a b c substring. # a b c -- d
Pop a, b and c, and push the portion of the string a from byte b to byte c.
a explode. # a -- a...
Pop a, and push each byte of the string onto the stack.
a reverse. # a -- b
Pop a, and push a modification of a with bytes in reverse order onto the stack.
I would put these into a library, except that they are needed for importing, which seems fundamental?
(What exactly is a file reference? Is this just an arbitrary integer that points to more information outside the stack?)
a open. # a -- b
Pop a. Open a connection to the file on disk at location a. Push a file reference to the stack.
a close. # a --
Pop a. Close the connection to the file on disk at location a.
a b write. # a b --
Pop a and b. Write the value a to the file b (if b is a valid file reference).
(Or maybe it's better to swap the order of the file reference and the output data?)
a read-word. # a -- b a read-line. # a -- b a read-file. # a -- b
Pop a. Read the next word, line, or the remainder of the file and push the value onto the stack. Lines and files are not parsed or split. They are placed on the stack as a single value.
(We must keep a cursor somewhere that indicates where we are in the file, so we can read additional lines? Probably the responsible approach is to just put a number on the stack and use it as an input into any of the read functions.)
Be careful! The language doesn't prevent you from overriding built-in functions with your own functions!
a b set. # a b --
Pop a and b. Store the value a in the dictionary under the key b.
After b is defined, when the parser evaluates the word `b`, `a` will be evaluated in its place.
Be careful to specify both parameters as literal text, especially if you want to overwrite an existing function.
Example:
~ dup. add.
:double. set.
The first line stores [dup. add.] onto the stack. The second line adds [double.] to the stack, and then stores [dup. add.] in the dictionary under the key "double.".
a set? # a -- b
Pop a. Push [\x01] if the dictionary contains an entry under the key a, or [\x00] if it doesn't. Be sure to specify a as a text literal, not call the function.
a do. # a -- ..
Pop a. Evaluate the dictionary value with key a. For example, ":add. do." is the same as "add.". Note that, unlike normal parsing, if the function isn't found, then nothing will be placed on the stack. (Or should it do that still?)
a b import. # a b -- ..
Pop a and b. Load a file from location a on disk. Parse the contents of the file before continuing to parse the remaining input.
Both the dictionary and the stack will be modified by commands in the parsed file. Ideally, the file should be written to leave an unmodified stack, and to use a namespace for any functions that it defines.
(I wanted to support self-contained libraries, but this looked like it would add too much complexity.)
a quote. # a -- b
Pop a. Replace any word separators, line separators, or numeric byte indicators characters in the value a with an appropriate numeric byte sequence, and push the result. (Is this sufficient? Or is this a fool's approach?)
a unquote. # a -- b
Pop a. Evaluate the value and push the result.
emptyhallway
2023-12-13