Janet does not have native, built-in regular expressions.
You *can* use a third-party regular expression library if you really have to, I dunno, validate an email address or something. But most of the time, if you're writing Janet, you'll be writing PEGs instead.
PEG stands for "parsing expression grammar," which is a mouthful, so I'm going to stick with the acronym, even though I just wrote a whole chapter about macros without abbreviating AST once.
As a first (extremely crude) approximation, you can think of PEGs as an alternative notation for writing regular expressions. That's not actually correct (PEGs are quite a bit more powerful than regular expressions, for starters, and they behave differently in a few respects), but we have to start somewhere, and this will let us re-use a lot of our existing knowledge.
Here, let's look at a few random regular expressions, and see how we'd write them in PEG format.
regex: .*
peg: (any 1)
`1` means "match one byte." `any` means "zero or more."
regex: (na)+
peg: (some "na")
Strings match literally. There are no special characters to escape. `some` means "one or more."
regex: \w{1,3}
peg: (between 1 3 (choice :w "_"))
peg: (between 1 3 (+ :w "_"))
Janet's `:w` does not include `_`, so we use `+` to say "word character or underscore." `(+ ...)` is an alias for `(choice ...)`. `between` is inclusive on both ends.
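regex: [^a-z-]
peg: (not (choice "-" (range "az")))
peg: (! (+ "-" (range "az")))
`(! ...)` is an alias for `(not ...)`. You can negate any PEG, not just character classes. One difference from the regex: `not` is a lookahead that consumes no input, so to also consume the character that passed the test, as `[^a-z-]` does, you'd write `(if-not (+ "-" (range "az")) 1)`.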
regex: [a-z][0-9]?
peg: (sequence (range "az") (opt (range "09")))
peg: (* (range "az") (? (range "09")))
`*` matches all of its arguments in order. `(* ...)` is an alias for `(sequence ...)`. `?` means "zero or one," and `(? ...)` is an alias for `(opt ...)`.
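A quick way to check translations like these is to hand them to `peg/match`, which the chapter introduces properly further on. The following is only a sketch: the variable names are made up for illustration, and the `~` in front of each pattern is the quasiquote that gets explained a little later.

```janet
# peg/match returns an array of captures on success (empty here, since
# nothing is captured) and nil on failure.
(def na ~(some "na"))
(def short-word ~(between 1 3 (+ :w "_")))
(def letter-digit ~(* (range "az") (? (range "09"))))

(pp (peg/match na "nanana"))        # @[]
(pp (peg/match na "batman"))        # nil
(pp (peg/match short-word "a_1"))   # @[]
(pp (peg/match short-word "???"))   # nil
(pp (peg/match letter-digit "x7"))  # @[]
(pp (peg/match letter-digit "7x"))  # nil
```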
Those are pretty random examples, and this is nowhere near an exhaustive list, but it's enough for you to start forming a general idea. Let's notice a few things from this:
I wouldn't want to write PEGs for, like, searching in my text editor, but in code I think the verbosity is almost always a good thing: it makes them easier to read and easier to modify.
And when you're writing "real" PEGs, you will break up large patterns into smaller, named components, which will prevent any single pattern from becoming unwieldy.
Like `(+ first second third)`. That's not addition; it's choice. How does that work?
Well, I didn't state this explicitly, but PEGs are actually written *quoted*. A PEG is not `(some "na")`; it's actually `['some "na"]`. Nothing named `some` gets called here (Janet's core `some` function is unrelated); the symbol itself is meaningful to the functions that consume PEGs.
It's conventional to write PEGs as quasiquoted forms: `~(some "na")`, so that you can easily interpolate other values into them. (We'll get to that soon.)
This makes it easy to *compose* PEGs out of smaller pieces, which we'll start to do soon, or to write functions that manipulate PEGs in the same way that we are used to manipulating abstract syntax trees.
PEGs aren't Janet abstract syntax trees, but you can see that they have a lot in common: they represent a tree structure out of nested tuples, lots of quoted symbols, and some numbers or strings or other values mixed in as well. In fact there is a general term for this kind of value: both abstract syntax trees and PEGs are examples of "symbolic expressions."
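To make the "PEGs are just data" point concrete, here is a minimal sketch of composing patterns with quasiquote and unquote. The names `digit`, `two-digits`, `separated`, and `clock` are hypothetical, chosen only for this example.

```janet
# A quasiquoted PEG is an ordinary tuple of symbols, strings and numbers.
(def digit ~(range "09"))
(pp (tuple? digit))  # true
(pp (first digit))   # range (a symbol, nothing is called)

# Unquote (,) drops an existing pattern into a bigger one.
(def two-digits ~(* ,digit ,digit))

# Hypothetical helper: one `patt`, then any number of `sep patt` pairs.
(defn separated [patt sep]
  ~(* ,patt (any (* ,sep ,patt))))

(def clock (separated two-digits ":"))
(pp (peg/match clock "12:34:56"))  # @[]
(pp (peg/match clock "1x"))        # nil
```

Because `separated` is a plain function returning a plain tuple, it can be tested and reused like any other piece of code.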
Alright, now let's talk about some of the ways that these patterns differ from their regular expression equivalents.
First off, PEGs are always anchored to the beginning of the input, so there's no equivalent "start of input" pattern. So `(any 1)` is *actually* equivalent to the regular expression `^.*`.
Except, no, that's not strictly true. Because PEGs *do not backtrack*. Which means that all repetition is implicitly "possessive," to use the regular expression term. So `(any 1)` is *actually actually* equivalent to `^.*+`, which is not a construct that JavaScript's regular expression engine supports.
The distinction is irrelevant in this case, but it matters for something like `[ab]*[bc]`: that will match `bbb`, but `[ab]*+[bc]`, or the equivalent PEG `(* (any (+ "a" "b")) (+ "b" "c"))`, will not.
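To see the possessive behavior concretely, here is a small sketch that runs the PEG from the paragraph above against two inputs. The name `ab-then-bc` is made up for illustration.

```janet
# PEG version of [ab]*+[bc]: `any` grabs every "a" or "b" it can and
# never gives one back.
(def ab-then-bc ~(* (any (+ "a" "b")) (+ "b" "c")))

# `any` swallows all three b's, so nothing is left for the final choice.
(pp (peg/match ab-then-bc "bbb"))  # nil

# Here `any` stops at the "c", which the final choice can then match.
(pp (peg/match ab-then-bc "bbc"))  # @[]
```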
PEGs *do* backtrack when using the `choice` combinator, as well as a few others. But backtracking is always obvious and explicit, as opposed to regular expressions' implicit backtracking everywhere. This makes it less likely that you'll accidentally write a PEG that executes in exponential time.
Alright. There's one more thing we should talk about before we get to a concrete example: numbers.
We've seen `1` already, as a way to match any byte. You can write any other integer (`2`, say, or even `3`) to match exactly that number of bytes.
But you can also write *negative* numbers. Negative numbers don't advance the input at all, and they fail if you *could* advance that many characters. So `-4` will fail unless there are *fewer* than four bytes left in the input. In practice I've only ever used this to write `-1`, which means "end of input". I don't think `-1` is a particularly intuitive way to write "end of input," so I wanted to call this out ahead of time.
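Negative numbers are easier to believe once you watch them succeed and fail. This sketch uses two made-up pattern names, `exactly-abc` and `ab-then-short`, to show `-1` and `-4` in action.

```janet
# -1 succeeds only when no bytes are left.
(def exactly-abc ~(* "abc" -1))
(pp (peg/match exactly-abc "abc"))   # @[]
(pp (peg/match exactly-abc "abcd"))  # nil

# -4 succeeds only when fewer than four bytes remain after "ab".
(def ab-then-short ~(* "ab" -4))
(pp (peg/match ab-then-short "abcde"))   # @[] (three bytes left)
(pp (peg/match ab-then-short "abcdef"))  # nil (four bytes left)
```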
Now that we've covered the basics, let's look at a real example. Let's write an HTML pretty printer.
(defn element-to-struct [tag attrs children]
  {:tag tag :attrs (struct ;attrs) :children children})

(def html-peg (peg/compile
  ~{:main (* :nodes -1)
    :nodes (any (+ :element :text))
    :element (unref
      {:main (/ (* :open-tag (group :nodes) :close-tag) ,element-to-struct)
       :open-tag (* "<" (<- :w+ :tag-name) (group (? (* :s+ :attributes))) ">")
       :attributes
         {:main (some (* :attribute (? :s+)))
          :attribute (* (<- :w+) "=" :quoted-string)
          :quoted-string (* `"` (<- (any (if-not `"` 1))) `"`)}
       :close-tag (* "</" (backmatch :tag-name) ">")})
    :text (<- (some (if-not "<" 1)))}))

(defn main [&]
  (def input (string/trim (file/read stdin :all)))
  (pp (peg/match html-peg input)))
Okay wow; we're just diving right in huh.
First off, this isn't really an HTML pretty printer; this is only an HTML *parser*. Well, strictly speaking, it's a parser for a small subset of HTML: enough to make a point, without getting bogged down in minutiae.
So what are we looking at here?
First off, the outer pattern is a struct. The keys are names, and the values are patterns, and these patterns can reference other patterns by name, even recursively. Even *mutually* recursively, as you can see with `:nodes` and `:element` referring to one another.
We've seen named patterns like `:w` before, when I said it was an analog of regular expressions' `\w`. But those are only the default pattern aliases, and by writing a struct like this we can create our own custom aliases, with scoping rules that make sense: patterns inside nested structs can refer to elements in the "outer struct," but not the other way around.
Okay. Now let's try to go through these individual patterns and make sure we understand them.
:main (* :nodes -1)
:nodes (any (+ :element :text))
The name `:main` is special, as that will be the pattern's entry-point. This `:main` just calls `:nodes`, which matches zero or more `:element`s or `:text`s, and then asserts that there's no input left. `(* "x" -1)` is like the regular expression `^x$`.
:text (<- (some (if-not "<" 1)))
`:text` uses a combinator that we haven't seen before: `<-`.
`<-` is an alias for `(capture ...)`. We haven't talked about captures yet, but they work similarly to regular expressions' captures.
Just to quickly review, consider the regular expression `<([^<]*)>`. The parentheses around the innards there mean that there is a single "capture group," and if we run this expression over a string, we can extract that match:
node
Welcome to Node.js v16.16.0.
Type ".help" for more information.
> /<([^<]*)>/.exec('<hello> there')
[ '<hello>', 'hello', index: 0, input: '<hello> there', groups: undefined ]
This returns an array of captured groups. The first element is the entire substring that matched the regular expression; the second is the text that matched the first (and in this case only) capture group.
> /<([^<]*)>/.exec('<hello> there')[1]
'hello'
PEGs work similarly: when you match a PEG over a string, you get a list of captures back.
repl:1:> (peg/match ~(* "<" (any (if-not ">" 1)) ">") "<hello>")
@[]
The list is empty here, because PEGs don't implicitly capture anything. We have to explicitly ask for a capture, using `<-`:
repl:2:> (peg/match ~(* "<" (<- (any (if-not ">" 1))) ">") "<hello>")
@["hello"]
We could also capture the entire matching substring, if we wanted to:
repl:3:> (peg/match ~(<- (* "<" (<- (any (if-not ">" 1))) ">")) "<hello>")
@["hello" "<hello>"]
But note that captures show up "inside out." `(<- pat)` first matches `pat`, which might push captures of its own, and *then* it pushes the text that `pat` matched.
So far this looks basically like regex country. But PEGs allow you to do *so much more* with captures. Here, let's look at a slightly more interesting example:
repl:4:> (peg/match ~(* "<" (/ (<- (any (if-not ">" 1))) ,string/ascii-upper) ">") "<hello>")
@["HELLO"]
We have to unquote `string/ascii-upper` because we actually want the *function* in our PEG, not the symbol `'string/ascii-upper`. This is why we're using quasiquote instead of regular quote.
`(/ ...)` is an alias for `(replace ...)`, which is a misleading name: if you pass it a function, it doesn't *replace* the capture with the function, but actually maps the function over the captured value. And if you pass it a table or a struct, it looks up the capture as a key and replaces it with the value. If you pass any other value, then it actually replaces. (If you really do want to replace a capture with a function or a table, you have to wrap it in a function that ignores its argument.)
So we're mapping the function `string/ascii-upper` over the value captured by `(<- (any (if-not ">" 1)))`, which happens to produce a new string. But it doesn't have to!
repl:5:> (peg/match ~(* "<" (/ (<- (any (if-not ">" 1))) ,length) ">") "<hello>")
@[5]
Our captures can be *any* Janet values; they don't have to be strings. `(<- pat)` always captures the string that `pat` matches, but you can always map it, and there are other combinators that capture other things. Take `($)`:
repl:6:> (peg/match ~(* "<" (* (<- (any (if-not ">" 1))) ($)) ">") "<hello>")
@["hello" 6]
`($)` is an alias for `(position)`. It's a pattern that always succeeds, consumes no input, and adds the current byte index to the capture stack. There's also `(line)` and `(column)`, which do what you expect.
But the most useful capture alternative is the `constant` operator. `(constant x)` always succeeds, consumes no input, and adds an arbitrary value to the capture stack. It's useful for parsing text into something with a little more structure:
repl:7:> (peg/match
  ~(any (+
    (* "↑" (constant :up))
    (* "↓" (constant :down))
    (* "←" (constant :left))
    (* "→" (constant :right))
    (* "A" (constant :a))
    (* "B" (constant :b))
    (* "START" (constant :start))
    1))
  "↑↑↓↓←→←→ B A START")
@[:up :up :down :down :left :right :left :right :b :a :start]
Unconditional capture with `constant` is useful, but note that in this particular case we would probably just write:
repl:8:> (peg/match
  ~(any (+
    (/ "↑" :up)
    (/ "↓" :down)
    (/ "←" :left)
    (/ "→" :right)
    (/ "A" :a)
    (/ "B" :b)
    (/ "START" :start)
    1))
  "↑↑↓↓←→←→ B A START")
@[:up :up :down :down :left :right :left :right :b :a :start]
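Since `replace` behaves differently depending on what you hand it, it may help to see the three cases side by side. This is only a sketch; `add-strings` and `answers` are hypothetical names invented for the example.

```janet
# 1. A function is called with the captures as positional arguments.
(defn add-strings [a b]
  (+ (scan-number a) (scan-number b)))
(pp (peg/match ~(/ (* (<- :d+) "+" (<- :d+)) ,add-strings) "2+3"))  # @[5]

# 2. A struct (or table) is used as a lookup keyed by the capture.
(def answers {"y" :yes "n" :no})
(pp (peg/match ~(/ (<- 1) ,answers) "n"))  # @[:no]

# 3. Any other value is pushed as-is, in place of whatever was captured.
(pp (peg/match ~(/ (<- :d+) :number) "42"))  # @[:number]
```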
Okay. This has been: PEG Captures 101. Now let's get back to our HTML example.
:text (<- (some (if-not "<" 1)))
Right. So `(<- (some (if-not "<" 1)))` is equivalent to the regular expression `^([^<]++)`. It tries to match `"<"`, and *if that fails* (if the next character is not `<`) then it advances by one character. And then it repeats, until it finds a `<` character or runs out of input, and finally it adds the entire string it consumed to the capture stack.
So if we give it the following input, it's going to match everything up to the first `<`, which here is `hello yes this is ` (including the trailing space):
hello yes this is <b>janet</b>
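Running the `:text` pattern on its own confirms where it stops. A minimal sketch, with `text` as a made-up name for the standalone pattern:

```janet
(def text ~(<- (some (if-not "<" 1))))

# Captures everything before the first "<".
(pp (peg/match text "hello yes this is <b>janet</b>"))  # @["hello yes this is "]

# Fails outright when the input starts with "<", because `some` needs
# at least one byte.
(pp (peg/match text "<b>janet</b>"))  # nil
```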
Easy. The next part is... not so easy.
:element (unref
  {:main (/ (* :open-tag (group :nodes) :close-tag) ,element-to-struct)
   :open-tag (* "<" (<- :w+ :tag-name) (group (? (* :s+ :attributes))) ">")
   :attributes
     {:main (some (* :attribute (? :s+)))
      :attribute (* (<- :w+) "=" :quoted-string)
      :quoted-string (* `"` (<- (any (if-not `"` 1))) `"`)}
   :close-tag (* "</" (backmatch :tag-name) ">")})
But we'll take it one step at a time, and it'll be fine.
The whole pattern is wrapped in `unref`, but I can't actually explain that until the end, so we'll skip over it for now and jump straight to `:main`. We'll circle back to `unref` after we talk about backreferences.
:main (/ (* :open-tag (group :nodes) :close-tag) ,element-to-struct)
So an `:element` consists of an opening tag, some child nodes, and then a matching closing tag. Like `<i>hello</i>`.
But we don't match `:nodes`; we match `(group :nodes)`. Because recall that `:nodes` is going to push multiple nodes onto the capture stack:
:nodes (any (+ :element :text))
Specifically, anything captured in `:element` or `:text`. But `(group :nodes)` says "well, instead of pushing every capture individually, wrap all the captures into an array and push that array." So we'll match multiple nodes, but we'll only have a single (possibly empty!) list of nodes on the capture stack when we're done.
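The effect of `group` is easy to see on a tiny pattern. The following sketch captures digits one at a time, with and without a surrounding `group`:

```janet
# Without group, each digit is a separate capture.
(pp (peg/match ~(any (<- :d)) "123"))          # @["1" "2" "3"]

# With group, they arrive as one nested array.
(pp (peg/match ~(group (any (<- :d))) "123"))  # @[@["1" "2" "3"]]

# The group is still pushed when nothing inside it captured anything.
(pp (peg/match ~(group (any (<- :d))) "abc"))  # @[@[]]
```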
After we parse all of a tag's individual components (tag name, attributes, and children), we'll call `element-to-struct` to wrap it up into a nicer format. Note that `element-to-struct` actually takes three arguments: one for each of `:element`'s capture groups. (The tag name and attributes are captured by the `:open-tag` sub-pattern.)
But actually matching the tags is the interesting bit.
:open-tag (* "<" (<- :w+ :tag-name) (group (? (* :s+ :attributes))) ">")
:close-tag (* "</" (backmatch :tag-name) ">")
I want to draw your attention to `(<- :w+ :tag-name)`. This is a *tagged* capture, and `:tag-name` is its "tag". When you tag a capture, you can refer back to it later in the match, and that's exactly what `(backmatch :tag-name)` does.
But hark! There might be *multiple* tagged captures to contend with.
<p>If you have <em>nested</em> tags</p>
`<p>` will push a tagged capture to the stack, and so will `<em>`. So now there are two captures tagged `:tag-name`. But when we `backmatch`, we're going to look for the most recent time we tagged a capture with `:tag-name`, which is going to be `"em"`. This will match `</em>` successfully, but of course it will fail once we get to `</p>`.
And that's bad! What we want to do is "scope" the tagged matches, so that parsing the `<em>` tag doesn't leak out to our parsing of the `<p>` tag.
So that's exactly what `unref` does. It says "after you're done parsing this pattern, remove all of the tags that you associated with any captures." By wrapping `unref` around our `:element`, we make these tagged captures local to each `<tag>`.
Wow, it sure is confusing that HTML tags are called tags and capture tags are also called tags. Someone really could have picked a better example to introduce tagging, huh?
Okay, now you might be thinking: why is this a problem? Sure, we pushed `"em"` to the capture stack, but then we popped it off! We `replace`d it with an untagged struct when we called `element-to-struct`, right? Why can `backmatch` still see it?
Well, tagged captures are actually separate from the capture stack. `backmatch` doesn't look for "the uppermost capture on the stack with this tag" (the tags don't live on the capture stack at all). `backmatch` actually looks for "the last time we captured something with this tag."
To help make this make sense, I'm going to describe a model of how you might implement a simple PEG matcher. We'll keep track of two pieces of state: a stack of stacks, and a stack of tag scopes. We'll start with a single stack on the stack-stack, and a single scope on the scope-stack, and different combinators will manipulate these.
The `group` combinator, for example, pushes a new stack onto the stack-stack, executes its pattern, and then pops that new stack and pushes it onto the next-highest stack (as an array). The `replace` combinator pushes a new stack, executes its pattern, then pops it off the stack-stack, passing its contents as positional arguments to its function. And then it pushes the return value to the new topmost stack on the stack-stack.
Meanwhile `unref` pushes a new *tag scope*, executes its pattern, and then pops the tag scope once it's done. `unref` is the only combinator that affects the tag scope stack.
You can actually pass a specific named tag to `unref` to only "scope" that *particular* tag name, allowing you to leak some tags into the outer scope. So in that case `unref` pushes a new tag scope, executes its pattern, and then copies everything *except* the named tag into the outer scope.
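The difference `unref` makes can be checked with a cut-down grammar that keeps only tag names and skips attributes and text captures. This is a sketch under those simplifying assumptions; `element-body`, `unscoped`, and `scoped` are hypothetical names.

```janet
# One element: an open tag, any mix of nested elements and plain bytes,
# then a close tag that must repeat the tagged :tag-name capture.
(def element-body
  ~{:main (* :open (any (+ :element (if-not "<" 1))) :close)
    :open (* "<" (<- :w+ :tag-name) ">")
    :close (* "</" (backmatch :tag-name) ">")})

# The same grammar twice: once with :element bare, once wrapped in unref.
(def unscoped
  (peg/compile ~{:main (* :element -1) :element ,element-body}))
(def scoped
  (peg/compile ~{:main (* :element -1) :element (unref ,element-body)}))

(def input "<p>a <em>b</em> c</p>")

# Without unref, </p> is compared against the leaked "em" tag and fails.
(pp (peg/match unscoped input))  # nil

# With unref, the "em" tag is forgotten once the <em> element is done.
(pp (peg/match scoped input))    # @["p" "em"]
```

Note that the captured strings themselves survive in the scoped version; only the tag association is dropped.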
Alright. Now the only thing we haven't talked about is the `:attributes` bit.
:attributes
  {:main (some (* :attribute (? :s+)))
   :attribute (* (<- :w+) "=" :quoted-string)
   :quoted-string (* `"` (<- (any (if-not `"` 1))) `"`)}
And I actually don't think there's much to say about this? You've seen it all already. This is easy. `:s+` is "one or more whitespace characters," and is one of many named patterns available by default.
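Pulled out of the larger grammar, the attribute pattern can be tried on its own. This sketch also shows why `element-to-struct` splices the captures into `struct`: they come back as a flat list of alternating names and values. The names `attributes`, `captures`, and `attrs` are made up for the example.

```janet
(def attributes
  ~{:main (some (* :attribute (? :s+)))
    :attribute (* (<- :w+) "=" :quoted-string)
    :quoted-string (* `"` (<- (any (if-not `"` 1))) `"`)})

(def captures (peg/match attributes `href="x.html" class="big"`))
(pp captures)  # @["href" "x.html" "class" "big"]

# Splicing the flat list into struct pairs each name with its value.
(def attrs (struct ;captures))
(pp (get attrs "class"))  # "big"
```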
Alright. That wasn't so bad, was it?
(defn element-to-struct [tag attrs children]
  {:tag tag :attrs (struct ;attrs) :children children})

(def html-peg (peg/compile
  ~{:main (* :nodes -1)
    :nodes (any (+ :element :text))
    :element (unref
      {:main (/ (* :open-tag (group :nodes) :close-tag) ,element-to-struct)
       :open-tag (* "<" (<- :w+ :tag-name) (group (? (* :s+ :attributes))) ">")
       :attributes
         {:main (some (* :attribute (? :s+)))
          :attribute (* (<- :w+) "=" :quoted-string)
          :quoted-string (* `"` (<- (any (if-not `"` 1))) `"`)}
       :close-tag (* "</" (backmatch :tag-name) ">")})
    :text (<- (some (if-not "<" 1)))}))

(defn main [&]
  (def input (string/trim (file/read stdin :all)))
  (pp (peg/match html-peg input)))
When you look at it all at once, it is pretty intimidating. But just think what the equivalent regular expression would look like! Oh, wait. You can't. Parsing HTML with regexes is famously impossible.
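To see what the parser actually produces, here is a sketch that feeds it a small document directly instead of reading stdin. The definitions are repeated so the block runs on its own; the sample input and the names `result` and `root` are made up.

```janet
(defn element-to-struct [tag attrs children]
  {:tag tag :attrs (struct ;attrs) :children children})

(def html-peg (peg/compile
  ~{:main (* :nodes -1)
    :nodes (any (+ :element :text))
    :element (unref
      {:main (/ (* :open-tag (group :nodes) :close-tag) ,element-to-struct)
       :open-tag (* "<" (<- :w+ :tag-name) (group (? (* :s+ :attributes))) ">")
       :attributes
         {:main (some (* :attribute (? :s+)))
          :attribute (* (<- :w+) "=" :quoted-string)
          :quoted-string (* `"` (<- (any (if-not `"` 1))) `"`)}
       :close-tag (* "</" (backmatch :tag-name) ">")})
    :text (<- (some (if-not "<" 1)))}))

(def result (peg/match html-peg `<p class="x">hi <b>there</b></p>`))
(def root (first result))

(pp (root :tag))                       # "p"
(pp (get-in root [:attrs "class"]))    # "x"
(pp (get-in root [:children 0]))       # "hi "
(pp (get-in root [:children 1 :tag]))  # "b"

# Mismatched tags make the whole match fail.
(pp (peg/match html-peg "<p>oops</b>"))  # nil
```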
We've already seen a lot of useful PEG combinators, but we're not limited to the built-in operations that Janet gives us. We can actually interleave arbitrary functions into a PEG, and use them to guide the matching process. This allows us to write custom predicates to express complicated matching logic that would be very difficult to implement natively in a PEG ("identifier with more vowels than consonants"), but it's especially useful when we already have a regular function that knows how to parse strings.
For example, `scan-number` is a built-in function that parses numeric strings into numbers:
repl:1:> (scan-number "512")
512
repl:2:> (scan-number "512x")
nil
If we wanted to parse a number somewhere in a PEG, then... well, we'd use the built-in `(number)` operator that does exactly that. But let's pretend that that doesn't exist for a second, and try to implement it in terms of `scan-number`. Here's a first attempt:
repl:1:> (peg/match ~(/ (<- (some (+ :d (set ".-+")))) ,scan-number) "123")
@[123]
That works, sometimes. But of course that number pattern is not very accurate, and we already saw that `scan-number` will return `nil` if we give it a bad input:
repl:2:> (peg/match ~(/ (<- (some (+ :d (set ".-+")))) ,scan-number) "1-12-3+3.-++")
@[nil]
But the match still succeeded, and captured `nil`, because that was what we told it to do.
So we *could* try to carefully write a valid number pattern here, such that we only ever pass valid input to `scan-number`. But we don't want to do that. That sounds hard. We just want the pattern to *fail* if `scan-number` can't actually parse a number.
Enter `cmt`:
repl:3:> (peg/match ~(cmt (<- (some (+ :d (set ".-+")))) ,scan-number) "1-12-3+3.-++")
nil
So `cmt` is very similar to `replace`, except that if your function returns something falsy (remember: just `nil` or `false`), then the `cmt` clause itself will fail to match. It's sort of like a `map` vs `filterMap` situation.
Of course in real life, as previously mentioned, we'd just write this:
repl:1:> (peg/match ~(number (some (+ :d (set ".-+")))) "123")
@[123]
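Beyond number parsing, `cmt` is handy for validation that is awkward to express as a pattern. As a sketch, here is a dotted-quad matcher whose range check lives in an ordinary function. The names `octet` and `ipv4` are hypothetical, and this is not meant as a complete IP address validator.

```janet
# Hypothetical validator: returns the number if it fits in one byte,
# and nil (which makes the cmt fail) otherwise.
(defn octet [digits]
  (def n (scan-number digits))
  (if (<= 0 n 255) n))

(def ipv4
  ~{:main (* :octet "." :octet "." :octet "." :octet -1)
    :octet (cmt (<- :d+) ,octet)})

(pp (peg/match ipv4 "192.168.0.1"))    # @[192 168 0 1]
(pp (peg/match ipv4 "192.168.0.256"))  # nil
```

Since `0` is truthy in Janet, an octet of zero is captured like any other.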
`cmt` stands for "match-time capture," apparently, even though the letters are not in that order. The name comes to us from LPeg, the Lua PEG library that inspired Janet's PEG library, where all capture-related functions start with `C`. It's a very useful function despite the confusing name, and there's something else that makes it even more useful: the `->` operator.
`->` stands for `backref`, and it looks quite strange at first glance: all it does is re-capture a previously tagged capture. If you just used it by itself, it would duplicate previously tagged captures onto the capture stack and consume no input, which doesn't sound very useful.
repl:1:> (peg/match ~(* (<- :d+ :num) (-> :num)) "123")
@["123" "123"]
But if you use it inside the pattern you pass to `cmt`, you can add previous captures as arguments to your custom mapping predicate.
Here's a concrete, if extremely dumb, example: I've invented my own HTML dialect that is identical to regular HTML, except that `<head>` tags can optionally be closed with a `</tail>` tag, because that's modestly whimsical.
Previously we were able to use `backmatch` to match closing tags, because they happened to be bytewise-identical to the values we captured in `:open-tag`:
:close-tag (* "</" (backmatch :tag-name) ">")
But now that's no longer true, and `backmatch` isn't sufficient to handle this very practical HTML dialect. We'll have to write some logic:
(defn check-close-tag [open-tag close-tag]
  (or (= open-tag close-tag)
      (and (= open-tag "head") (= close-tag "tail"))))

:close-tag (* "</" (drop (cmt (* (-> :tag-name) (<- :w+)) ,check-close-tag)) ">")
Notice that we "re-capture" `:tag-name`, in addition to capturing the `:w+`. Because `cmt` needs a single pattern to execute, I stuck them together with `*`, but both of these captures will be passed as arguments to `check-close-tag`.
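To try the new `:close-tag` in isolation, here is a sketch that drops it into a cut-down element grammar with no attributes and no captured children. The grammar name `whimsical-peg` is made up, and `check-close-tag` is repeated so the block runs on its own.

```janet
(defn check-close-tag [open-tag close-tag]
  (or (= open-tag close-tag)
      (and (= open-tag "head") (= close-tag "tail"))))

# Simplified element grammar: only the tag name is captured.
(def whimsical-peg
  (peg/compile
    ~{:main (* :element -1)
      :element (unref
        {:main (* :open-tag (any (+ :element (if-not "<" 1))) :close-tag)
         :open-tag (* "<" (<- :w+ :tag-name) ">")
         :close-tag (* "</"
                       (drop (cmt (* (-> :tag-name) (<- :w+)) ,check-close-tag))
                       ">")})}))

(pp (peg/match whimsical-peg "<head>hi</tail>"))  # @["head"]
(pp (peg/match whimsical-peg "<head>hi</head>"))  # @["head"]
(pp (peg/match whimsical-peg "<body>hi</tail>"))  # nil
```

The `drop` matters here: without it, the truthy value returned by `check-close-tag` would land on the capture stack next to the tag name.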
Neat.
We are now very close to knowing everything there is to know about PEGs, but I think we should talk about one more thing before we leave this chapter:
Regular expressions aren't just useful for matching or extracting text. They're also useful for *changing* text.
Regex replace is a common primitive operation; you use it all the time in your editor or with `sed` or whatever. And of course Janet has a native `peg/replace` function, and we're going to talk about it soon.
But let's just *pretend*, for a moment, that it doesn't exist. Because it turns out that you don't actually *need* a built-in PEG replace function: you can implement replacement as a special case of capturing.
It's a pretty simple trick: we're going to write a PEG that captures two things: the part of the string that matches the pattern we want to replace, and the entire rest of the string.
Just so we have something concrete to work with, let's write a chaotic evil PEG: given a string, we'll find all of the Oxford commas in that string, and replace them with Oxford semicolons.
So given input like this:
this is dumb, confusing, and upsetting
We'll wind up with:
this is dumb, confusing; and upsetting
Naturally.
So the PEG itself is easy: we just want to match the literal string `", and"`, wherever it appears in the input:
repl:1:> (peg/match ~(any (+ ", and" 1)) "a, b, and c")
@[]
Okay. It did work; you just can't tell. Let's replace it, which will automatically capture the output, so we can at least see that it's working:
repl:2:> (peg/match ~(any (+ (/ ", and" "; and") 1)) "a, b, and c")
@["; and"]
Okay. And now let's *also* capture everything else:
repl:3:> (peg/match ~(any (+ (/ ", and" "; and") (<- 1))) "a, b, and c")
@["a" "," " " "b" "; and" " " "c"]
And we're done! Sort of! We have the entire modified string, as a list of captures, and all we have to do now is stick them back together.
And I know: this looks *unbelievably* inefficient. And it would be, if we just called, like, `string/join` on this result. But Janet has a way to efficiently join these matches together without even making these intermediate string allocations in the first place.
It's called `accumulate`, although I'm going to use the short alias `%`:
repl:4:> (peg/match ~(% (any (+ (/ ", and" "; and") (<- 1)))) "a, b, and c")
@["a, b; and c"]
And `accumulate` is special-cased in the PEG engine: while Janet is executing the pattern inside an `accumulate` block, anything that would normally push captures onto the stack *instead* just copies it into a shared mutable buffer. And once it's done with its pattern, that buffer becomes a string, and `accumulate` pushes it onto the capture stack.
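The same capture-everything-and-accumulate trick works for more than fixed replacements. As a sketch, here is a tiny template filler that combines `%` with a struct lookup via `/`. The function `render` and its placeholder syntax are invented for this example.

```janet
# Hypothetical helper: fill {placeholders} in `template` from the
# struct `vars`, copying every other byte through unchanged.
(defn render [template vars]
  (def patt
    ~(% (any (+ (* "{" (/ (<- (to "}")) ,vars) "}")
                (<- 1)))))
  (first (peg/match patt template)))

(pp (render "Hello, {name}! You are {age}."
            {"name" "Janet" "age" "10"}))
# "Hello, Janet! You are 10."
```

This sketch does nothing sensible for placeholders that are missing from `vars`; a real version would want to decide what should happen there.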
So that's a *global* replace. But what if you only want to replace the first occurrence?
Here's one way:
repl:5:> (peg/match ~(% (any (+ (* (/ ", and" "; and") (<- (to -1))) (<- 1)))) "a, and b, and c")
@["a; and b, and c"]
After we match and replace the pattern, we immediately consume the rest of the string, so that the `any` repetition won't fire again.
Hey look! We did it. `accumulate` was the last combinator on my list of combinators to tell you about, and I just told you about it. That means we're almost done with the chapter now.
But we get to do something fun and easy first. There's actually another way that we could have written that last pattern:
repl:6:> (peg/match ~(% (any (+ (* (/ ", and" "; and") '(to -1)) '1))) "a, and b, and c")
@["a; and b, and c"]
We replaced all of the `(<- x)` captures with just `'x`, which does exactly the same thing. How does that work?
Well, `'x` is actually just syntax sugar for `(quote x)`. They both parse into exactly the same abstract syntax tree: if you're writing a macro or a PEG engine or whatever else, you can't actually tell whether it was originally written using the `'` shorthand or not. So when the whole thing is quasiquoted:
repl:7:> ~(% (any (+ (* (/ ", and" "; and") '(to -1)) '1)))
(% (any (+ (* (/ ", and" "; and") (quote (to -1))) (quote 1))))
All of those single-quotes get expanded into `(quote)` forms, and `quote` is just another alias for `capture` in the PEG parser. But when you use the shorthand, you can save quite a few parentheses.
Now, it's fun to work through these examples, and I think it's valuable to understand how they work, just in case you ever find yourself needing to perform some weird text surgery deep inside some complicated PEG. But of course, in real life, you only have to write:
repl:1:> (peg/replace ", and" "; and" "a, b, and c")
@"a, b; and c"
There is also `peg/replace-all`; `peg/find`, which returns the index of the first match; and `peg/find-all`, which returns all of the indices where the PEG would match.
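A short sketch of those three functions, using the same Oxford-comma input as before:

```janet
# Replace every match, not just the first. Like peg/replace, this
# returns a buffer.
(pp (peg/replace-all ", and" "; and" "a, and b, and c"))  # @"a; and b; and c"

# Index of the first place the PEG matches, or nil if there is none.
(pp (peg/find ", and" "a, b, and c"))  # 4
(pp (peg/find ", and" "a, b, c"))      # nil

# Every index where the PEG matches.
(pp (peg/find-all "and" "sand and band"))  # @[1 5 10]
```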
Alright. That's all of the important PEG stuff sorted, but I want to close with a few scattered, wandering observations:
So far we've only talked about parsing text, but you can write PEGs to parse binary formats just as easily. There are even built-in combinators for parsing signed and unsigned big- and little-endian integers up to 64 bits wide, which return the boxed `core/s64` and `core/u64` abstract types.
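As a small sketch of the binary combinators, here are `uint`, `uint-be`, and `int` reading a few bytes. The name `two-shorts` is made up. As far as I can tell, narrow widths like these come back as ordinary Janet numbers, with the boxed types reserved for the widest integers, but treat that as an assumption.

```janet
# Two-byte unsigned integers: little-endian first, then big-endian.
(def two-shorts ~(* (uint 2) (uint-be 2)))
(pp (peg/match two-shorts "\x01\x02\x01\x02"))  # @[513 258]

# A signed single byte.
(pp (peg/match ~(int 1) "\xFF"))  # @[-1]
```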
There are a million helpful "regex tester" websites that can show you your pattern as a finite state machine or interactively highlight different parts of matches or capture groups. But there is no equivalent for PEGs. If you're running into trouble with your PEGs, well... you basically have to ask about it in the Janet chatroom, I think.
Do you want to make a PEG visualization website? You should. I would use that.
You don't *have* to compile a PEG ahead of time with `peg/compile` (you can just pass the symbolic expression directly to `peg/match`, as we've been doing), but if you're going to use a PEG more than once then it's probably a good idea. Especially if you're compiling your program: Janet will marshal an optimized bytecode representation of your PEG into the final image.
Note that `peg/compile` is a *function*, not a macro, so you'll have to remember to call it at the top-level to ensure it executes during the compilation phase. There's no reason to spend time compiling PEGs at runtime, after all, unless you're dynamically constructing them.
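In practice that looks something like the following sketch: the PEG is compiled once in a top-level `def`, and a function reuses it. The names `words-peg` and `words` are hypothetical.

```janet
# Compiled once, at the top level, when the file is loaded or built.
(def words-peg (peg/compile ~(any (+ (<- :a+) 1))))

# The result is an abstract value rather than a tuple.
(pp (type words-peg))  # :core/peg

# Hypothetical helper that reuses the compiled PEG on every call.
(defn words [text]
  (peg/match words-peg text))

(pp (words "parse all the things"))  # @["parse" "all" "the" "things"]
```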
PEGs are just symbolic expressions, and you already know how to write functions that manipulate symbolic expressions.
PEGs really are one of my favorite things about Janet: I have never met a scripting language that made it *so easy* to parse text before.