This is demonstrative of one of the great superpowers of Lisp: that you can write mini native code compilers—without knowing a lick of assembly. (But knowing assembly certainly helps you optimize the heck out of it.)
More specifically, you can compile syntax (of your choosing) into performant, low-level Lisp code, which the Lisp compiler can then compile into efficient machine code. To do this, you need no extra tools, no intermediate files, no compiler hooks, no virtual machines, no bytecodes, and no non-standard or non-portable features. It has been built into Lisp since at least the 1980s.
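To make that concrete, here is a toy sketch of my own (the RPN macro is hypothetical, not from the article): a macro that compiles a tiny postfix arithmetic DSL into ordinary Lisp at macroexpansion time, and the Lisp compiler then turns the expansion into native code like any other form.

(defmacro rpn (&rest tokens)
  ;; Toy "mini compiler": fold a postfix token list into nested prefix
  ;; forms at macroexpansion time.
  (let ((stack '()))
    (dolist (token tokens (first stack))
      (if (numberp token)
          (push token stack)
          (let ((b (pop stack))
                (a (pop stack)))
            (push (list token a b) stack))))))

;; (macroexpand-1 '(rpn 1 2 + 3 *))  =>  (* (+ 1 2) 3)
;; The expansion is plain Lisp, so it gets compiled to machine code for free.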
Making miniature "JIT" compilers is also a technique I use in a quantum simulator written in Lisp [0a, 0b] to optimize quantum gate operations. It analyzes the quantum gate matrix and, on the fly, compiles it into optimized numerical Lisp code that is equivalent to that matrix acting on vectors of tensor product spaces (i.e., super huge arrays of complex double floats). Using another language would require me to do arduous things like write assembly code or learn a library like LibJIT [1], which may not even interface well with my host language.
Other simulators out there written in C use huge hand-unrolled loops with comments like
// THIS CODE WAS GENERATED, DO NOT TOUCH
and/or in-line assembly code, and still can't optimize specific matrix shapes and structures, or do algebraic simplifications to eliminate work altogether.
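For flavour, here is roughly what that gate-specializing trick looks like at toy scale (COMPILE-GATE is my own hypothetical sketch, not QVM's actual COMPILE-OPERATOR): take a 2x2 gate matrix, build a lambda whose arithmetic has the matrix entries baked in as literals, and hand it to COMPILE to get native code.

(defun compile-gate (matrix)
  ;; MATRIX is ((m00 m01) (m10 m11)); entries are coerced to complex
  ;; double-floats and spliced into the generated code as constants.
  (destructuring-bind ((m00 m01) (m10 m11))
      (mapcar (lambda (row)
                (mapcar (lambda (x) (coerce x '(complex double-float))) row))
              matrix)
    (compile nil
             `(lambda (a0 a1)
                (declare (type (complex double-float) a0 a1)
                         (optimize (speed 3) (safety 0)))
                (values (+ (* ,m00 a0) (* ,m01 a1))
                        (+ (* ,m10 a0) (* ,m11 a1)))))))

;; (funcall (compile-gate '((0 1) (1 0))) #C(1d0 0d0) #C(0d0 0d0))
;; applies a Pauli-X gate to one amplitude pair.  A real implementation would
;; also simplify away the multiplications by 0 and 1 before compiling.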
The regex library FTA is a great, and clean, example of the long-standing practice of compiling regexen, except it doesn't use any fancy VMs or any fancy JITs, just "when you see this regex, automatically turn it into this Common Lisp code, and let the Lisp compiler handle the rest."
[0a]
https://github.com/quil-lang/qvm
[0b] COMPILE-OPERATOR:
https://github.com/quil-lang/qvm/blob/master/src/compile-gat...
[1]
https://www.gnu.org/software/libjit/
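As a toy illustration of that "regex in, Lisp code out" idea (my own sketch, handling only literal strings, nothing like the real library's DFA compiler): build Lisp source specialized to the pattern, then let COMPILE produce native code.

(defun compile-literal-matcher (literal)
  ;; Returns a compiled function that matches LITERAL in STRING at START,
  ;; returning the end position on success or NIL on failure.
  (compile nil
           `(lambda (string start)
              (declare (type simple-string string)
                       (type fixnum start)
                       (optimize (speed 3)))
              (let ((end (+ start ,(length literal))))
                (and (<= end (length string))
                     (string= string ,literal :start1 start :end1 end)
                     end)))))

;; (funcall (compile-literal-matcher "cat") "concatenate" 3)  =>  6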
Perhaps most importantly, a good CL implementation can help with debugging this, by keeping track of the original CL code that generated the intermediate code, and pointing you to it. With the "use a program to generate C" approach, if there's some bug in the C code, the best the compiler or runtime can do is point you to the line of C code that's crashing. It's up to you to figure out where that came from.
This was a problem for early C++ compilers like Cfront that translated C++ into C: the error messages from the C compiler referenced the intermediate C, not the original C++.
How old is the #line directive, I wonder? It's been a long time since Cfront.
https://stackoverflow.com/questions/2216540/whats-the-meanin...
I express my appreciation for the author's reference to King Crimson - the track "One More Red Nightmare" off of the record _Red_.
I noticed this too and was hoping the author would mention it in his article but couldn't find a reference to it :) This is a great album to listen to while doing technical work.
See the first footnote: "The name is of course a reference to One More Red Nightmare."
This title has a typo - should be _One-more-re-nightmare_.
I'm pretty sure I submitted it properly. Strange. Dang, could you fix it?
As this is a regex->DFA implementation, I wonder how it compares to Google RE2 [1]. There are nice performance comparisons to glibc, Rust's regex and cl-ppcre, but RE2 was missing from that list.
[1]
https://github.com/google/re2
If the benchmarks are to be believed, I think it substantially outperforms re2, because it also beats Hyperscan, and I think Hyperscan typically beats re2.
RE2 has limited functionality. It works well for specific patterns, like crawling, but not for general-purpose use.
I'm sensing a subtle King Crimson reference here...
What about derivative-based regex?
I got worried at some point that some of the purported advantages of derivative-based regexes might be drawbacks. One such advantage/drawback is that it adds the _complement_ operation to the regex language. But that gives you a way of deciding whether two regexes are equivalent, which is supposed to be PSPACE-complete, which ought to severely hamper finding an efficient implementation.
Author here - I don't think this is an issue.
If you don't use submatching, you can just compute the minimal DFAs and compare them to test for equivalence, and you don't even need negation. Thus, introducing negation can't do any harm there.
But as I generate Mealy machines for submatching, which can't be minimised, we need another approach. Knowing that A - B = A ∧ ¬B, one can produce a DFA for (A - B) ∨ (B - A), and if it has no accepting states, then A and B must be equivalent, as nothing can match one but not both. But DFA construction is O(2ⁿ), even if one takes the RE → NFA → DFA route, so this is nothing new complexity-wise.
Such a decision procedure is never used in a derivative-based regex implementation, anyway. The equivalence of regular expressions only needs to be approximated, and the approximation consists of some structural rules (see Definition 4.1 of [1]). Relatively few rules are needed to guarantee that a _finite_ machine can be generated, and only a dozen rules are required to get very close to a minimal DFA (table 1, page 14 of [1]). So having the possibility of writing a precise regex equivalence function is completely irrelevant to producing an efficient regex engine.
A decision procedure is handy, but you can write one for submatch-less regexes without negation, the complexity of such a procedure is typical for DFA construction, and the problem of deciding if two regular expressions are equivalent is irrelevant.
[1] Regular-expression derivatives reexamined
https://www.ccs.neu.edu/home/turon/re-deriv.pdf
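To show the flavour of those structural rules, here is a toy normaliser of my own (the s-expression encoding is hypothetical, not the paper's): regexes are forms like (:or r s), (:cat r s) and (:star r), with :empty-set and :epsilon as constants, and a handful of rewrites keeps derivatives in a normal form.

(defun simplify (re)
  ;; A few of the rewrites in the spirit of Definition 4.1, on binary
  ;; :or/:cat forms only.
  (if (atom re)
      re
      (let ((re (cons (first re) (mapcar #'simplify (rest re)))))
        (case (first re)
          (:or  (destructuring-bind (r s) (rest re)
                  (cond ((equal r s) r)                 ; r + r => r
                        ((eq r :empty-set) s)           ; 0 + r => r
                        ((eq s :empty-set) r)
                        (t re))))
          (:cat (destructuring-bind (r s) (rest re)
                  (cond ((or (eq r :empty-set)
                             (eq s :empty-set)) :empty-set) ; 0 . r => 0
                        ((eq r :epsilon) s)             ; e . r => r
                        ((eq s :epsilon) r)
                        (t re))))
          (:star (let ((r (second re)))
                   (if (and (consp r) (eq (first r) :star))
                       r                                ; (r*)* => r*
                       re)))
          (t re)))))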
The complement operation doesn't provide a way of deciding whether two regexes are equivalent, nor is any such thing required to implement it. Using derivatives to interpret a regular expression that contains a complement operator does not require a function for calculating equivalence of two regular expressions.
Equivalence is required for compiling. The reason is that regular expressions have features like the Kleene closure which express iteration, and these give rise to endless derivatives: regular expressions whose derivative you can keep taking _ad nauseam_. This is fine for interpreting, because the algorithm terminates when you have consumed the input or hit an impasse. The compiler, however, has to close the loops and turn them into cycles in the state machine graph. To do that it has to check each derivative: "have I derived an expression which has occurred before?" If so, then just hook to the previous state.
If you make the check literal, just comparing the regex syntax for identical structure, the graph will sometimes be larger than necessary, because it will contain redundant nodes that should be equivalent states. The smarter you make the equivalence function, the more compact the graphs will be.
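Here is a sketch of that loop, reusing the toy s-expression regexes and the SIMPLIFY normaliser sketched in the author's sub-thread above (my own code; it ignores :not and :and, and as the paper notes, stronger normalisation of :or is needed to guarantee a finite machine in general).

(defun nullable-p (re)
  ;; Does RE match the empty string?
  (cond ((eq re :epsilon) t)
        ((eq re :empty-set) nil)
        ((characterp re) nil)
        (t (case (first re)
             (:star t)
             (:or  (or  (nullable-p (second re)) (nullable-p (third re))))
             (:cat (and (nullable-p (second re)) (nullable-p (third re))))))))

(defun deriv (re c)
  ;; Brzozowski derivative of RE with respect to the character C.
  (cond ((eq re :epsilon) :empty-set)
        ((eq re :empty-set) :empty-set)
        ((characterp re) (if (char= re c) :epsilon :empty-set))
        (t (case (first re)
             (:star `(:cat ,(deriv (second re) c) ,re))
             (:or   `(:or ,(deriv (second re) c) ,(deriv (third re) c)))
             (:cat  (let ((r (second re)) (s (third re)))
                      (if (nullable-p r)
                          `(:or (:cat ,(deriv r c) ,s) ,(deriv s c))
                          `(:cat ,(deriv r c) ,s))))))))

(defun build-dfa (regex alphabet)
  ;; States are simplified derivatives; a derivative we have EQUAL-seen
  ;; before is reused instead of added, which closes the loops.
  (let* ((start (simplify regex))
         (states (list start))
         (worklist (list start))
         (transitions '()))
    (loop while worklist
          do (let ((state (pop worklist)))
               (dolist (c alphabet)
                 (let ((next (simplify (deriv state c))))
                   (unless (member next states :test #'equal)
                     (push next states)
                     (push next worklist))
                   (push (list state c next) transitions)))))
    (values states transitions)))

;; (build-dfa '(:cat #\a (:star #\b)) '(#\a #\b)) yields three states;
;; a smarter equivalence check than EQUAL would merge redundant ones.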
This is a good example of why I roll my eyes at most programming language discussions. There is always someone that starts the "language X is SLOW" bandwagon, and people hop on.
Languages are rarely "slow" or "fast" by themselves. Programs written in those languages can be slow or fast.
>Languages are rarely "slow" or "fast" by themselves.
For sure they are: language features will limit the optimizations a compiler/interpreter can do, and a language with many levels of abstraction will have to pay the price in performance somewhere.
Every time someone improves the benchmark performance of a slow language, that person is not a regular developer but someone who has a deeper understanding of what happens under the hood, knows how to use profiling tools, maybe knows assembly, and most of the time has to reach for tricks or advanced patterns.
Btw, optimizing a slow language in a hot path with clever tricks or advanced patterns is fine; it happens all the time, and you can't just switch the project's language with one click.
>language features will for sure limit the optimizations a compiler/interpreter can do
For example:
Common Lisp has a full numerical tower, meaning it will automatically convert fixed-size numbers into bignums (and more cool stuff). This means that an addition has to branch if the compiler can't prove that the result won't overflow from fixnum to bignum. Yes, C programs sometimes need bignums too, but there they're opt-in; in CL they're opt-out.
There are more examples of this, often related to the dynamism of your language. The more introspection a language is capable of, the less a compiler designer can do to cheat and optimize your program.
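A concrete, SBCL-flavoured illustration of the fixnum/bignum point (my own example): the generic addition has to cope with overflow into a bignum, while declarations let the compiler emit a plain machine add.

(defun generic-add (x y)
  (+ x y))                      ; must handle fixnum overflow into a bignum

(defun fixnum-add (x y)
  (declare (type fixnum x y)
           (optimize (speed 3) (safety 0)))
  (the fixnum (+ x y)))         ; the programmer promises the result fits too

;; (+ most-positive-fixnum 1) quietly returns a bignum, while
;; (disassemble #'fixnum-add) shows the branch-free code the declarations allow.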
Yes and no. Some language features can make it incredibly difficult to optimise. I know the JRuby folks had all sorts of fun trying to make method dispatch fast in the face of "well, this method definition can be changed at any time, arbitrarily often, including during execution of the method itself". I know CL must also have solved that problem but I presume the toolkit is sufficiently different that the techniques just don't map.
A nice mechanism for optimizing such things was given at the last European Lisp Symposium.
https://european-lisp-symposium.org/static/proceedings/2021....
(the last paper, by Robert Strandh.)
I also feel that for 99% of applications, it _doesn't matter_. Like, yeah, there are a few areas where hyper-optimized code in some kind of near-bare-metal language is necessary. But most of the time, if you're writing code, the language itself isn't gonna be the deciding factor in whether something is performant enough for your application. Furthermore, by using a "fast" language, you're trading away your ability to quickly and cleanly implement human-readable code.
There are fast and slow high-level languages. JS is at least as high-level as Python, but is WAY faster.
Common Lisp is interesting in that it is high-level while lots of implementations (like SBCL) also retain the ability to drop down to lower levels (even assembly) if lots of performance is needed.