Not This Next Character

One regular expression trick is to use "not this next character" or here "[^b]+"; this avoids various pitfalls of ".+b" (greedy) or ".+?b" (not greedy). A downside is that you need to have a character to match on that never appears in what is being skipped over.

    $ perl -E 'say "aaaabut not these bytes" =~ /^(.+)b/   '
    aaaabut not these
    $ perl -E 'say "aaaabut not these bytes" =~ /^(.+?)b/  '
    aaaa
    $ perl -E 'say "aaaabut not these bytes" =~ /^([^b]+)/ '
    aaaa

The greedy matches too much. What else is different between the regular expressions, besides "(.+)" matching more? A microbenchmark might be in order.

notnextchar.pl

    $ perl notnextchar.pl
                Rate dotplqu notnext dotplus
    dotplqu 330288/s      --    -51%    -51%
    notnext 674633/s    104%      --     -1%
    dotplus 678510/s    105%      1%      --
    $ perl notnextchar.pl
                Rate dotplqu dotplus notnext
    dotplqu 328949/s      --    -50%    -50%
    dotplus 657112/s    100%      --     -1%
    notnext 661254/s    101%      1%      --
    $ perl notnextchar.pl
                Rate dotplqu dotplus notnext
    dotplqu 330735/s      --    -51%    -53%
    dotplus 669584/s    102%      --     -4%
    notnext 697967/s    111%      4%      --

The non-greedy "(.+?)b" is slow, while "not this next character" ranks up with the greedy "(.+)b". The above uses perl 5.36 on OpenBSD 7.4 on my current laptop; the details will vary over time and implementation and hardware.

Another thing to keep an eye on is how a regular expression will fail to match; the above all anchor the match to the beginning of the string so will not waste time hunting around elsewhere for matches. Bad non-matching input may cause a regular expression to behaveā€¦ poorly, assuming arbitrary malicious input can get to the pattern match: the code may become hilariously slow, or there may be a security issue if an attacker can get a suitably crafted log over to something in fail2ban such that arbitrary IP addresses are blacklisted, or shell commands inserted into system(3) type calls. Or the regular expression engine itself might have a security flaw.

A Second Opinion

How a different regular expression engine behaves might be good to know. PCRE is pretty close to what Perl does, but I like to call it Like Perl But Different (LPBD).

notnextchar.lisp

And there are different results from SBCL 2.3.8.openbsd.sbcl-2.3.8 and cl-ppcre-20230618-git:

    ^(.+)b    26.126 seconds of real time
    ^(.+?)b   76.833 seconds of real time
    ^([^b]+)  43.901 seconds of real time

The greedy match is fast, though the "not this next character" is not as fast as in perl 5.36. The non-greedy remains slower than "not this next character", so probably should be used where possible. Other regular expression engines may need similar tests to see how they behave.

tags #regex

/blog/2022/11/17/regular-expression-alternation.gmi

bphflog links

bphflog index

next: Random Line From File