💾 Archived View for thrig.me › blog › 2024 › 11 › 22 › regex.gmi captured on 2024-12-17 at 10:41:08. Gemini links have been rewritten to link to archived content
-=-=-=-=-=-=-
"… and now you have two problems." One problem is when people want to do complicated things with just one regular expression, such as to match some filenames, while excluding (or also including?) certain substrings that may or may not be present. The typical answer is "use two regular expressions"
if ($file =~ m/include/ and not $file =~ m/exclude/) { ...
as the code can then be adapted fairly easily to whatever bent business logic the filenames conform to. Alternate solutions include not having a mess of hard-to-match filenames, or perhaps cleaning up temporary files before a backup run is done. Maybe build files could be shuffled off to some other directory or device, and thus be made more difficult or impossible for the backup software or whatever to match. Alas, designing the system betterly is often not an option, so one may (sometimes, often) be stuck with software that only accepts a single regular expression: no code, no considered design of where the files are and how they are named. A pretty typical "troubled brow" situation.
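Where code is allowed, the two-expression approach might look something like the following sketch. The filenames are made up for illustration, and "include" and "exclude" are the same placeholder patterns as above, not anything from a real system.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical filenames, invented for this example.
my @files = qw(
    include-me.txt
    include-but-exclude.txt
    unrelated.txt
);

# Keep names that match the first pattern and do not match the second.
my @wanted = grep { m/include/ and not m/exclude/ } @files;

print "$_\n" for @wanted;
```

Each condition stays a small, readable regex, and a new rule is one more "and" or "and not" clause.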
A first thing one should do is to write a minimal test case that includes a (hopefully) good enough sample of good and bad strings, and whether the particular string should or should not match. Even better might be a formal grammar so that one can prove things, but that's more work, and there probably isn't enough time given to the task for that. Without a minimal test case one may easily end up with a sheet too small to cover the bed, or goal-posts that furiously scurry away as new conditionals appear. Much like the doomsday cloak problem, "why didn't you list that" is a typical retort here.
So assume we're trying to match "*.{a,b,c}" files, or those that end with dot a, dot b, or dot c, but only those that lack "bad" somewhere in their name. "Bad" will likely be something else in production, but it's (hopefully) easy to spot during testing. "Nope", "wrong", "ugh", or so forth could also be used as placeholder strings, depending on your mood and how likely the code is to be shared with those who take a dim view of profanity.
#!/usr/bin/env perl
# matcher - are test strings matched by some bit of code correctly?
use 5.36.0;

my $matcher = ...;

while ( my $line = readline DATA ) {
    chomp $line;
    my ( $expect, $name ) = split ' ', $line, 2;
    my $got    = $matcher->($name);
    my $prefix = ( $got == $expect ) ? "okay" : "FAIL";
    say join "\t", $prefix, $expect, $got, $name;
}

__DATA__
1 good.a
1 good.b
1 good.c
0 bad.a
0 foo.bad.b
0 foobadbar.c
0 a.anope
There are good and bad test cases, and the bad ones have some complexity to ensure that the "bad" really is caught anywhere within the string, not only at the beginning of the line or just before the extension, and also that ".a" really only matches at the end of the line. The code that actually does the match has not been written, and will be a function reference as this will make it easier to switch between (and then to benchmark) different implementations.
Our first matcher implementation will use multiple regular expressions,
my $matcher = sub ($name) {
    if ( $name =~ m/\.[abc]$/ and not $name =~ m/bad/ ) {
        return 1;
    }
    return 0;
};
and everything looks okay.
$ perl matcher.pl
okay	1	1	good.a
okay	1	1	good.b
okay	1	1	good.c
okay	0	0	bad.a
okay	0	0	foo.bad.b
okay	0	0	foobadbar.c
okay	0	0	a.anope
Next up is our first "one regex" implementation. The solutions that follow are not, strictly, regular expressions, in the original sense of that term, or rather a modern "regular expression" can encompass much more than the original sense did. Such as arbitrary code execution.
sub reval ($name) {
    if ( $name =~ m/ \.[abc]$ (?{ m!bad! ? 0 : 1 }) /x ) {
        return $^R;
    }
    return 0;
}
my $matcher = \&reval;
Here the exclude is done within a zero-width code execution assertion: if we match the extension, then check the string under consideration for "bad", and use that return code to accept or fail the string.
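To make the mechanics a bit more visible, here is a small sketch of what the code block sees and hands back. Inside the block, $_ is the string being matched, and the block's value ends up in $^R when the overall match succeeds. The probe function and the uppercasing are purely illustrative and are not part of the matcher.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# probe is a hypothetical helper: it returns whatever the embedded
# code block produced, or undef when the pattern did not match.
sub probe {
    my ($name) = @_;
    # The block only runs once the extension has matched; its value
    # (an uppercased copy of the target string in $_) lands in $^R.
    return $name =~ m/ \.[abc]$ (?{ uc $_ }) /x ? $^R : undef;
}

for my $name (qw(good.a foo.bad.b good.x)) {
    my $seen = probe($name);
    printf "%-10s %s\n", $name,
        defined $seen ? "block returned '$seen'" : "no match";
}
```

Note that the block runs for "foo.bad.b" too: the extension matches, and it is then up to the code in the block to decide what that means.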
A regular expression can also be built up mechanically. Start with "^", then add a positive or negative look-ahead group for each sub-regex, each starting with ".*". This may result in a hilarious amount of backtracking, so one may also want to test against malicious input that attempts to get the regular expression engine into a bad state (out of memory, denial of service because it will take too long to complete, etc). Can attackers supply arbitrary strings?
sub lookaround ($name) {
    if ($name =~ m/
        ^               # step 1: match at beginning
        (?s)            # (allow . to match newlines)
        (?!.*bad)       # step 2: deny "bad"
        (?=.*\.[abc]$)  # step 3: accept "\.[abc]$"
        /x) {
        return 1;
    }
    return 0;
}
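The mechanical construction could itself be done by code, which may help when the single regex has to be generated for some other program to consume. A rough sketch follows, where build_regex is a hypothetical helper invented here that takes a list of patterns to accept and a list to deny. It assumes the sub-regexes are trusted and well-formed.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# build_regex is a made-up name, not part of the original code.
# Each sub-regex becomes a look-ahead that starts with ".*", and
# everything is anchored at the beginning of the string.
sub build_regex {
    my ( $accept, $deny ) = @_;
    my $re = '^(?s)';
    $re .= "(?!.*(?:$_))" for @$deny;      # negative look-aheads
    $re .= "(?=.*(?:$_))" for @$accept;    # positive look-aheads
    return qr/$re/;
}

my $re = build_regex( [ '\.[abc]$' ], [ 'bad' ] );

# Same sample strings as the test script above.
my %expect = (
    'good.a' => 1, 'good.b'    => 1, 'good.c'      => 1,
    'bad.a'  => 0, 'foo.bad.b' => 0, 'foobadbar.c' => 0,
    'a.anope' => 0,
);

for my $name ( sort keys %expect ) {
    my $got = $name =~ $re ? 1 : 0;
    print join( "\t", $got == $expect{$name} ? "okay" : "FAIL",
        $expect{$name}, $got, $name ), "\n";
}
```

The (?:...) wrapper keeps an alternation inside a sub-regex from leaking out into the rest of the look-ahead.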
Yet another method is to instruct the engine to fail if bad whilst also looking for the good. This may involve less backtracking than the previous lookaround method, though one may want to benchmark the different implementations, and also consider how easy the code is to understand and maintain.
sub instruct ($name) {
    $name =~ m/bad(*COMMIT)(*FAIL)|\.[abc]$/ ? 1 : 0;
}
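A benchmark could be sketched with the core Benchmark module. The implementations are restated in compact form so the script stands alone. The long filename is an assumption added here as a crude stand-in for hostile input, and the one CPU second per implementation is arbitrary. The numbers will vary by Perl version and machine, so none are shown.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Compact restatements of the matchers from this post.
my %impl = (
    two_regex  => sub { $_[0] =~ m/\.[abc]$/ and not $_[0] =~ m/bad/ },
    lookaround => sub { $_[0] =~ m/^(?s)(?!.*bad)(?=.*\.[abc]$)/ },
    instruct   => sub { $_[0] =~ m/bad(*COMMIT)(*FAIL)|\.[abc]$/ },
);

my %expect = (
    'good.a' => 1, 'good.b'    => 1, 'good.c'      => 1,
    'bad.a'  => 0, 'foo.bad.b' => 0, 'foobadbar.c' => 0,
    'a.anope' => 0,
);

# Assumption: one long name, to see how each regex copes with size.
$expect{ ( 'x' x 5_000 ) . '.c' } = 1;

my @names = sort keys %expect;

# Refuse to benchmark implementations that give different answers.
for my $label ( sort keys %impl ) {
    for my $name (@names) {
        my $got = $impl{$label}->($name) ? 1 : 0;
        die "$label is wrong for a name of length ", length($name), "\n"
            unless $got == $expect{$name};
    }
}

# Wrap each implementation so one iteration runs every sample name.
my %bench;
for my $label ( keys %impl ) {
    my $code = $impl{$label};
    $bench{$label} = sub { $code->($_) for @names };
}

# A negative count means "run each for at least that many CPU seconds".
cmpthese( -1, \%bench );
```

Checking that the implementations agree before timing them avoids the classic mistake of discovering that the fastest one is fast because it is wrong.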
See also perlre(1) and various other manual pages, in addition to the "Mastering Regular Expressions" book, whose author's name I forget and cannot look up right now as the internet is down following one of those little wind storms that so dearly vex reliable power delivery in this nation-state.
P.S. the "reval" method is perhaps the worst option here but can be used to return specific codes for specific branches, say if you wanted to vary the exit status depending on what branch matched.
$ echo foo|perl -ne '/foo(?{$s=2})|bar(?{$s=3})/;exit $s';echo $?
2
$ echo bar|perl -ne '/foo(?{$s=2})|bar(?{$s=3})/;exit $s';echo $?
3