I'm finding that where I once turned to Lua's [1] regular expressions for parsing, I now turn to LPeg [2]—or rather, to the re [3] module, as I find the code easier to understand once written.
For instance, the regression test program I wrote for work outputs the results of each test:
```
1.a.8 0-0 16-17 scp: ASREQ (1) ASRESP (1) LNPHIT (1) SS7COMP (1) SS7XACT (1) tps: ack-cmpl (1) cache-searches (1) cache-updates (1) termreq (1)
```
Briefly, the first field is the test case ID, and the next two indicate whether certain data files changed. The scp field lists which variables of the SCP (Service Control Point; think of it as a service on a phone switch) were modified (these just happen to be in uppercase), and the tps field lists which variables of the TPS (Transaction Processor Service; our lead developer does have a sense of humor [4]) were modified. But if a variable is added (or removed—it happens), the order can change, which makes checking the results against the expected results a bit of a challenge.
The result is some code to parse the output and check it against the expected results. And for that, I find using the re module for parsing:
```
local re = require "re"

G = [[
line   <- entry -> {}
entry  <- {:id: id :}          %s
          {:seriala: serial :} %s
          {:serialb: serial :} %s
          'scp:' {:scp: items* -> {} :} %s
          'tps:' {:tps: items* -> {} :}
id     <- %d+ '.' [a-z] '.' %d+
serial <- %d+ '-' %d+
items  <- %s* { ([0-9A-Za-z] / '-')+ %s '(' %d+ ')' }
]]

parser = re.compile(G)
```
to be more understandable than using Lua-based regular expressions:
```
function parse(line)
  local res = {}
  local id,seriala,serialb,tscp,ttps = line:match("^(%S+)%s+(%S+)%s+(%S+)%s+scp%:%s+(.*)tps%:%s+(.*)")

  res.id      = id
  res.seriala = seriala
  res.serialb = serialb
  res.scp     = {}
  res.tps     = {}

  for item in tscp:gmatch("%s*(%S+%s%(%d+%))%s*") do
    res.scp[#res.scp + 1] = item
  end

  for item in ttps:gmatch("%s*(%S+%s%(%d+%))%s*") do
    res.tps[#res.tps + 1] = item
  end

  return res
end
```
with both returning the same results:
```
{
  scp =
  {
    [1] = "ASREQ (1)",
    [2] = "ASRESP (1)",
    [3] = "LNPHIT (1)",
    [4] = "SS7COMP (1)",
    [5] = "SS7XACT (1)",
  },
  id = "1.a.8",
  tps =
  {
    [1] = "ack-cmpl (1)",
    [2] = "cache-searches (1)",
    [3] = "cache-updates (1)",
    [4] = "termreq (1)",
  },
  serialb = "16-17",
  seriala = "0-0",
}
```
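And since the whole problem is that the order of the scp and tps items can shift between runs, one way to check actual against expected results is to index the items by name first. Here's a sketch in plain Lua—the `toset` and `same` helpers are my own illustration, not part of the actual test program:

```lua
-- Convert a list of items like { "ASREQ (1)" , "termreq (1)" } into a
-- name -> count map, so comparisons no longer depend on order.
local function toset(items)
  local set = {}
  for _,item in ipairs(items) do
    local name,count = item:match("^(%S+)%s+%((%d+)%)$")
    set[name] = tonumber(count)
  end
  return set
end

-- Compare two such maps for equality, in both directions.
local function same(a,b)
  for name,count in pairs(a) do
    if b[name] ~= count then return false end
  end
  for name in pairs(b) do
    if a[name] == nil then return false end
  end
  return true
end

local actual   = toset { "ASREQ (1)" , "SS7XACT (1)" }
local expected = toset { "SS7XACT (1)" , "ASREQ (1)" }
print(same(actual,expected)) --> true
```

The same-in-both-directions check matters—a one-way loop would miss a variable that appears in the expected results but not the actual ones.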
Personally, I find regular expressions to be an incomprehensible mess of random punctuation and letters, whereas the re module at least lets me label the parts of the text I'm parsing. I also find it easier to see what is happening six months later if I have to revisit the code.
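As a small illustration of that labeling, here's a sketch (assuming LPeg and its re module are installed) that names the pieces of a version-like string—the grammar here is mine, not from the test program:

```lua
local re = require "re" -- ships with LPeg

-- Name the two numbers in a string like "1.42" and collect the
-- named captures into a table, as in the grammar above.
local p = re.compile [[
  ver <- ( {:major: %d+ :} '.' {:minor: %d+ :} ) -> {}
]]

local v = p:match "1.42"
print(v.major,v.minor) --> 1	42
```

Each `{:name: p :}` becomes a field in the captured table, so the result reads like a record instead of a list of anonymous captures.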
Even more importantly, this is a real parser. Would you rather debug a regular expression that just validates an email address [5], or a grammar that validates all defined email headers [6] (email address validation starts at line 464)?
[1] http://www.lua.org/
[2] http://www.inf.puc-rio.br/~roberto/lpeg/
[3] http://www.inf.puc-rio.br/~roberto/lpeg/re.html
[4] http://www.youtube.com/watch?v=Fy3rjQGc6lA
[5] http://ex-parrot.com/~pdw/Mail-RFC822-Address.html
[6] https://github.com/spc476/LPeg-Parsers/blob/master/email.lpeg