A good example of building parsing expressions on the fly

As part of the regression test, one of the ways I check the results is to scan through the log files, checking the output matches what we expect to happen. The components (and there a quite a few) log KPI (Key Performance Indicator)s in the format:

>
```
stats|F0021: fooreq=123 fooabrts=0 fooerrs=1 foo-qs=2/14 foo-latency=1.2/4.2/15.2ms foo-bar-xlats=3
```

(Each log line from each component contains a unique identifier which makes it easy to grep for particular lines of interest in the logs.)

In the past week, a major change was made in how the KPI are reported (used to be cumulative over the run of the program—now they're reset as they're logged) required a change in how my testing program processed the logs and I decided this was a good time as any to make a dramatic change in how that's done.

Prior, I used the built in Lua string patterns [1] (think regular expressions) for parsing, but as happens, such code tends to be hard to read, and additional processing is required to massage the resulting values from strings to numbers. So I thought, why not use LPeg [2] for my parsing needs?

>
```
-- The rest of the code assumes the following has been declared
local lpeg = require "lpeg"
local Ct = lpeg.Ct -- capture results into a table
local Cc = lpeg.Cc -- matches "", return given value as capture
local Cg = lpeg.Cg -- capture results and assign to given name
local R = lpeg.R -- range capture
local P = lpeg.P -- literal capture
local S = lpeg.S -- character set capture
```

To parse a number, we look for at least one character in the range of “0” to “9” (inclusive), and pass the match to a function that will convert the string to a numeric value, which is returned as the capture:

>
```
local num = R"09"^1 / function(c) return tonumber(c) end
```

For non-integers, we can use the following bit of code—yes, this won't parse all floating point numbers, but we won't need much more than this for what's logged—an optional minus sign, followed by digits, and an optional decimal point plus more digits, which is also translated t a numeric value:

>
```
local fnum = (P"-"^-1 * num * (P"." * num)^-1) / function(c) return tonumber(c) end
```

We also have a few multi-value KPIs, some with two values, and some with three values. These are parsed with the values returned in a table with the values in conveniently named fields:

>
```
local lat = Ct( -- latency
P"no-data" *
Cg(Cc(0),'min') *
Cg(Cc(0),'avg') *
Cg(Cc(0),'max') *
Cg(Cc(false),'valid')
)
+
Ct(
Cg(fnum,'min') * P'/' *
Cg(fnum,'avg') * P'/' *
Cg(fnum,'max') * P'ms' *
Cg(Cc(true),'valid')
)
local cnt = Ct( -- counts
P"no-data" *
Cg(Cc(0),'avg') *
Cg(Cc(0),'max') *
Cg(Cc(false),'valid')
)
+
Ct(
Cg(fnum,'avg') * P'/' *
Cg(fnum,'max') *
Cg(Cc(true),'valid')
)
```

The only wrinkle here (and it's not much of a wrinkle) is that

>
```
foo-qs=no-data foo-latency=no-data
```

can be logged and we need to take that into account. If no-data is seen, the code returns the following table (if we're parsing the three value KPI; the two value KPI returns something similar):

>
```
{
min = 0.0,
avg = 0.0,
max = 0.0,
valid = false
}
```

otherwise, the following will be returned:

>
```
{
min = 1.2,
avg = 4.2,
max = 15.2,
valid = true
}
```

To parse the actual KPIs, I could have done something like:

>
```
F0021 = Ct(
P": " -- I have a reason for starting the parse at the ':'
* P"fooreq=" * Cg(num,'foofeq') * P" "
* P"fooabrts=" * Cg(num,'fooabrts') * P" "
* P"fooerrs=" * Cg(num,'fooerrs') * P" "
* P"foo-qs=" * Cg(cnt,'foo-qs') * P" "
* P"foo-latency=" * Cg(lat,'foo-latency') * P" "
* P"foo-bar-xlats=" * Cg(num,'foo-bar-xlats')
)
```

But I didn't. Not because it's a lot of typing, but because the probability of errors is high because of the repetitious nature of it—I'm repeating the field names twice (look closely and you can see where I messed up). I'd rather not repeat myself [3], especially since I have multiple lines of KPIs to parse. Ideally, I'd like to specify the name of the field once, something like:

>
```
{
{ "fooreq" , num } , -- might as well include
{ "fooabrts" , num } , -- the type of the KPI
{ "fooerrs" , num } , -- while I'm at it
{ "foo-qs" , cnt } ,
{ "foo-latency" , lat } ,
{ "foo-bar-xlats" , num }
}
```

And, because LPeg is composable [4], I can get away with that:

>
```
local function mkmatch(list)
local pattern = P": "
for i = 1 , #list do
pattern = pattern
* P(list[i][1]) -- look for name of field
* P"="
* Cg(list[i][2],list[i][1]) -- capture value as name
* P" "^0 -- optional space at the end
end
return Ct(pattern) -- return all captures in a table
end
F0021 = mkmatch {
{ "fooreq" , num } ,
{ "fooabrts" , num } ,
{ "fooerrs" , num } ,
{ "foo-qs" , cnt } ,
{ "foo-latency" , lat } ,
{ "foo-bar-xlats" , num }
}
```

And there goes the repetition.

Now, why do I start parsing at the colon? Simple, I use the Lua string patterns to quickly check the log lines—that's fairly easy to see what's going on, and from there, I can call the appropriate LPeg pattern to further parse the KPIs.

[1] http://www.lua.org/manual/5.1/manual.html#5.4.1

[2] http://www.inf.puc-rio.br/~roberto/lpeg/

[3] http://c2.com/cgi/wiki?DontRepeatYourself

[4] /boston/2013/01/14.1

Gemini Mention this post

Contact the author