And I still haven't found what I'm looking for

If I have any text processing to do, I pretty much gravitate towards using LPeg (Lua Parsing Expression Grammar) [1]. Sure, it might take a bit longer to generate code to parse some text, but it tends to be less “write only” than regular expressions.

Besides, you can do some pretty cool things with it. I have some LPeg code that will parse the following strftime() [2] format string:

>
```
%A, %d %B %Y @ %H:%M:%S
```

and generate LPeg code that will parse:

Tuesday, 03 February 2015 @ 20:59:51

into:

>
```
date =
{
min = 57.000000,
wday = 4.000000,
day = 4.000000,
month = 2.000000,
sec = 16.000000,
hour = 20.000000,
year = 2015.000000,
}
```
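The generator itself is not shown here. As a rough idea of how such a thing can be built, the following is a minimal sketch, not the actual code: it assumes only a handful of numeric directives, and the names `directives`, `num` and `compile` are made up for illustration. Day and month names (%A, %B) are left out because they depend on the locale.

```lua
local lpeg = require "lpeg"
local P, R, C, Cg, Ct = lpeg.P, lpeg.R, lpeg.C, lpeg.Cg, lpeg.Ct

local digit = R"09"

-- Match `pat`, convert the matched text to a number, store it under `field`.
local function num(pat, field)
  return Cg(pat / tonumber, field)
end

-- Hypothetical directive table; only a few numeric directives are covered.
local directives = {
  d = num(digit * digit, "day"),
  m = num(digit * digit, "month"),
  Y = num(digit * digit * digit * digit, "year"),
  H = num(digit * digit, "hour"),
  M = num(digit * digit, "min"),
  S = num(digit * digit, "sec"),
}

-- Grammar for the format string itself: every piece is captured as an
-- LPeg pattern, either a directive from the table or a literal run of text.
local directive = P"%" * (C(1) / function(c)
  local pat = directives[c]
  if not pat then error("unsupported directive %" .. c) end
  return pat
end)
local literal = C((P(1) - P"%")^1) / function(s) return P(s) end
local pieces  = Ct((directive + literal)^1) * -1

-- Turn a format string into a pattern that returns a table of fields.
local function compile(fmt)
  local parts = assert(pieces:match(fmt), "malformed format string")
  local pat = P(true)
  for _, piece in ipairs(parts) do
    pat = pat * piece
  end
  return Ct(pat)
end

local date = compile("%d/%m/%Y @ %H:%M:%S"):match("03/02/2015 @ 20:59:51")
print(date.year, date.month, date.day, date.hour, date.min, date.sec)
```

The format string is parsed with LPeg as well, so the same tool both reads the format and builds the parser for the dates.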

Or, if I set my locale [3] correctly, I can turn this:

maŋŋebarga, 03 guovvamánu 2015 @ 21:00:21

into:

>
```
date =
{
min = 0,000000,
wday = 3,000000,
day = 3,000000,
month = 2,000000,
sec = 21,000000,
hour = 21,000000,
year = 2015,000000,
}
```

But one annoyance that hits from time to time—named captures require a constant name. For instance, this pattern:

>
```
pattern = lpeg.Ct(
lpeg.Cg(lpeg.P "A"^1,"class_a")
* lpeg.P":"
* lpeg.Cg(lpeg.P "B"^1,"class_b")
)
```

(translated: when matching a string like AAAA:BBB, return a Lua table [4] (lpeg.Ct()) with the As (lpeg.P()) in field class_a (lpeg.Cg()) and the Bs in field class_b)

applied to this string:

>
```
AAAA:BBB
```

returns this table:

>
```
{
class_a = "AAAA",
class_b = "BBB"
}
```

The field names are constant—class_a and class_b. I'd like a field name based on the input. Now, there is a function lpeg.Cb() that is described as:

Creates a back capture. This pattern matches the empty string and produces the values produced by the most recent group capture [5] named name.
Most recent means the last complete outermost group capture with the given name. A Complete capture means that the entire pattern corresponding to the capture has matched. An Outermost capture means that the capture is not inside another complete capture.

“LPeg - Parsing Expression Grammars For Lua [6]”

A quick reading (and I'm guilty of this) leads me to think this:

>
```
pattern = lpeg.Cg(lpeg.P"A"^1,"name")
* lpeg.P":"
* lpeg.Ct(lpeg.P "B"^1,lpeg.Cb("name"))
```

applied to the string:

>
```
AAAA:BBB
```

returns

>
```
{
AAAA = "BBB"
}
```

But sadly, no. The only example of lpeg.Cb(), used to parse Lua long strings (which start with a “[”, zero or more “=”, and another “[”, followed by the text, and end with a “]”, the same number of “=” as appeared between the two opening “[”, and a final “]”):

>
```
equals = lpeg.P"="^0
open = "[" * lpeg.Cg(equals, "init") * "[" * lpeg.P"\n"^-1
close = "]" * lpeg.C(equals) * "]"
closeeq = lpeg.Cmt(close * lpeg.Cb("init"), function (s, i, a, b) return a == b end)
string = open * lpeg.C((lpeg.P(1) - closeeq)^0) * close / 1
```

shows that lpeg.Cb() was designed with this particular use case in mind (matching one pattern with the same pattern later on), and not what I want.
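To see what lpeg.Cb() does hand back, here is a small sketch using the same AAAA:BBB input. It assumes nothing beyond the LPeg functions already mentioned. The back capture produces the value of the named group; it does not turn that value into a field name.

```lua
local lpeg = require "lpeg"

-- The named group outside a table produces no value by itself;
-- lpeg.Cb("name") later produces the text that group matched.
local pattern = lpeg.Cg(lpeg.P"A"^1, "name")
              * lpeg.P":"
              * lpeg.C(lpeg.P"B"^1)
              * lpeg.Cb("name")

local bs, as = pattern:match("AAAA:BBB")
print(bs, as)
```

Here `bs` holds the string of Bs and `as` holds the string of As, as two plain return values.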

I can do what I want (a field name based upon the input) but the way to go about it is very klunky (in my opinion):

>
```
pattern = lpeg.Cf(
lpeg.Ct("")
* lpeg.Cg(
lpeg.C(lpeg.P"A"^1)
* lpeg.P":"
* lpeg.C(lpeg.P"B"^1)
)
,function(acc,name,value)
acc[name] = value
return acc
end
)
```

This is a “folding capture [7]” (lpeg.Cf()) that accumulates results in a table (lpeg.Ct()); there is only one result here, but it still has to be done this way. Each “value” is a group (lpeg.Cg(), where the name is optional) consisting of a capture (lpeg.C()) of As (lpeg.P()), followed by a colon (matched but not captured), followed by a capture of Bs. The two captured strings are passed to a function that assigns the string of Bs to a field named after the string of As.
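The fold makes a bit more sense once there is more than one pair to accumulate. The following sketch is adapted from the name-value list example in the LPeg manual; the comma separator and the uppercase-only names are assumptions made for illustration, and it assumes an LPeg version where lpeg.Cf() is available. Since rawset() returns the table it modifies, it can stand in for the hand-written function.

```lua
local lpeg = require "lpeg"
local P, R, C, Cg, Cf, Ct = lpeg.P, lpeg.R, lpeg.C, lpeg.Cg, lpeg.Cf, lpeg.Ct

local word = C(R"AZ"^1)                      -- a run of uppercase letters
local pair = Cg(word * P":" * word) * P","^-1 -- name:value, optional comma

-- Start with an empty table, then fold each (name, value) group into it.
local pattern = Cf(Ct"" * pair^0, rawset)

local t = pattern:match("AAAA:BBB,CC:DDD")
print(t.AAAA, t.CC)
```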

It gets even messier when you mix fixed field names with ones based upon the input. If all the field names are defined, it's easy to do something like:

>
```
local lpeg = require "lpeg"
local P, Cg = lpeg.P, lpeg.Cg

eoln = P"\n" -- match end of line
text = (P(1) - eoln) -- match any one character except an end of line
pattern = lpeg.Ct(
P"field_one: " * Cg(text^0,"field_one") * eoln
* P"field_two: " * Cg(text^0,"field_two") * eoln
* P"field_three: " * Cg(text^0,"field_three") * eoln
)
```

against data like this:

>
```
field_one: Lorem ipsum dolor sit amet
field_two: consectetur adipiscing elit
field_three: Cras aliquet enim elit
```

to get this:

>
```
{
field_one = "Lorem ipsum dolor sit amet",
field_two = "consectetur adipiscing elit",
field_three = "Cras aliquet enim elit"
}
```

But if we have some defined fields, but want to accept non-defined field names, then … well … yeah … I haven't found a nice way of doing it. And I find it annoying that I haven't found what I'm looking for.
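For what it's worth, one workaround can be sketched as follows. It still leans on the klunky fold, so it is not the nice way I'm after. The field name `size`, its numeric value, and the rule for what counts as a field name are all assumptions made up for this example; it also assumes an LPeg version where lpeg.Cf() is available.

```lua
local lpeg = require "lpeg"
local P, R, C, Cg, Cf, Ct = lpeg.P, lpeg.R, lpeg.C, lpeg.Cg, lpeg.Cf, lpeg.Ct

local eoln  = P"\n"                          -- end of line
local text  = P(1) - eoln                    -- one character, not end of line
local ident = (R("az", "AZ", "09") + P"_")^1 -- hypothetical field-name syntax

-- A defined field: fixed name, value converted to a number.
local size  = Cg(C(P"size") * P": " * (R"09"^1 / tonumber)) * eoln
-- Any other field: both the name and the value come from the input.
local other = Cg(C(ident) * P": " * C(text^0)) * eoln

-- Try the defined field first, fall back to the generic one,
-- and fold every (name, value) pair into one table.
local record = Cf(Ct"" * (size + other)^0, rawset) * -1

local t = record:match("size: 42\ncolor: red\n")
print(t.size, t.color)
```

Because the choice is ordered, a line such as "size: big" fails the numeric rule and falls through to the generic one, ending up as a plain string.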

[1] http://www.inf.puc-rio.br/~roberto/lpeg/

[2] http://pubs.opengroup.org/onlinepubs/009695399/functions/strftime.html

[3] http://en.wikipedia.org/wiki/Locale

[4] http://www.lua.org/manual/5.3/manual.html#2.1

[5] http://lucy/~spc/docs/LPeg-0.12/lpeg.html#cap-

[6] http://lucy/~spc/docs/LPeg-

[7] http://lucy/~spc/docs/LPeg-
