If you are reading this via Gopher and it looks a bit different, that's because I spent the past few hours (months?) working on a new method to render HTML (HyperText Markup Language) into plain text. When I first set this up [1] I used Lynx [2] because it was easy and I didn't feel like writing the code to do so at the time. But I've never been fully satisfied with the results [Yeah, I was never a fan of that either –Editor]. So I finally took the time to tackle the issue (and it's one of the reasons I was timing LPEG (Lua Parsing Expression Grammar) expressions [3] [DELETED-the other day-DELETED] [Nope. –Editor][DELETED- … um … the other week-DELETED] [Still nope. –Editor][DELETED- … um … a few years ago?-DELETED] [Last month. –Editor] [Last month? –Sean] [Last month. –Editor] [XXXX this timeless time of COVID-19 –Sean] last month).
The first attempt sank in the swamp. I wrote some code to parse the next bit of HTML (it would return either a string, or a Lua table containing the tag information). That was fine for recent posts, where I bother to close all the tags (taking into account only the tags that can appear in the body of the document, <P>, <DT>, <DD>, <LI>, <THEAD>, <TFOOT>, <TBODY>, <TR>, <TH>, and <TD> do not require a closing tag), but earlier posts, say from 1999 through 2002, don't follow that convention. So I was faced with two choices—fix the code to recognize when an optional closing tag was missing, or fix over a thousand posts.
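The shape of that first attempt can be sketched like this (a toy illustration, not my actual code—attribute parsing in particular is drastically simplified):

```lua
-- Toy next-token function: given the HTML and a position, return
-- either a run of plain text or a table describing the next tag,
-- plus the position where scanning should resume.
local function next_token(html,pos)
  pos = pos or 1
  if pos > #html then return nil end

  if html:sub(pos,pos) == '<' then
    local close,name,attrs,after = html:match("^<(/?)(%w+)([^>]*)>()",pos)
    local tag = { tag = name:lower(), closing = close == '/', attributes = {} }
    for k,v in attrs:gmatch('(%w+)="([^"]*)"') do
      tag.attributes[k:lower()] = v
    end
    return tag,after
  else
    -- plain text runs up to the next tag
    local text,after = html:match("^([^<]+)()",pos)
    return text,after
  end
end
```

A loop calling this gives you a flat stream of strings and tag tables—and that flatness is exactly why missing optional closing tags become the caller's problem.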
It says something about the code that I started fixing the posts first …
I then decided to change my approach and rewrite the HTML parser from scratch. Starting from the DTD (Document Type Definition) for HTML 4.01 strict [4], I used the re module [5] to write the parser, but I'm guessing I hit some form of internal limit, because that one burned down, fell over, and then sank into the swamp.
I decided to go back to straight LPEG, again following the DTD to write the parser, and this time, it stayed up.
It ended up being a bit under 500 lines of LPEG code [6], but it does a wonderful job of being correct (for the most part—there are three posts I've made that aren't HTML 4.01 strict, so I made some allowances for those). It not only handles optional ending tags, but also the one optional opening tag I have to deal with—<TBODY> (yup—both the opening and closing tag are optional). It knows that <PRE> tags cannot contain <IMG> tags, and it preserves whitespace inside <PRE> (it's not significant in other tags). It also checks for the proper attributes on each tag.
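To give a flavor of the optional-closing-tag trick (this is a drastically cut-down, hypothetical grammar, nothing like the real 500 lines—it only knows about <P>): a paragraph ends at an explicit </p>, or implicitly when the next <p> or the end of input shows up.

```lua
local lpeg = require "lpeg" -- assumes the standard LuaRocks lpeg module
local P,V,C,Cc,Cg,Ct = lpeg.P,lpeg.V,lpeg.C,lpeg.Cc,lpeg.Cg,lpeg.Ct

local grammar = P {
  "body",
  body = Ct(V"p"^0),
  p    = Ct(
           Cg(Cc"p","tag")          -- record the tag name
         * Cg(Cc(true),"block")     -- <P> is a block-level element
         * P"<p>"
         * V"text"^0
         * (P"</p>"                 -- explicit close,
           + #P"<p>"                -- or implicit close at next <p>,
           + P(-1))                 -- or implicit close at end of input
         ),
  text = C((1 - P"<")^1),
}
```

The lookahead (`#P"<p>"`) is the key move: it ends the current element without consuming the next opening tag, so the outer repetition can pick it up.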
Great! I can now parse something like this:
<p>This is my <a href="http://boston.conman.org/">blog</a>. Is this not <em>nifty?</em> <p>Yeah, I thought so.
into this:
tag =
{
  [1] =
  {
    tag = "p",
    attributes = { },
    block = true,
    [1] = "This is my ",
    [2] =
    {
      tag = "a",
      attributes = { href = "http://boston.conman.org/" },
      inline = true,
      [1] = "blog",
    },
    [3] = ". Is this not ",
    [4] =
    {
      tag = "em",
      attributes = { },
      inline = true,
      [1] = "nifty?",
    },
  },
  [2] =
  {
    tag = "p",
    attributes = { },
    block = true,
    [1] = "Yeah, I thought so.",
  },
}
I then began the process of writing the code to render the resulting data into plain text. I took the classifications that the HTML 4.01 strict DTD uses for each tag (you can see the <P> tag above is of type block and the <EM> and <A> tags are type inline) and used those to write functions to handle the appropriate type of content—<P> can only have inline tags, <BLOCKQUOTE> only allows block type tags, and <LI> can have both; the rendering for inline and block types is a bit different, and handling both types is a bit more complex yet.
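The dispatch might look something like this (a hypothetical sketch of the type-directed approach, not my actual renderer, which also handles wrapping, leaders, and mixed content):

```lua
-- Sketch: strings pass straight through; inline tags render in-line;
-- block tags each become their own paragraph.  A handler for an
-- inline-only container simply never recurses into block content.
local render_inline, render_block

function render_inline(node)
  local out = {}
  for _,item in ipairs(node) do
    if type(item) == "string" then
      out[#out+1] = item
    elseif item.inline then
      out[#out+1] = render_inline(item)
    end -- block content isn't legal inside inline-only tags
  end
  return table.concat(out)
end

function render_block(node)
  local out = {}
  for _,item in ipairs(node) do
    if type(item) == "table" and item.block then
      out[#out+1] = render_inline(item) -- e.g. <P> holds inline content
    end
  end
  return table.concat(out,"\n\n")
end
```

Running `render_block()` over the parse tree above would produce the two paragraphs separated by a blank line.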
The hard part here is ensuring that the leading characters of <BLOCKQUOTE> (where each line of the rendered text starts with a “| ”) and of the various types of lists (definition, unordered and ordered lists) are handled correctly—I think there are still a few spots where it isn't quite correct.
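The basic idea behind the <BLOCKQUOTE> leader is simple enough on its own (a hypothetical helper—the hard part the text mentions is nesting this with the list leaders):

```lua
-- Render the quoted content first, then prefix every non-empty line
-- of the result with the "| " leader.  Blank lines between paragraphs
-- are left alone.
local function quote(text)
  return (text:gsub("[^\n]+",function(line) return "| " .. line end))
end
```

So `quote("one\ntwo")` yields `"| one\n| two"`; nesting means each enclosing <BLOCKQUOTE> or list has to prepend its own leader to lines the inner ones already prefixed.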
Overall, I'm happy with the text rendering I did, but I was left with one big surprise [7] …
[4] https://www.w3.org/TR/html4/strict.dtd
[5] http://www.inf.puc-rio.br/~roberto/lpeg/re.html