If you are reading this via Gopher and it looks a bit different,
that's because I spent the past few hours (months?) working on a new method to render HTML into plain text.
When I first set this up I used Lynx because it was easy and I didn't feel like writing the code to do so at the time.
But I've never been fully satisfied with the results [Yeah, I was never a fan of that either –Editor].
So I finally took the time to tackle the issue
(and it is one of the reasons I was timing LPEG expressions the other day
[Nope. –Editor] … um … the other week
[Still nope. –Editor] … um … a few years ago?
[Last month. –Editor]
[Last month? –Sean]
[Last month. –Editor]
[XXXX this timeless time of COVID-19 –Sean]
last month).
The first attempt sank in the swamp.
I wrote some code to parse the next bit of HTML
(it would return either a string,
or a Lua table containing the tag information).
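A minimal sketch of what such a "next bit of HTML" function might look like, in plain Lua string patterns rather than the actual code. The name `next_item` and the field names are assumptions made for illustration, and comments, entities, and unquoted attribute values are deliberately ignored.

```lua
-- Hypothetical sketch: return the next piece of HTML starting at `pos`.
-- Text comes back as a string, a tag comes back as a table; the second
-- result is the position to resume from (nil at end of input).
local function next_item(html, pos)
  if pos > #html then return nil end

  -- A leading '^' anchors string.find at `pos`, so this only matches
  -- a tag that starts exactly here.
  local s, e, slash, name, rest = html:find("^<(/?)(%a%w*)([^>]*)>", pos)
  if not s then
    -- Plain text runs up to the next '<' (or the end of the input).
    local stop = html:find("<", pos + 1, true) or (#html + 1)
    return html:sub(pos, stop - 1), stop
  end

  -- Collect double-quoted attributes only (a simplification).
  local attributes = {}
  for key, value in rest:gmatch('([%w%-]+)%s*=%s*"([^"]*)"') do
    attributes[key:lower()] = value
  end

  return {
    tag        = name:lower(),
    closing    = (slash == "/"),
    attributes = attributes,
  }, e + 1
end
```

Calling it in a loop, feeding the returned position back in, yields an alternating stream of strings and tag tables. Note that nothing here knows which closing tags are optional, which is exactly where this kind of approach gets into trouble.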
And that was fine for recent posts where I bother to close all the tags
(taking into account only the tags that can appear in the body of the document,
<P>
, <DT>
, <DD>
, <LI>
, <THEAD>
, <TFOOT>
, <TBODY>
, <TR>
, <TH>
, and <TD>
do not require a closing tag),
but earlier posts,
say, 1999 through 2002,
don't follow that convention.
So I was faced with two choices: fix the code to recognize when an optional closing tag was missing,
or fix over a thousand posts.
It says something about the code that I started fixing the posts first …
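For illustration, here is one way the first choice (recognizing a missing optional closing tag) could be sketched in plain Lua. This is not the code from either attempt. The `closes` table and the `open_tag` function are hypothetical names, and the table is a partial, hand-written simplification rather than something derived from the DTD.

```lua
-- closes[new][open] is true when opening <new> implies </open>.
-- Partial and simplified: only a few of the optional-close tags are listed.
local closes = {
  p  = { p = true },
  li = { li = true },
  dt = { dt = true, dd = true },
  dd = { dt = true, dd = true },
  tr = { tr = true, td = true, th = true },
  td = { td = true, th = true },
  th = { td = true, th = true },
}

-- Push `name` onto the stack of open elements, first popping every
-- element on top of the stack whose closing tag is implied by `name`.
-- Returns the list of implicitly closed tags, innermost first.
local function open_tag(stack, name)
  local closed  = {}
  local implied = closes[name] or {}
  while #stack > 0 and implied[stack[#stack]] do
    closed[#closed + 1] = table.remove(stack)
  end
  stack[#stack + 1] = name
  return closed
end
```

Because only the top of the stack is examined, a nested list behaves sensibly: an `<LI>` inside a new `<UL>` does not close the outer `<LI>`, since the `<UL>` sits between them.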
I then decided to change my approach and rewrite the HTML parser from scratch.
Starting from the DTD for HTML 4.01 strict I used the re
module
to write the parser,
but I hit some form of internal limit I'm guessing,
because that one burned down,
fell over,
and then sank into the swamp.
I decided to go back to straight LPEG, again following the DTD to write the parser, and this time, it stayed up.
It ended up being a bit under 500 lines of LPEG code,
but it does a wonderful job of being correct
(for the most part—there are three posts I've made that aren't HTML 4.01 strict,
so I made some allowances for those).
It handles not only optional ending tags,
but also the one optional opening tag I have to deal with: <TBODY>
(yup—both the opening and closing tag are optional).
It also knows that <PRE>
tags cannot contain <IMG>
tags, and it preserves whitespace inside <PRE>
(which is not preserved in other tags).
And it checks for the proper attributes for each tag.
Great! I can now parse something like this:
<p>This is my <a href="http://boston.conman.org/">blog</a>. Is this not <em>nifty?</em> <p>Yeah, I thought so.
into this:
tag =
{
  [1] =
  {
    tag = "p",
    attributes = { },
    block = true,
    [1] = "This is my ",
    [2] =
    {
      tag = "a",
      attributes = { href = "http://boston.conman.org/", },
      inline = true,
      [1] = "blog",
    },
    [3] = ". Is this not ",
    [4] =
    {
      tag = "em",
      attributes = { },
      inline = true,
      [1] = "nifty?",
    },
  },
  [2] =
  {
    tag = "p",
    attributes = { },
    block = true,
    [1] = "Yeah, I thought so.",
  },
}
I then began the process of writing the code to render the resulting data into plain text.
I took the classifications that the HTML 4.01 strict DTD uses for each tag
(you can see the <P>
tag above is of type block
and the <EM>
and <A>
tags are type inline
)
and used those to write functions to handle the appropriate type of content: <P>
can only have inline
tags,
<BLOCKQUOTE>
only allows block
type tags,
and <LI>
can have both;
the rendering for inline
and block
types is a bit different,
and handling both types is a bit more complex yet.
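A rough sketch of how that split might look when walking a parsed table of the shape shown earlier. The function names `inline_text` and `render` are hypothetical, and the real renderer certainly does more (links, wrapping, and so on); this only shows the inline/block/mixed distinction.

```lua
-- Flatten inline content (strings and inline tags) into a single string.
local function inline_text(node)
  if type(node) == "string" then return node end
  local parts = {}
  for i = 1, #node do
    parts[i] = inline_text(node[i])
  end
  return table.concat(parts)
end

-- Walk a node and append one string per rendered paragraph to `out`.
-- Block children are rendered recursively; runs of text and inline
-- children between them are gathered into a paragraph of their own,
-- which is what mixed content (such as <LI>) needs.
local function render(node, out)
  out = out or {}
  if node.inline then
    out[#out + 1] = inline_text(node)
    return out
  end

  local run = {}
  local function flush()
    if #run > 0 then
      out[#out + 1] = table.concat(run)
      run = {}
    end
  end

  for i = 1, #node do
    local child = node[i]
    if type(child) == "table" and child.block then
      flush()
      render(child, out)
    else
      run[#run + 1] = inline_text(child)
    end
  end
  flush()
  return out
end
```

Fed the two-paragraph example from above, `render` returns a list of two strings, one per `<P>`, with the text of the `<A>` and `<EM>` tags folded into the first.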
The hard part here is ensuring that the leading characters of <BLOCKQUOTE>
(wherein each line of the rendered text starts with a “| ”)
and of the various types of lists (definition, unordered, and ordered lists) are handled correctly; I think there are still a few spots where it isn't quite correct.
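One possible way to handle the leading characters is to render the inner content first and then prefix each line of the result, so that nesting composes by itself. The sketch below assumes that approach; `prefix_lines` is a hypothetical name, the list markers are made up for the example, and the input is assumed to have no trailing newline.

```lua
-- Prefix every line of `text`.  The first line gets `first`; any
-- following lines get `rest` (which defaults to `first`).  A block
-- quote uses the same prefix throughout, while a list item uses a
-- marker on the first line and matching indentation afterwards.
local function prefix_lines(text, first, rest)
  rest = rest or first
  local lines = {}
  for line in (text .. "\n"):gmatch("(.-)\n") do
    lines[#lines + 1] = (#lines == 0 and first or rest) .. line
  end
  return table.concat(lines, "\n")
end

-- A list item inside a block quote: prefix the item, then the quote.
local item   = prefix_lines("first line\nsecond line", "* ", "  ")
local quoted = prefix_lines(item, "| ")
print(quoted)
```

The tricky spots tend to be the interactions: a list inside a block quote inside a list has to stack three different prefixes in the right order, which is presumably where the not-quite-correct cases come from.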
Overall, I'm happy with the text rendering I did, but I was left with one big surprise …