A foolish errand might be to try to parse HTML, wherein one may experience (again) that HTML can and does omit closing tags, which may cause your most elegant parser to go pear-shaped. Also the fancy and clean Object Oriented parser... the code is maybe pretty, but it takes pretty long for it to get through the 340,239 characters of the random page linked from some RSS feed.
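To make the omission concrete, here is a minimal sketch with Python's stdlib html.parser (an assumption; the post does not say what the elegant parser was written in). The input is legal HTML, yet an end event for the first P never arrives:

```
# Event-driven parsing of legal HTML that omits a closing tag.
# html.parser reports only what is actually in the input; it does
# not invent the missing </p> for you.
from html.parser import HTMLParser

class ShowEvents(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("start", tag)
    def handle_endtag(self, tag):
        print("end", tag)
    def handle_data(self, data):
        print("data", repr(data))

ShowEvents().feed("<p>one<p>two</p>")
# start p / data 'one' / start p / data 'two' / end p
```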
A not terrible option might be to pipe the HTML through w3m, as w3m does a pretty good job of textifying HTML. This still has problems, as there is no content on the first page of the display, and the handy "skip to main content" link takes you to a "trending" section which, again, has no content. Lucy, with the football. Another page down or three there is the actual content to be had; this is actually not too bad as web pages go. For something like github or reddit I often start paging, my eyes glaze over from all the noise, and suddenly I'm at the bottom of the page having missed what little content there might have been. Also w3m can render things too far indented and unwrapped if there's some table insisting that the text be like that.
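That pipe, scripted; a sketch assuming w3m is installed and the HTML is already fetched (-dump, -T, and -cols are real w3m flags; the function name is invented):

```
# Textify HTML by piping it through w3m: -dump writes the rendered
# page to stdout, -T gives the content type for stdin, -cols sets
# the width. textify is a made-up name.
import subprocess

def textify(html_bytes, width=72):
    out = subprocess.run(
        ["w3m", "-dump", "-T", "text/html", "-cols", str(width)],
        input=html_bytes, capture_output=True, check=True)
    return out.stdout.decode("utf-8", "replace")
```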
Probably one of those expensive AI could summarize the page? But that's even more network and CPU burn.
What sort of gunk reduction are we looking at from a w3m dump or my custom (and certainly buggy) "extract some of the text" parse?
```
$ wc -c originalhtml mytextparse w3mdump | sed 3q
  340239 originalhtml
    8410 mytextparse
   11032 w3mdump
$ wc -c < ~/reference/shakes/richard-iii
  194350
```
Wintry discontent, much? The advantage of a custom parse is that you can get closer to what you want from the document, at the cost of time and of watching the code go sideways, a lot. My text takes 10 spacebars to page through, while w3m takes 15 spacebars. This is still bad; the non-menu non-fluff text on the page is only two spacebars (in a standard 80x24 terminal) or about 1398 characters, with various newlines between paragraphs. Critics may correctly point out that some number of characters will need to be dedicated to various forms of advertising if one wants to lose money less quickly against inflation, but how many characters do you really need for that?
Download the Chunky Bloat Browser(TM) today, and we'll throw in a free, that's right, free salad spinner!!
Critics may also complain that HTTP can compress things. Garbage mashers on the detention block, or a quick gzip later,
```
$ wc -c *z | sed 3q
    4074 mytextparse.gz
   72919 originalhtml.gz
    5091 w3mdump.gz
```
This certainly helps, but the HTML still weighs in an order of magnitude too heavy, and that's only for the main page; the average Chunky Bloat Browser(TM) is going to be gorging itself on images, stylesheets, scripts, FBI warnings to install an ad blocker, videos, and so on.
A better option is probably to skip the web if you can, but then what use would we have for 'quixotic'?
P.S. The Pinwheel Galaxy had a supernova. Or rather, the light from over there finally got itself over here, a few million years later.
P.P.S. There is no free salad spinner.
Mostly, you look for certain tags (which may not have closing tags, lol!!), accumulate random bits of text, and when the tags of interest change, dump out the accumulated text, maybe styled according to the last such tag (PRE, BLOCKQUOTE, are we maybe in a list?) you think you saw. Oh, and maybe some cleanup to trim whitespace and whatnot.
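A sketch of that accumulate-and-dump loop, assuming Python's stdlib html.parser (the tag sets, the 72-column wrap, and the filename are illustrative guesses, not whatever the actual buggy parser does):

```
# Accumulate text; when a block tag of interest opens or closes,
# flush what has accumulated, styled per the tag we think we were in.
from html.parser import HTMLParser
import textwrap

BLOCKS = {"p", "pre", "blockquote", "li", "h1", "h2", "h3"}
SKIP = {"script", "style"}  # gunk whose text we never want

class TextMugger(HTMLParser):
    def __init__(self):
        super().__init__()
        self.buf = []       # accumulated text fragments
        self.block = None   # last block tag we think we saw
        self.skipping = 0   # depth inside script/style

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.skipping += 1
        elif tag in BLOCKS:
            self.flush()        # dump per the previous block's style
            self.block = tag

    def handle_endtag(self, tag):
        if tag in SKIP and self.skipping:
            self.skipping -= 1
        elif tag in BLOCKS:
            self.flush()        # a closing tag, if present, also flushes

    def handle_data(self, data):
        if not self.skipping:
            self.buf.append(data)

    def flush(self):
        text = "".join(self.buf)
        self.buf = []
        if self.block == "pre":
            if text.strip("\n"):
                print(text.strip("\n"), end="\n\n")
            return
        text = " ".join(text.split())  # trim whitespace and whatnot
        if not text:
            return
        if self.block == "blockquote":
            print(textwrap.fill(text, 72, initial_indent="  ",
                                subsequent_indent="  "), end="\n\n")
        elif self.block == "li":
            print(textwrap.fill(text, 72, initial_indent="* ",
                                subsequent_indent="  "), end="\n\n")
        else:
            print(textwrap.fill(text, 72), end="\n\n")

mugger = TextMugger()
mugger.feed(open("original.html").read())  # stand-in filename
mugger.flush()
```

No attempt is made to balance anything; an unclosed PRE happily styles everything up to the next block tag as preformatted, which is roughly the sort of sideways the code goes.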
I was pretty unimpressed with CSS Selectors; maybe they make sense if you actually grok CSS and are doing something HTML-y, and not trying to mug text out of a DOM? Disclaimer: according to the CVS repository for the old website, it was ~2005 when I last really worked with CSS (saying "in anger" would be redundant).
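For contrast, what the CSS Selector style looks like, sketched with BeautifulSoup (an assumption; nothing in the post used it): you describe where the text should live instead of watching tags stream past.

```
# CSS Selectors via BeautifulSoup (not used in the post): address
# the content by structure, then take its text. "main p" is an
# illustrative selector; a real page will need its own incantation.
from bs4 import BeautifulSoup

soup = BeautifulSoup(open("original.html").read(), "html.parser")
for p in soup.select("main p"):
    print(" ".join(p.get_text().split()), end="\n\n")
```

Works great right up until the page's structure changes.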