2017-08-11 XML and the Command Line

In a recent thread on Mastodon I mentioned some command line tools for working with HTML. Basically: convert it to XML and use the XML command line tools `xmllint` and `xmlstarlet` (see XMLStarlet, the xmllint man page and the libxml page).

on Mastodon

XMLStarlet

xmllint man page

libxml page

To download from the web we use `curl` (see the curl homepage), and to display HTML we use `w3m` with the `-T text/html` option (see the w3m homepage).

curl homepage

w3m homepage

The first thing to know is that `xmllint` can *parse HTML* using the `--html` option and evaluate XPath using the `--xpath` option.

XPath

curl --silent https://alexschroeder.ch/ | xmllint --html --xpath '//h1/text()' -

The result is “Moved Permanently” (because you’re being redirected to the wiki).
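The same approach works on a local file, which is handy for experimenting without hitting the network. This is a minimal sketch, assuming a made-up file called `sample.html` (it is not part of the setup above). Wrapping the expression in `string()` returns the text of the first match, and `2>/dev/null` hides any parser warnings.

```shell
# sample.html is a hypothetical test file used only for this illustration
cat > sample.html <<'EOF'
<html><head><title>Sample</title></head>
<body><h1>Diary</h1><p><a href="/wiki">Wiki</a></p></body></html>
EOF

# text content of the first h1 element
title=$(xmllint --html --xpath 'string(//h1)' sample.html 2>/dev/null)
echo "$title"

# value of the href attribute of the first link
href=$(xmllint --html --xpath 'string(//a/@href)' sample.html 2>/dev/null)
echo "$href"
```

Capturing the results in variables makes it easy to use them later in a script.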

If you want to explore stuff *interactively*, `xmllint` comes with a built-in shell that treats the DOM as a directory tree!

DOM

curl --silent https://alexschroeder.ch/wiki > a.html
xmllint --html --shell a.html

You can now use commands such as these:

cd html/body/div/h1
cat a/text()
grep Diary
du /html/body/div[1]

It’s amazing!
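The shell also reads its commands from standard input, so a session can be scripted. This is a rough sketch, assuming a hypothetical local file `sample.html`. The prompts and separators that `xmllint` prints between commands may differ between versions, so no exact output is shown here.

```shell
# sample.html is a hypothetical test file used only for this illustration
cat > sample.html <<'EOF'
<html><body><div><h1><a href="/">Diary</a></h1></div></body></html>
EOF

# feed shell commands on stdin instead of typing them at the prompt
xmllint --html --shell sample.html <<'EOF'
cd /html/body/div/h1
cat a/text()
du /html/body
bye
EOF
```

The `bye` command ends the session. Without it, the shell stops when it reaches the end of the input.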

If you want to *edit HTML* you need a different tool, though. This fetches a page, converts to XML, deletes the H1, converts back to HTML, and displays it:

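curl --silent https://alexschroeder.ch/ \
| xmllint --html --xmlout --dropdtd - \
| xmlstarlet ed --delete '//h1' \
| xmllint --htmlout - \
| w3m -T text/html

We need to drop the DTD or `xmlstarlet` will complain. The `--xmlout` option makes `xmllint` write XML instead of HTML, so that empty elements such as `br` come out closed and `xmlstarlet` can parse them.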
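Other edits follow the same pattern. As a sketch under the same assumptions (a hypothetical local `sample.html`, and `xmllint --xmlout` to get XML out of the HTML parser), this changes the text of the H1 with `xmlstarlet ed --update` and then prints it with `xmlstarlet sel` to check the result:

```shell
# sample.html is a hypothetical test file used only for this illustration
cat > sample.html <<'EOF'
<html><head><title>Sample</title></head>
<body><h1>Old title</h1><p>First<br>second</p></body></html>
EOF

# parse as HTML, write XML without the DTD, replace the h1 text, print the h1
result=$(xmllint --html --xmlout --dropdtd sample.html 2>/dev/null \
  | xmlstarlet ed --update '//h1' --value 'New title' \
  | xmlstarlet sel --template --value-of '//h1' --nl)
echo "$result"
```

The final `sel` step is only there to verify the edit. In a real pipeline you would pass the edited document on to the next tool instead.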

​#XML ​#Shell