In a recent thread on Mastodon I mentioned some command line tools for working with HTML. Basically: convert it to XML and use the XML command line tools `xmllint` and `xmlstarlet` (see XMLStarlet, the xmllint man page and the libxml page).
To download from the web we use `curl` (see the curl homepage), and to display HTML we use `w3m` with the `-T text/html` option (see the w3m homepage).
The first thing to know is that `xmllint` can *parse HTML* using the `--html` option and evaluate XPath using the `--xpath` option.
curl --silent https://alexschroeder.ch/ | xmllint --html --xpath '//h1/text()' -
The result is “Moved Permanently”, because the site redirects you to the wiki and `curl` doesn't follow redirects unless you pass `--location`.
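To try the same idea without depending on the network, here is a minimal sketch that runs against a local file. The file name `sample.html` and its contents are made up for illustration; only the `--html` and `--xpath` options come from the text above. The second query uses the XPath `string()` function, which should return a plain string and is handy for attribute values.

```shell
# Hypothetical sample page, written to a local file so no network is needed.
cat > sample.html <<'EOF'
<!DOCTYPE html>
<html>
<head><title>Sample</title></head>
<body>
<h1>Diary</h1>
<p>See the <a href="/wiki">wiki</a>.</p>
</body>
</html>
EOF

# Text content of the heading, same XPath as in the curl example.
heading=$(xmllint --html --xpath '//h1/text()' sample.html)
echo "$heading"

# string() turns the first matching attribute into a plain string.
link=$(xmllint --html --xpath 'string(//a/@href)' sample.html)
echo "$link"
```

Capturing the output with `$(...)` also strips the trailing newline, which makes the values easy to compare in a script.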
If you want to explore stuff *interactively*, `xmllint` comes with a built-in shell that treats the DOM as a directory tree!
curl --silent https://alexschroeder.ch/wiki > a.html
xmllint --html --shell a.html
You can now use commands such as these:
cd html/body/div/h1
cat a/text()
grep Diary
du /html/body/div[1]
It’s amazing!
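The shell doesn't have to be used interactively. As far as I can tell it also reads its commands from standard input, so a short sketch like the following should work for scripting. The file `sample.html` and its contents are hypothetical, and the exact formatting of the output (prompts, `ls` columns) may differ between libxml2 versions.

```shell
# Hypothetical sample page to explore.
cat > sample.html <<'EOF'
<!DOCTYPE html>
<html>
<head><title>Sample</title></head>
<body>
<h1>Diary</h1>
<p>Some text.</p>
</body>
</html>
EOF

# Feed shell commands on standard input instead of typing them.
# The prompts are printed as well, so expect some noise around the results.
shell_output=$(printf '%s\n' 'cd /html/body' 'ls' 'cat h1' 'bye' \
  | xmllint --html --shell sample.html)
echo "$shell_output"
```

Here `cd` moves to the body element, `ls` lists its children, `cat h1` serializes the heading relative to the current node, and `bye` leaves the shell.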
If you want to *edit HTML* you need a different tool, though. This fetches a page, converts to XML, deletes the H1, converts back to HTML, and displays it:
curl --silent https://alexschroeder.ch/ \
| xmllint --html --dropdtd - \
| xmlstarlet ed --delete //h1 \
| xmllint --htmlout - \
| w3m -T text/html
We need to drop the DTD or `xmlstarlet` will complain.
#XML #Shell