xj — HTML to JSON

xj is a Unix filter that reads XML (or permissively parses HTML) and writes JSON, which makes it convenient to pipe straight into jq, gron or json2tsv.

Usage

wget -qO- https://stedolan.github.io/jq/|xj|jq '..|select(.title?)[][]'

Description

I put it together, but it's just a tiny bit of glue code around the HTML parser and the output combinators, both written by Alex Shinn.

This is an early release, and there's currently a pretty big bug: tabs in the input document are not escaped properly and will cause jq to crash. I'm hoping to fix that in a future release.
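To illustrate why an unescaped tab is a problem, here is a small Python sketch (not part of xj, which is written in Scheme). JSON requires control characters inside strings to be escaped, so a tab has to be written as \t. The variable names are made up for the illustration, and Python's json module stands in for any strict JSON consumer.

```python
import json

# A piece of text data containing a literal tab character.
text = "col1\tcol2"

# A conforming encoder escapes the tab as the two characters \ and t.
good = json.dumps(text)

# Assumption for illustration: an encoder that copies the tab through as-is.
bad = '"' + text + '"'

try:
    json.loads(bad)
    raw_tab_accepted = True
except json.JSONDecodeError:
    # Strict parsers reject raw control characters inside strings.
    raw_tab_accepted = False

print(good)
print(raw_tab_accepted)
```

The first string parses cleanly, while the second is rejected by the strict parser.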

Formal Semantics

An element is an object with one key, the element name, whose value is an array of that element's children, or an empty array if there aren't any. (This is to disambiguate elements from text data.)

Iff there are any attributes, an attribute object is listed first among the children, disambiguated from the other children by having a "@" key. The attributes are not in a list; they can be accessed directly.

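In XML, an element can have several children with the same name, and those children can in turn have children of their own. The same isn't true for attributes: a name can appear only once per element and its value is plain text, which is why attributes can have simpler semantics.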
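The rules above can be sketched in a few lines of Python. This is only an illustration of the described shape, not xj's actual code: the function name to_xj is hypothetical, Python's ElementTree stands in for the parser xj uses, and it assumes that the value under the "@" key is an object mapping attribute names to values and that text, including whitespace, is kept as plain strings among the children.

```python
import json
import xml.etree.ElementTree as ET

def to_xj(elem):
    """Map an element to {name: [children...]} as described above."""
    children = []
    # Attributes, if any, come first, wrapped in an object with an "@" key.
    if elem.attrib:
        children.append({"@": dict(elem.attrib)})
    # Text before the first child element is a plain string child.
    if elem.text:
        children.append(elem.text)
    for child in elem:
        children.append(to_xj(child))
        # Text following a child element belongs to the parent's children.
        if child.tail:
            children.append(child.tail)
    return {elem.tag: children}

doc = ET.fromstring('<p class="x">Hi <b>there</b><br/></p>')
result = to_xj(doc)
print(json.dumps(result))
# → {"p": [{"@": {"class": "x"}}, "Hi ", {"b": ["there"]}, {"br": []}]}
```

Because every element is an object and every piece of text is a string, a consumer can tell the two apart by type alone, and the empty br element still shows up as an object with an empty array.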

Building

Get the source with git clone https://idiomdrottning.org/xj and then, to build it on Debian and derivatives, run:

apt install chicken-bin
chicken-install fmt html-parser srfi-1 utf8
csc -O5 xj.scm

Remove the -O5 when you're hacking.