💾 Archived View for thrig.me › blog › 2023 › 04 › 19 › plaintext.gmi captured on 2024-05-10 at 11:31:54. Gemini links have been rewritten to link to archived content

-=-=-=-=-=-=-

plaintext > *

    <g ui>: ok the latest version is at https://docs.google.com/spreadsheets/d/1_vkiwq0IOIJPqZTiomzd4ApUSEQXhEY6CeyZD_6c-PA/edit#gid=449828459
    <g ui>: ye, exchanging old plain text files as the way to go

The gibberish URL doubtless points to some JavaScript Application... until Google shutters the service, as they are wont to do. Then what? With plain text files there are higher odds there's a copy of the document not walled off. And who knows how to escape data from a JavaScript Application. Rumor has it one can write an exporter?

The times and the needs are not so desperate, yet, because the content can also be found at:

https://jbotcan.org/lojban/en/how-the-enemy-came-to-thlunrana/

This is less problematic. Notice how much more information there is in this URL, compared to the first. And the data it contains is much more accessible, even if there is some archaeology involved,

    ...
    <noscript data-reactid="4">
            <iframe src="//www.googletagmanager.com/ns.html?id=GTM-KF5MXGP"
                height="0" width="0" style="display:none;visibility:hidden"></iframe></noscript></div><div id="react-google-tag-manager-gtm" data-reactid="5"><script data-reactid="6">
            (function(w,d,s,l,i){w[l]=w[l]||[];
                w[l].push({'gtm.start': new Date().getTime(),event:'gtm.js', });
                var f=d.getElementsByTagName(s)[0],j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';
                j.async=true;j.src='//www.googletagmanager.com/gtm.js?id='+i+dl;
                f.parentNode.insertBefore(j,f);
            })(window,document,'script','dataLayer','GTM-KF5MXGP');</script>
    ...

and somewhere below the JavaScript latrine layer we find

    ...
    eactid="34"/><br data-reactid="35"/><!-- react-text: 36 -->Even if you can&#x27;
    t read Lojban reading the English original could be elucidating as well.<!-- /re
    act-text --></p></div><div class="content-panel translation kuku" data-reactid="
    37"><table dir="ltr" data-reactid="38"><colgroup data-reactid="39"><col data-rea
    ctid="40"/><col data-reactid="41"/><col data-reactid="42"/><col data-reactid="43
    "/></colgroup><tbody data-reactid="44"><tr data-reactid="45"><th data-reactid="4
    6">ni&#x27;o lisri le nu le bradi cu klama la tlunranan.</th><th data-reactid="4
    7">How the enemy came to Thlunrana</th><th data-reactid="48">ni&#x27;o sei lisri
     be me&#x27;e lu le su&#x27;u le bradi mo&#x27;u klama la tlunranan. se&#x27;u</
    ...

a table. This could have been worse. There might have been snakes.

Granted, one could simply stare at the rendered HTML. That might work. But what if you want to hide the Russian column, or only see the simple lojban column, or have the columns open in parallel files in your text editor? How would you do that? Add more code to an already too bloated browser?

Quick and Dirty

XPath is one way to extract things from HTML. Searching for only the <TABLE> tags and all that they contain we find that the junk or noise ratio of this document isn't too bad. Except for all those "react" things? Not sure why those exist.

    $ xpquery -p HTML '//table' how-the-enemy-came-to-thlunrana.html > x
    $ wc -c how-the-enemy-came-to-thlunrana.html x
       34298 how-the-enemy-came-to-thlunrana.html
       20787 x
       55085 total

contains xpquery, somewhere
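
To put a number on "isn't too bad", the arithmetic on the two byte counts can be sketched in Python. The figures are copied from the wc output above; the variable names are made up for the illustration.

```python
# Byte counts copied from the wc -c output above.
page_bytes = 34298   # the whole HTML page
table_bytes = 20787  # only what the //table query kept

table_share = table_bytes / page_bytes
print(f"table markup:    {table_share:.1%} of the page")
print(f"everything else: {1 - table_share:.1%}")
```

So roughly three fifths of the page is table. That share still counts the "react" attributes inside the table, so the actual text is some unknown amount less.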

CSS selectors are an alternative to XPath, though, as you might tell, my web education has been gently neglected. So with XPath we want all the text from the individual columns of the table, or, obviously

    $ ln -s how-the-enemy-came-to-thlunrana.html y
    $ alias xpqh='xpquery -p HTML'
    $ xpqh '//tr/td[position()=1]/text()' y > simple
    $ xpqh '//tr/td[position()=2]/text()' y > english
    $ xpqh '//tr/td[position()=3]/text()' y > fancy-jbo

to select the text of the Nth <TD> child of each <TR> of whatever tables are present, which hopefully is good enough--the HTML could be borked, or there might be multiple tables, or who knows what. In that case your dig may be a bit more expensive and time-consuming, or will include some amount of junk or missing data.
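
What a second table does to the "Nth <TD> of each <TR>" selection can be sketched in Python. This is an illustration under loud assumptions, not the method used above: the sample markup is made up and is well-formed enough for xml.etree.ElementTree, which a real page may well not be, and the helper name nth_cells is hypothetical. ElementTree only supports a subset of XPath, but the tag[N] form is part of that subset.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed stand-in for a page with two tables.
SAMPLE = """
<div>
  <table><tbody>
    <tr><td>jbo 1</td><td>en 1</td></tr>
    <tr><td>jbo 2</td><td>en 2</td></tr>
  </tbody></table>
  <table><tbody>
    <tr><td>stray 1</td><td>stray 2</td></tr>
  </tbody></table>
</div>
"""

def nth_cells(scope, n):
    """Text of the n-th <td> child (1-based) of every <tr> below scope."""
    return [td.text for td in scope.findall(f".//tr/td[{n}]")]

root = ET.fromstring(SAMPLE)

# Searching the whole document drags in cells from every table.
print(nth_cells(root, 1))

# Narrowing the scope to the first table leaves the stray one out.
first_table = root.find(".//table")
print(nth_cells(first_table, 1))
```

The first call returns the stray cell along with the wanted ones; the second does not. Whether the first table is the right one to keep is something only a look at the actual document can settle.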

    $ sed -n \$p simple | fmt -n | sed -n \$p
    gi'e ku'i xabju vi le remna

A quick check shows that the simple file agrees with what's in the HTML as rendered, however poorly, by w3m. Good enough?

/archive/lojban/how-the-enemy-came-to-thlunrana/

learn you some ancient xpath for great good

Actually, no, the above is not good enough; selecting only on <TD> misses the <TH> at the top of the table, which bears the title of the text. So either two queries will need to be made, first for //tr/th... and then the usual //tr/td... appending to the result file; or, the queries can be combined into a more complicated expression.

    $ xpqh '//tr/*[self::th or self::td][position()=1]/text()' y > simple
    $ xpqh '//tr/*[self::th or self::td][position()=2]/text()' y > english
    $ xpqh '//tr/*[self::th or self::td][position()=3]/text()' y > fancy-jbo
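
The same idea, treating <TH> and <TD> alike and then picking the Nth cell of each row, can be sketched with Python's standard html.parser, which does not need the input to be well-formed XML. Everything here is an assumption made for illustration: the class name CellCollector and the helper column are hypothetical, and the sample rows are placeholders apart from the two titles quoted from the page above. One difference from the XPath queries is that this collects all text inside a cell, including text within nested elements, where text() selects only the cell's own text nodes.

```python
from html.parser import HTMLParser

class CellCollector(HTMLParser):
    """Collect the text of <th> and <td> cells, row by row."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the <tr> being read, if any
        self._cell = None  # text fragments of the open cell, if any

    def _close_cell(self):
        # Finish the open cell, if there is one, and add it to the row.
        if self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._close_cell()  # tolerate a missing closing tag
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self._close_cell()
        elif tag == "tr" and self._row is not None:
            self._close_cell()
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

def column(rows, n):
    """The n-th cell (1-based, as in XPath) of each row that has one."""
    return [row[n - 1] for row in rows if len(row) >= n]

# Header row quoted from the page; the data rows are placeholders.
SAMPLE = """<table dir="ltr"><tbody>
<tr><th>ni&#x27;o lisri le nu le bradi cu klama la tlunranan.</th>
<th>How the enemy came to Thlunrana</th></tr>
<tr><td>jbo placeholder 1</td><td>English placeholder 1</td></tr>
<tr><td>jbo placeholder 2</td><td>English placeholder 2</td></tr>
</tbody></table>"""

parser = CellCollector()
parser.feed(SAMPLE)
parser.close()

for line in column(parser.rows, 1):
    print(line)
```

The title from the <TH> row comes out first, followed by the <TD> rows, with the &#x27; character reference already decoded to an apostrophe.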

Even More Complicated

"AST. Very dangerous. You go first."

More complicated would be to use some library designed to parse and give API access to HTML table contents, but that's more work, and would take me away from the original point of this exercise, which was to study a text in lojban with as few useless distractions as possible: no other columns, no JavaScript, no Google, no writing yet another blog post.

tags #lojban #xpath