It can be useful to make some web pages available in Gemini space, but HTML is surprisingly difficult to convert to plain text. A lot of heuristics are involved, and they change along with web fashions.
Firefox does a great job of this in Reader Mode, and the JavaScript code behind it is available to use. The GitLab user @gardenappl has taken this code and turned it into a CLI tool that runs under NodeJS. The project page is linked below and includes instructions on how to install via npm.
https://gitlab.com/gardenappl/readability-cli
This provides a command, 'readable', which processes complex web pages into much-simplified HTML that can then be converted to text with a utility like 'html2text'.
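For reference, installing the tool and doing a first run might look something like this. This is only a sketch: the npm package name is assumed to match the project name, and the URL is a placeholder, so check the project page for the authoritative instructions.

# Install the CLI globally via npm (package name assumed to be 'readability-cli').
npm install -g readability-cli

# Simplify a page and save the cleaned-up HTML (URL is just a placeholder).
readable "https://example.com/some-article" > simplified.html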
Note that Firefox doesn't always offer Reader Mode; it sometimes decides it can't reliably extract the text, in which case it simply doesn't offer the Reader Mode option. By default, the 'readable' command follows exactly the same rules (because it's using code taken straight from the Firefox codebase).
However, you can force 'readable' to always extract text by passing the parameter --low-confidence (or just -l) with the value 'force'. See the example below, and 'readable --help' for more details.
Running html2text on this simplified markup generates good results. Passing the '-width' parameter with some huge number keeps each paragraph on a single line (e.g. html2text -width 4000), which is ideal for gemtext.
Using the latest GitHub version (https://github.com/grobian/html2text) produces a nice 'References' links list at the end of the page. It configured and compiled easily on my Debian 10 system (just ./configure; make; sudo make install).
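For completeness, a build from a fresh checkout might look roughly like the following. This is a sketch of the steps above; if the checkout doesn't ship a ready-made ./configure, the project's autoreconf/bootstrap step would need to be run first.

# Clone and build html2text from the GitHub repository (assumed autotools workflow).
git clone https://github.com/grobian/html2text.git
cd html2text
./configure
make
sudo make install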
The configuration file (e.g. ~/.html2textrc) can be used to further improve gemtext-compatible output during conversion. For example, the following configuration defines suitable header tag equivalents, and specifies that one blank line should separate paragraphs vertically. Note that the backslash (\) character below is escaping a <SPACE> character in the config file.
H1.prefix = #\ 
H1.suffix =
H2.prefix = ##\ 
H2.suffix =
H3.prefix = ###\ 
H3.suffix =
H4.prefix = ###\ 
H4.suffix =
H5.prefix = ###\ 
H5.suffix =
P.vspace.after = 1
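With those settings saved (e.g. as ~/.html2textrc), the rc file can also be pointed to explicitly, in the same way the wrapper script further down does. The input filename here is just a placeholder:

# Convert using the custom rc file; headers come out as #, ##, ### lines.
html2text -links -from_encoding utf8 -width 4000 -rcfile ~/.html2textrc page.html > page.gmi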
This could all be done in a single pipe, but we'll write temp files to allow viewing of the intermediates.
# Fetch the page to a file, force processing even when low-confidence.
readable -l force "https://www.bbc.co.uk/news/business-53337705" >/tmp/readable_test.html

# Convert to text with super-wide line width.
html2text -links -from_encoding utf8 -width 5000 /tmp/readable_test.html >/tmp/readable_test.gmi

# For old version: html2text -utf8 -width 5000 /tmp/readable_test.html >/tmp/readable_test.gmi

# We can check the file here, or post-process for further cleanup.
cat /tmp/readable_test.gmi
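For reference, the single-pipe version mentioned above would look something like this (same options, no intermediate files):

readable -l force "https://www.bbc.co.uk/news/business-53337705" | \
  html2text -links -from_encoding utf8 -width 5000 > /tmp/readable_test.gmi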
It might be possible to configure html2text to generate even more specific output, see 'man 5 html2textrc' for all configuration options. However, at this point it would be easy enough to just grep or sed the file to remove any lines that are not wanted (e.g. lines with image tag alt-text).
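As a sketch, such a cleanup pass might be as simple as the following; the pattern is purely illustrative, so substitute whatever the unwanted lines actually look like in your output:

# Delete lines matching an unwanted pattern, in place (the pattern is a placeholder).
sed -i '/^Some unwanted alt-text pattern$/d' /tmp/readable_test.gmi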
Here's an example of text that was extracted from BBC News, with no extra post-processing. This page was deemed low-confidence by Firefox, meaning it didn't think it would convert well enough. I disagree; it looks more than good enough!
An example page taken from the BBC News website, converted to text.
By default, html2text seems to use the image alt-text in place of a link, presumably because it's considered a valid alternative to the image. I'd prefer all images to have links in the references, because an image is often important to the document's content.
Therefore, I patched the Yacc/Bison code of html2text to output all images as reference links, converting the alt-text to something that can easily be parsed out of the document text for use with the link. It's a little hacky, but it seems to work well enough. The patch is shown below:
diff --git a/HTMLParser.yy b/HTMLParser.yy
index b51b781..27c742e 100644
--- a/HTMLParser.yy
+++ b/HTMLParser.yy
@@ -819,9 +819,18 @@ special:
       istr src = get_attribute(attr.get(), "SRC", "");
       istr alt = get_attribute(attr.get(), "ALT", "");
       /* when ALT is empty, and we have SRC, replace it with a link */
-      if (drv.enable_links && !src.empty() && alt.empty()) {
+      if (drv.enable_links && !src.empty() /* && alt.empty() */) {
        PCData *d = new PCData;
-       string nothing = "";
+       string nothing = "Image";
+       if(!alt.empty()) {
+         istr alt2 = alt.slice(0, 40);
+         alt2 += "_(Image)";
+         int max = alt2.length();
+         for(int idx=0; idx < max; idx++)
+           if(alt2.get(idx) == ' ')
+             alt2.replace(idx, 1, '_');
+         d->text = alt2;
+       } else
         d->text = nothing;
        list<auto_ptr<Element>> *data = new list<auto_ptr<Element>>;
        data->push_back(auto_ptr<Element>(d));
After patching the code, you need to run 'make bison-local' before running 'make' again to rebuild with the generated parser files. The makefile requires Bison >= 3.5, so I had to build this from source on a Debian 10.2 box (it's a painless package to build, but needs an 'apt-get install m4', if it's not already there).
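A rough outline of that Bison-from-source step, followed by the html2text rebuild, is shown below. The Bison version and download URL are assumptions (any release >= 3.5 should do), and paths will differ on your system.

# Prerequisite for building Bison from source.
sudo apt-get install m4

# Fetch, build and install a sufficiently new Bison (version/URL assumed).
wget https://ftp.gnu.org/gnu/bison/bison-3.5.tar.gz
tar xf bison-3.5.tar.gz
cd bison-3.5
./configure && make && sudo make install

# Back in the patched html2text tree, regenerate the parser and rebuild.
cd /path/to/html2text
make bison-local
make
sudo make install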
The patch turns the alt-text into a string with spaces changed to underscores (so we can extract it as a single word), and appends "Image" to the text so that we can differentiate it easily from normal 'href' links.
I know there's a better way to do this, in particular outputting the alt-text directly in the reference, but it requires better understanding of the html2text code than I have.
The following wrapper script ties it all together: it takes a URL and, optionally, 'html' or 'text' to save the intermediate output to a temp file; otherwise it pipes the result through the Awk script described next.

#!/bin/bash

if [[ "${1}" == "" ]]; then
  echo "Usage: $0 URL [html|text]"
  exit
fi

TMPFILE="/tmp/raw_${PPID}"

# Check if we're being asked for the HTML or plain-text intermediates...
if [[ "${2}" == "html" ]]; then
  readable --properties title,html-title,html-content -l force "${1}" >${TMPFILE}.html
  echo "File in ${TMPFILE}.html"
  exit
elif [[ "${2}" == "text" ]]; then
  readable --properties title,html-title,html-content -l force "${1}" | \
    html2text -links -nobs -from_encoding utf8 -width 5000 -rcfile ./html2textrc >${TMPFILE}.txt
  echo "File in ${TMPFILE}.txt"
  exit
fi

# At this point, we want to fully process the URL to Gemini markup
readable --properties title,html-title,html-content -l force "${1}" | \
  html2text -links -nobs -from_encoding utf8 -width 5000 -rcfile ./html2textrc | \
  ./make_gmi.awk
This Awk script figures out a filename from the title and extracts link text for the references. The filename generated is written to stdout.
#!/usr/bin/awk -f

# This script takes the output from html2text, assuming there's a Title: line,
# and writes the main body of the text to a file with a name formed from the
# title.
# The title is expected to be in the first line of the file, and any blank
# lines between the title and the body of the text will be skipped.
# NB: the three-argument form of match() used below is a gawk extension.

BEGIN { in_content = 0 }

# Match the title heading and form a filename from it.
/^Title: / {
    sub(/^Title: /,"")
    filename = $0
    gsub(/[^A-Za-z0-9-]/,"_", filename)
    gsub(/_+/, "_", filename)
    filename = filename ".gmi"
    print filename
    next
}

# Match the links that are listed in the references section
/^ +[0-9]+\. https?:\/\// {
    split($0, arr, " ")
    link_num = arr[1]
    gsub(/[^0-9]/, "", link_num)
    link_text = links[link_num]
    if(length(link_text) >= 5)
        print "=> " arr[2] " Link: " arr[1] " " link_text >filename
    else
        print "=> " arr[2] " Link: " arr[1] >filename
    next
}

# Match link-references in the body of the text, and store the preceding text.
/\[[0-9]+\]/ {
    array_size = split($0, arr, " ")
    for(i in arr) {
        idx = match(arr[i], /(.+)\[([0-9]+)\]/, parts)
        if(idx > 0) {
            links[parts[2]] = parts[1]
        }
    }
}

# The default handling writes lines to our named file.
{
    # Skip any leading blank lines
    if(in_content == 0 && $0 == "")
        next;
    else
        in_content = 1
    print >filename
}
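The Awk stage can also be run on its own against a plain-text intermediate saved by the wrapper script (the temp filename below is just an example of the /tmp/raw_<PID>.txt files it writes); it prints the generated .gmi filename on stdout and writes the gemtext into that file:

./make_gmi.awk < /tmp/raw_12345.txt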