Building a better HTML-to-gemtext converter

2022-07-09 | #gemtext #html | @Acidus

Converting HTML documents into gemtext sounds straightforward:

Parse HTML into a DOM tree
Visit each tag, and its children
For a small subset of tags, you can convert them into Gemtext (e.g. <p> becomes a new line with content, <h2> tags become "## [some text]", <pre> becomes ```, etc).
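
The steps above can be sketched in a few dozen lines of Python. This is only an illustration of the naive, in-order approach, not any particular converter's code. It assumes the standard library's event-based html.parser instead of a full DOM tree, and the names NaiveConverter, HEADING_PREFIX, and html_to_gemtext are made up for this example.

```python
from html.parser import HTMLParser

# Hypothetical mapping from heading tags to gemtext heading prefixes.
HEADING_PREFIX = {"h1": "# ", "h2": "## ", "h3": "### "}


class NaiveConverter(HTMLParser):
    """Walks tags in document order and emits gemtext lines."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.lines = []   # finished gemtext lines
        self.buffer = []  # text collected for the current block

    def flush(self, prefix=""):
        # Collapse runs of whitespace, then emit the block as one line.
        text = " ".join("".join(self.buffer).split())
        self.buffer = []
        if text:
            self.lines.append(prefix + text)

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self.flush()
            self.lines.append("```")
        elif tag == "p" or tag in HEADING_PREFIX:
            self.flush()

    def handle_endtag(self, tag):
        if tag == "pre":
            # Preformatted text keeps its line breaks and indentation.
            self.lines.extend("".join(self.buffer).strip("\n").split("\n"))
            self.buffer = []
            self.lines.append("```")
        elif tag == "p":
            self.flush()
        elif tag in HEADING_PREFIX:
            self.flush(HEADING_PREFIX[tag])

    def handle_data(self, data):
        self.buffer.append(data)


def html_to_gemtext(html):
    parser = NaiveConverter()
    parser.feed(html)
    parser.close()
    parser.flush()  # emit any trailing text
    return "\n".join(parser.lines)


print(html_to_gemtext("<h2>Title</h2><p>Hello <b>world</b></p>"))
```

Everything else (links, images, lists, navigation) is ignored here, which is exactly where the trouble starts.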

And sure enough, you can code a basic converter in an hour or so. However, HTML-to-gemtext converters that process HTML tags like this, in order, from top to bottom, produce some pretty poor results. The reasons really don't have anything to do with the minimal capabilities of gemtext, and much more to do with modern HTML. Let's see what the major problems are and some ideas I have to make a better converter.

To see first hand the problems I'm talking about, set up your Gemini client to use my public Gemini-to-HTTP gateway as discussed here:

Stargate: Public Gemini-to-HTTP gateway via Duckling

**UPDATE 2023-02-22**: I've replaced Duckling with my own proxy, which uses my own custom HTML-to-gemtext converter that attempts to solve many of the issues presented below.

And try to access the home page of the New York Times:

New York Times (HTTPS)

Rendering content in order

Because gemtext is rendered line-by-line, you want the valuable content near the top of your document, and, where possible, you want to avoid extraneous lines.

HTML doesn't work like this. Content that appears at the bottom of the document may be rendered at the top. Sidebars can appear before the main content. Modern HTML pages have lots of markup for low-value content at the top, such as markup for menus, navigation links, bread crumbs, login/signup links, and drop downs.

Lots of scrolling for The New York Times, rendered with the Duckling Proxy

Most of this HTML should be ignored or skipped, or at the very least deferred and rendered at the very bottom.

Gemtext converters that simply convert HTML tags in order end up creating gemtext documents where the first 30%-50% of the output is gross stuff the user has to scroll past to get to the good content. This is the primary pain point when converting arbitrary HTML pages.

Rendering all content

Lots of interactivity on websites involves hiding and showing content. Captions that appear when you hover on an image. Menus and navigation that slide open. Image carousels or pagers that slide in new content. Controls to expand an image when you hover in certain areas.

All of this invisible content usually exists as markup in the HTML, and CSS is used to selectively hide or reveal it. If you just convert all the HTML, your output will include all of this extra, hidden content. The vast majority of this content is extraneous.
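
As a rough sketch of how a converter might skip some of this hidden content, the Python below ignores any text nested under an element that carries a `hidden` attribute, `aria-hidden="true"`, or an inline style of `display: none` or `visibility: hidden`. It is a simplification under stated assumptions: it only sees inline hints, not rules in external stylesheets, and it assumes reasonably well-balanced markup. The names VisibleText, is_hidden, and visible_text are invented for this example.

```python
import re
from html.parser import HTMLParser

# Void elements never get an end tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
HIDDEN_STYLE = re.compile(r"display\s*:\s*none|visibility\s*:\s*hidden", re.I)


def is_hidden(attrs):
    """True if the element's own attributes mark it as hidden."""
    a = dict(attrs)
    if "hidden" in a:
        return True
    if (a.get("aria-hidden") or "").lower() == "true":
        return True
    return bool(HIDDEN_STYLE.search(a.get("style") or ""))


class VisibleText(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.hidden_depth = 0  # > 0 while inside a hidden subtree
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.hidden_depth or is_hidden(attrs):
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.hidden_depth:
            self.hidden_depth -= 1

    def handle_data(self, data):
        if not self.hidden_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return parser.chunks
```

Content hidden by a class name defined in a stylesheet would slip through this check. Catching that requires either parsing the CSS or keeping a list of class names that commonly mean "hidden".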

Treating all links the same

Links create a challenge with gemtext. You can only have one link per line. So if you blindly convert every hyperlink in an HTML document into a link line, you:

Dramatically increase the vertical height of the page
Impact the readability of the content, since links have to appear on their own line

While there are different approaches to handling hyperlinks (such as rendering them at the end of a paragraph), the sheer number of links in a typical HTML page is a big problem. The New York Times home page HTML has 185 anchor tags. Converting all of these links, even if you do it well, would create cluttered output.

So what are these links? Webpage hyperlinks fall into a few categories:

Navigational links to other sections of site
Links to "content" pages (articles, blog posts, etc)
Links to external websites
Links to media/modal dialogs to show images/high resolution images
Links to other parts of the page (using fragments)

Unfortunately, most HTML-to-gemtext converters treat all of these links the same, when clearly they aren't:

Gemtext doesn't support fragments, so fragment links to content on the same page can be ignored
Navigation links take you somewhere else, so displaying navigational links before providing actual page content is a bad experience. Why show the user a bunch of links to go somewhere else, before showing them the actual page content?
A common pattern on websites is to have an image, and when you click the image, you get a higher resolution version of that image. Since IMG tags are usually converted to link lines in gemtext, this creates multiple links (and thus multiple lines) pointing to the same thing, taking up valuable vertical space.
Links to external websites are often there for supplemental value. For example, you link to a source document, or the home page for a person, or to a YouTube video. Supplemental content is helpful, because it supports what you are reading, but it's often not important enough to disrupt the reading experience.

To have a better converter, you need to decide what links are important, and what aren't.
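
A first pass at this can be done with nothing more than URL handling. The sketch below is an assumption about how it might look, not an existing converter's logic. It resolves each href against the page URL, drops links that only point at a fragment of the same page, drops duplicates that differ only by fragment, and labels the rest as internal or external by host name. The function name classify_links is hypothetical.

```python
from urllib.parse import urldefrag, urljoin, urlsplit


def classify_links(page_url, hrefs):
    """Return (kind, absolute_url) pairs in first-seen order."""
    page_base = urldefrag(page_url).url
    page_host = urlsplit(page_url).netloc.lower()
    seen = set()
    result = []
    for href in hrefs:
        # Resolve relative links, then strip the fragment gemtext can't use.
        absolute = urldefrag(urljoin(page_url, href)).url
        if absolute == page_base:
            continue  # link to another part of the same page
        if absolute in seen:
            continue  # already emitted this target
        seen.add(absolute)
        same_site = urlsplit(absolute).netloc.lower() == page_host
        result.append(("internal" if same_site else "external", absolute))
    return result


links = classify_links(
    "https://example.com/news/story.html",
    ["#comments", "/sports/", "/sports/#top", "https://other.example/doc"],
)
# links == [("internal", "https://example.com/sports/"),
#           ("external", "https://other.example/doc")]
```

Telling navigational links apart from content links takes more context than the URL alone, such as where in the page the link appears.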

Not considering page "type"

Not all webpages are the same. Consider the front page of a news website. The primary purpose of this page is to include a bunch of headlines, snippets of text, and photos, all linking to more in-depth articles. An efficient way to consider this page is as a list of headlines, with links to the articles.

If you just convert the HTML on a page like this, you will end up with an odd collection of DIV and P tags, all jumbled up. If you could detect the "type" of page this is, you could just render a list of headlines with links.

Now consider a "content" page, like a blog post or news article. The main content on this page is a series of paragraphs of text. There may be other things like images, block quotes, tables, or lists. But all this content will be in the same rough "block" of content. Surrounding that block will be navigation links, a header, maybe a footer or a sidebar. The most important thing is the block of content.

If you know what "type" of page something is, you can be smarter about how you render it.

Treating all images the same

I don't care about all the images. In fact, most images on a web page are not valuable (site logo, social media icons on the sharing buttons, ads). I don't want to see them. However, I do want to see the 3 images in the "block" of the content.

Special handling for Meta data

What title should a converter use on its gemtext output? The first H1 tag in the HTML? The TITLE tag? The heading inside the block of content?

Who is the author of this page? How long is the page? When was the page written? In fact, what is the name of this site?

All of this is important information that should probably be formatted in gemtext in a special way to make it easier to read.
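
To make the title question concrete, here is a small sketch that collects the candidates and picks one. The preference order (OpenGraph og:title first, then the first H1, then the TITLE tag) is just one plausible choice I'm assuming for illustration, and MetaCollector and pick_title are made-up names.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collects OpenGraph properties, the TITLE text, and the first H1."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.og = {}          # e.g. {"og:title": "...", "og:site_name": "..."}
        self.title = ""
        self.first_h1 = ""
        self._capture = None  # "title" or "h1" while inside that tag
        self._seen_h1 = False

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and (a.get("property") or "").startswith("og:"):
            # Keep the first value seen for each property.
            self.og.setdefault(a["property"], a.get("content") or "")
        elif tag == "title" and not self.title:
            self._capture = "title"
        elif tag == "h1" and not self._seen_h1:
            self._capture = "h1"
            self._seen_h1 = True

    def handle_endtag(self, tag):
        if tag == self._capture:
            self._capture = None

    def handle_data(self, data):
        if self._capture == "title":
            self.title += data
        elif self._capture == "h1":
            self.first_h1 += data


def pick_title(html):
    meta = MetaCollector()
    meta.feed(html)
    meta.close()
    # Assumed preference order; a real converter may choose differently.
    for candidate in (meta.og.get("og:title"), meta.first_h1, meta.title):
        if candidate and candidate.strip():
            return " ".join(candidate.split())
    return ""
```

The same collector also picks up og:site_name and similar properties when a page provides them, which answers the "what is the name of this site?" question for those pages.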

Existing HTML-to-Gemtext converters aren't very sophisticated

Whenever they encounter a block-level tag, they create a newline in the gemtext. So:

<div>
	<div>
		<div>
			blah
		</div>
	</div>
</div>

gets rendered as

[blank line]
[blank line]
blah
[blank line]
[blank line]

They convert all content, even content that is hidden with CSS.
They contain bugs, and output raw HTML for things like <iframe>, <style>, or <script type="template">
They don't understand modern HTML tags like <picture> or <figure>, so they render a lot of bogus content.
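
The blank-line problem with nested block-level tags, at least, is cheap to paper over after the fact. A minimal sketch, assuming the converter has already produced gemtext and that preformatted blocks are fenced with ``` lines (collapse_blank_lines is a hypothetical helper name):

```python
def collapse_blank_lines(gemtext):
    """Squeeze runs of blank lines to one, leaving preformatted blocks alone."""
    out = []
    in_pre = False
    for line in gemtext.split("\n"):
        if line.startswith("```"):
            # Toggle preformatted mode; fence lines are always kept.
            in_pre = not in_pre
            out.append(line)
            continue
        if not in_pre and not line.strip():
            if not out or out[-1] == "":
                continue  # skip leading and repeated blank lines
            out.append("")
            continue
        out.append(line)
    # Drop blank lines left at the very end of the document.
    while out and out[-1] == "":
        out.pop()
    return "\n".join(out)


print(collapse_blank_lines("\n\nblah\n\n\n"))  # prints "blah"
```

A better fix is to not emit the blank lines in the first place, for example by only starting a new line when a block-level tag actually contains text.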

Making it Better

Fundamentally, the biggest challenge of converting HTML to gemtext is that much of the HTML is not valuable, and blindly converting it results in cluttered output that the user has to scroll through. To make a better converter we need to:

Separate valuable content from content that is not valuable
Understand which hyperlinks are valuable, and separate navigational links from content links

Luckily, there are several web standards, conventions, and technologies we can leverage to write better HTML-to-gemtext converters:

HTML5 has semantic elements like <article>, <header>, <footer>, and <nav> which can help you distinguish primary content from secondary or navigational content.
Modern tags like <figure> and <figcaption> allow much better conversion of images into a link line.
CSS classes and style attributes can be used to detect and skip hidden and invisible content.
ARIA roles can be used to identify invisible and unneeded HTML and skip it.
Intelligence can be applied to how links are handled to reduce their number (e.g. Have we seen the link before? Have we seen the link text before? Is it a link to the same site? Is it just a fragment link?)
An RSS or Atom feed referenced in the HTML can give us a list of valuable links to content.
Meta data like OpenGraph and JSON-LD can give us information like the type of page, the name of the site, the canonical title of the page, and information about authors, feature images, and publication dates.
Readability: Readability is a library that powers most of the "Reader View" features in browsers and apps. It extracts just the text of an article. This is a great way to find "valuable" content.
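
As one illustration of the first idea in this list, the sketch below uses HTML5 semantic elements to push secondary content to the bottom of the output. It treats anything inside <nav> or <aside>, and any page-level <header> or <footer>, as secondary, while keeping a <header> that sits inside an <article>. That split is my own assumption for the example, as are the names Splitter and reorder, and it only helps on pages that actually use these tags.

```python
from html.parser import HTMLParser

TRACKED = {"article", "nav", "aside", "header", "footer"}


class Splitter(HTMLParser):
    """Sorts text into primary content and deferrable secondary content."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []      # currently open tracked elements
        self.primary = []
        self.secondary = []

    def handle_starttag(self, tag, attrs):
        if tag in TRACKED:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in TRACKED and tag in self.stack:
            # Pop back to the matching open tag, tolerating missing end tags.
            while self.stack.pop() != tag:
                pass

    def is_secondary(self):
        in_article = False
        for tag in self.stack:
            if tag == "article":
                in_article = True
            elif tag in ("nav", "aside"):
                return True
            elif not in_article:
                return True  # page-level <header> or <footer>
        return False

    def handle_data(self, data):
        text = " ".join(data.split())
        if text:
            (self.secondary if self.is_secondary() else self.primary).append(text)


def reorder(html):
    """Return text lines with primary content first, secondary content last."""
    splitter = Splitter()
    splitter.feed(html)
    splitter.close()
    return splitter.primary + splitter.secondary
```

The secondary lines could just as easily be dropped instead of deferred. Keeping them at the bottom is the more conservative choice, since a wrong guess then costs some scrolling but loses no content.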