Building a better HTML-to-gemtext converter

2022-07-09 | #gemtext #html | @Acidus

Converting HTML documents into gemtext sounds straightforward:

And sure enough, you can code a basic converter in an hour or so. However, HTML-to-gemtext converters that process HTML tags like this, in order, from top to bottom, produce some pretty poor results. The reasons have little to do with the minimal capabilities of gemtext, and much more to do with modern HTML. Let's look at the major problems and some ideas I have for making a better converter.

To see firsthand the problems I'm talking about, set up your Gemini client to use my public Gemini-to-HTTP gateway as discussed here:

Stargate: Public Gemini-to-HTTP gateway via Duckling

And try to access the home page of the New York Times:

New York Times (HTTPS)

Rendering content in order

Because gemtext is rendered line-by-line, you want the valuable content near the top of your document, and, where possible, you want to avoid extraneous lines.

HTML doesn't work like this. Content that appears at the bottom of the document may be rendered at the top. Sidebars can appear before the main content. Modern HTML pages have lots of markup for low-value content at the top, such as menus, navigation links, breadcrumbs, login/signup links, and drop-downs.

Lots of scrolling for The New York Times, rendered with the Duckling Proxy

Most of this HTML should be ignored or skipped, or at the very least deferred and rendered at the very bottom.

Gemtext converters that simply convert HTML tags in order end up creating gemtext documents where the first 30%-50% of the output is gross stuff the user has to scroll past to get to the good content. This is the primary pain point when converting arbitrary HTML pages.
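
One way to sketch the "defer it to the bottom" idea, using only Python's standard library. The set of tags treated as low-value (DEFERRED_TAGS) and the function name main_content_first are assumptions for illustration; real pages often wrap their navigation in plain DIVs, so this is a starting point and not a complete solution.

```python
from html.parser import HTMLParser

# Assumption: these elements usually wrap low-value page chrome.
DEFERRED_TAGS = {"nav", "header", "footer", "aside"}


class DeferringExtractor(HTMLParser):
    """Collects text, routing anything inside DEFERRED_TAGS to a second list."""

    def __init__(self):
        super().__init__()
        self.main = []
        self.deferred = []
        self.depth = 0  # number of deferred elements we are currently inside

    def handle_starttag(self, tag, attrs):
        if tag in DEFERRED_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in DEFERRED_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())
        if text:
            (self.deferred if self.depth else self.main).append(text)


def main_content_first(html):
    """Return text chunks with main content first and page chrome last."""
    parser = DeferringExtractor()
    parser.feed(html)
    parser.close()
    return parser.main + parser.deferred
```

Nothing is thrown away here; the menus and footers simply move below the content, where the reader can ignore them.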

Rendering all content

Lots of interactivity on websites involves hiding and showing content. Captions that appear when you hover on an image. Menus and navigation that slide open. Image carousels or pagers that slide in new content. Controls to expand an image when you hover in certain areas.

All of this invisible content usually exists as markup in the HTML, and CSS is used to selectively hide or reveal it. If you just convert all the HTML, your output will include all of this extra, hidden content. The vast majority of this content is extraneous.
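
A minimal sketch of skipping hidden markup, assuming the page is reasonably well-formed and that hiding is done with the hidden attribute or an inline style. Content hidden through a class in an external stylesheet would need a real CSS engine and is not caught here. The names is_hidden and visible_text are illustrative.

```python
import re
from html.parser import HTMLParser

# Void elements never get an end tag, so they must not affect depth tracking.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
HIDDEN_STYLE = re.compile(r"display\s*:\s*none|visibility\s*:\s*hidden", re.I)


def is_hidden(attrs):
    """True if the element carries `hidden` or an inline style that hides it."""
    attrs = dict(attrs)
    return "hidden" in attrs or bool(HIDDEN_STYLE.search(attrs.get("style") or ""))


class VisibleTextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.chunks = []
        self.hidden_depth = 0  # > 0 while inside a hidden subtree

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        # Once inside a hidden element, every nested element deepens the count.
        if self.hidden_depth or is_hidden(attrs):
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.hidden_depth:
            self.hidden_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())
        if text and not self.hidden_depth:
            self.chunks.append(text)


def visible_text(html):
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks
```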

Treating all links the same

Links create a challenge with gemtext. You can only have one link per line. So if you blindly convert every hyperlink in an HTML document into a link line you:

While there are different approaches to handling hyperlinks (such as rendering them at the end of a paragraph), the sheer number of links in a typical HTML page is a big problem. The New York Times home page HTML has 185 anchor tags. Converting all of these links, even if you do it well, would create cluttered output.
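
As an illustration of the "links at the end of a paragraph" approach, here is a small sketch that only looks at P tags: it keeps the paragraph text on one line, then emits one gemtext link line per anchor. The class and function names are hypothetical, and it does nothing about the volume problem described above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ParagraphLinkRenderer(HTMLParser):
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.lines = []
        self.in_p = False
        self.text = []    # raw text of the current paragraph
        self.links = []   # (absolute_url, label) pairs found in it
        self.href = None  # href of the anchor we are inside, if any
        self.label = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
        elif tag == "a" and self.in_p:
            self.href = dict(attrs).get("href")
            self.label = []

    def handle_data(self, data):
        if self.in_p:
            self.text.append(data)
            if self.href:
                self.label.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.href:
            label = " ".join("".join(self.label).split())
            self.links.append((urljoin(self.base_url, self.href), label))
            self.href = None
        elif tag == "p" and self.in_p:
            paragraph = " ".join("".join(self.text).split())
            if paragraph:
                self.lines.append(paragraph)
            # Gemtext link line: "=> URL optional label"
            self.lines += [f"=> {url} {label}".rstrip() for url, label in self.links]
            self.lines.append("")
            self.text, self.links, self.in_p = [], [], False


def render_paragraphs(html, base_url):
    renderer = ParagraphLinkRenderer(base_url)
    renderer.feed(html)
    renderer.close()
    return renderer.lines
```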

So what are these links? Webpage hyperlinks fall into a few categories:

Unfortunately, most HTML-to-gemtext converters treat all of these links the same, when clearly they aren't:

To have a better converter, you need to decide what links are important, and what aren't.

Not considering page "type"

Not all webpages are the same. Consider the front page of a news website. The primary purpose of this page is to include a bunch of headlines, snippets of text, and photos, all linking to more in-depth articles. An efficient way to think of this page is as a list of headlines, each with a link to its article.

If you just convert the HTML on a page like this, you will end up with an odd collection of DIV and P tags, all jumbled up. If you could detect the "type" of page this is, you could just render a list of headlines with links.

Now consider a "content" page, like a blog post or news article. The main content on this page is a series of paragraphs of text. There may be other things like images, block quotes, tables, or lists. But all this content will be in the same rough "block" of content. Surrounding that block will be navigation links, a header, maybe a footer or a sidebar. The most important thing is the block of content.

If you know what "type" of page something is, you can be smarter about how you render it.
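
One crude way to guess the page "type" is link density: the share of visible text that sits inside anchor tags. Front pages are mostly linked headlines, while articles are mostly plain paragraphs. This is only a heuristic sketch; the labels "index" and "content", the function name, and the 0.5 threshold are all assumptions that would need tuning against real pages.

```python
from html.parser import HTMLParser


class LinkDensity(HTMLParser):
    """Counts non-whitespace characters overall and inside <a> elements."""

    def __init__(self):
        super().__init__()
        self.total = 0
        self.linked = 0
        self.a_depth = 0
        self.skip = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.a_depth += 1
        elif tag in ("script", "style"):
            self.skip += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.a_depth:
            self.a_depth -= 1
        elif tag in ("script", "style") and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        if self.skip:
            return
        count = len("".join(data.split()))
        self.total += count
        if self.a_depth:
            self.linked += count


def guess_page_type(html, threshold=0.5):
    parser = LinkDensity()
    parser.feed(html)
    parser.close()
    if parser.total == 0:
        return "unknown"
    return "index" if parser.linked / parser.total >= threshold else "content"
```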

Treating all images the same

I don't care about all the images. In fact, most images on a web page are not valuable (site logo, social media icons on the sharing buttons, ads). I don't want to see them. However, I do want to see the 3 images in the "block" of the content.
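
A sketch of keeping only the images inside the content block, assuming the page marks that block with an ARTICLE or MAIN tag (many pages do not, so CONTENT_TAGS is an assumption, as is the function name). Each kept image becomes a gemtext link line labeled with its alt text.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumption: the content block is wrapped in one of these elements.
CONTENT_TAGS = {"article", "main"}


class ContentImages(HTMLParser):
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.depth = 0  # > 0 while inside the content block
        self.lines = []

    def handle_starttag(self, tag, attrs):
        if tag in CONTENT_TAGS:
            self.depth += 1
        elif tag == "img" and self.depth:
            attrs = dict(attrs)
            if attrs.get("src"):
                alt = (attrs.get("alt") or "").strip() or "Image"
                url = urljoin(self.base_url, attrs["src"])
                self.lines.append(f"=> {url} {alt}")

    def handle_endtag(self, tag):
        if tag in CONTENT_TAGS and self.depth:
            self.depth -= 1


def content_images(html, base_url):
    parser = ContentImages(base_url)
    parser.feed(html)
    parser.close()
    return parser.lines
```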

Special handling for metadata

What title should a converter use on its gemtext output? The first H1 tag in the HTML? The TITLE tag? The heading inside the block of content?

Who is the author of this page? How long is the page? When was the page written? In fact, what is the name of this site?

All of this is important information that should probably be formatted in gemtext in a special way to make it easier to read.
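
As a rough sketch of where this information can come from: many pages carry it in META tags, such as the Open Graph properties og:title and og:site_name, article:published_time, or a plain author META tag. The field names in the returned dictionary and the preference for og:title over the TITLE tag are assumptions, not a recommendation from any spec.

```python
from html.parser import HTMLParser

# Maps a <meta> name/property to the (hypothetical) field name we store it under.
WANTED = {
    "og:title": "title",
    "og:site_name": "site",
    "author": "author",
    "article:published_time": "published",
}


class MetadataExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.meta = {}
        self.in_title = False
        self.title_text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "meta":
            field = WANTED.get(attrs.get("property") or attrs.get("name"))
            if field and attrs.get("content"):
                # First occurrence wins if a page repeats a tag.
                self.meta.setdefault(field, attrs["content"].strip())

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title_text.append(data)


def extract_metadata(html):
    parser = MetadataExtractor()
    parser.feed(html)
    parser.close()
    meta = dict(parser.meta)
    fallback = " ".join("".join(parser.title_text).split())
    if "title" not in meta and fallback:
        meta["title"] = fallback  # fall back to the TITLE tag
    return meta
```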

Existing HTML-to-Gemtext converters aren't very sophisticated

<div>
	<div>
		<div>
			blah
		</div>
	</div>
</div>

gets rendered as

[blank line]
[blank line]
blah
[blank line]
[blank line]
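
A small post-processing pass can clean up this kind of output. The sketch below collapses runs of blank lines into one and trims them from the start and end, while leaving preformatted blocks (toggled by lines starting with three backticks) untouched. The function name is illustrative.

```python
def collapse_blank_lines(lines):
    """Collapse consecutive blank gemtext lines outside preformatted blocks."""
    out = []
    in_pre = False
    for line in lines:
        if line.startswith("```"):
            in_pre = not in_pre  # preformat toggle line
            out.append(line)
        elif in_pre or line.strip():
            out.append(line)     # keep content and preformatted lines as-is
        elif out and out[-1] != "":
            out.append("")       # keep only the first blank line of a run
    while out and out[-1] == "":
        out.pop()                # drop trailing blank lines
    return out
```

With this pass, the nested DIV example above comes out as a single line.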

Making it Better

Fundamentally, the biggest challenge of converting HTML to gemtext is that much of the HTML is not valuable, and blindly converting it results in cluttered output that the user has to scroll past. To make a better converter we need to:

Luckily, there are several web standards, conventions, and technologies we can leverage to write better HTML-to-gemtext converters: