🧇 NewsWaffle: Read any news website, all via Gemini

2022-07-10 | #cgi #news | @Acidus

Today I'm releasing NewsWaffle, a Gemini gateway to (almost) any news website, allowing you to get lists of current news articles, and read those news articles, all from inside Gemini.

🧇 NewsWaffle

There are already a few news gateways in Gemini, like @sloum's great Geminews, or Jon's mirror of The Guardian:

Geminews, featuring NPR, CNN, and CSM

The Guardian mirror

While awesome, these only work with a few specific "text only" or "lite" versions of news sites.

NewsWaffle is different. It works with nearly any news site:

Wired, via NewsWaffle

The Verge, via NewsWaffle

The Jakarta Post, via NewsWaffle

My home-town paper, the Atlanta Journal-Constitution

In fact, you can supply the URL of any news site you want as well:

Provide a news website URL

Features:

Read (almost) any news site.
Automatically builds a list of news stories, separate from the navigational hyperlinks.
Detects RSS/Atom feeds to provide a more accurate list of news stories.
Uses Readability to show only article content on article pages.
Uses meta data like OpenGraph or Twitter cards to provide richer formatting, and to determine page type.
Uses a modified version of Gemipedia's HTML-to-gemtext library, so it supports images, tables, lists, block quotes, etc.

Why build NewsWaffle?

I like to read news, specifically technical content. However, news websites are increasingly user-hostile, even with an ad-blocker installed:

Cookie consent boxes with dark patterns.
Annoying popups to subscribe to newsletters.
Nag-walls limiting access.
Auto-playing videos.
Sticky headers or footers that cover the content.

I wanted to read news sites via Gemini. But I didn't want to have to hard code support for each site I liked. So I needed to write code that could convert arbitrary news websites.

Structure of a News Website

News sites primarily consist of 2 kinds of pages:

Link Pages: These are pages like the home page or a topic page. Their primary purpose is to present a bunch of headlines, snippets of text, and photos, all linking to more in-depth articles. An efficient way to think of such a page is as a list of headlines, with links to articles.
Article Pages: These are the pages with the actual content. A series of paragraphs, images, block quotes, lists, and tables.

To be able to access any news website via Gemini I need to:

Detect if a page is a link page or an article page.
For a link page, extract the links to news articles, but ignore the navigational links.
For an article page, extract only the article content and meta data, and format it nicely.

How NewsWaffle works

Helping this process is embedded meta data like OpenGraph and Twitter Cards. News websites want to make their content look good when shared on social media, so they tend to use these meta data standards (though not always correctly):

Pages with an "og:type" of "website" tend to be link pages.
Pages with an "og:type" of "article" tend to be article pages.
Meta data is used to determine the proper title, site owner, copyright info to display, feature image, and more.
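
As a rough illustration of the "og:type" check described above, here is a minimal Python sketch that uses only the standard library. The names MetaCollector and guess_page_type are hypothetical and chosen for this example; this is not NewsWaffle's actual code, and a real gateway would combine this signal with other hints, since sites do not always set the meta data correctly.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collects <meta property=...> and <meta name=...> values into a dict."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # OpenGraph uses "property"; Twitter cards and classic tags use "name".
        key = attr.get("property") or attr.get("name")
        content = attr.get("content")
        if key and content is not None:
            # Keep the first value seen for each key.
            self.meta.setdefault(key.lower(), content.strip())


def guess_page_type(html):
    """Return "article", "links", or "unknown" based on og:type alone.

    Assumption for this sketch: og:type is the only signal consulted.
    """
    collector = MetaCollector()
    collector.feed(html)
    og_type = collector.meta.get("og:type", "").lower()
    if og_type == "article":
        return "article"
    if og_type == "website":
        return "links"
    return "unknown"
```

The same meta dictionary could also supply values such as "og:title" or "og:image" for the title and feature image.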

I won't lie. Part of me smiles that I am able to use social media nonsense against them.

Once I know the page type, I can move forward. As I discussed in my last post, converting HTML to Gemtext has a lot of challenges, mainly stemming from the structure of modern HTML.

Building a better HTML-to-gemtext converter

Rendering article pages is pretty easy. I use Readability to extract the article's content. I parse any meta data like OpenGraph, Twitter cards, and old-school <meta> tags, so I can gather semantic information about the content, and then run all of the HTML through a modified version of Gemipedia's HTML-to-gemtext converter.

Link pages are a little trickier:

I fetch all the hyperlinks on the page.
I discard any link that has no link text, points to an external site, or is just an anchor.
I deduplicate the links, and if 2 anchors point to the same URL, I use the one with the longer link text.
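
The three steps above can be sketched in Python with the standard library. This is an illustrative sketch only, not the NewsWaffle source: LinkCollector and all_links are hypothetical names, "external" is assumed to mean a different host than the page itself, and "just an anchor" is assumed to mean a link that resolves back to the current page.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects (href, link text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace in the link text.
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


def all_links(html, page_url):
    """Return {absolute URL: link text} for same-site links on the page."""
    collector = LinkCollector()
    collector.feed(html)
    site = urlparse(page_url).netloc
    page = urldefrag(page_url).url
    best = {}
    for href, text in collector.links:
        if not text:
            continue  # no link text (for example, an image-only link)
        url = urldefrag(urljoin(page_url, href)).url
        if urlparse(url).netloc != site:
            continue  # external site (assumed: any other host)
        if url == page:
            continue  # just an anchor into the current page
        # Deduplicate, keeping the longer link text for each URL.
        if url not in best or len(text) > len(best[url]):
            best[url] = text
    return best
```

Comparing hosts exactly is a simplification; a site that spreads its content across subdomains would need a looser rule.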

That gives me my "All Links" list. Now I want just the links that seem like they point to articles.

Remove from the DOM things that look like navigation: <header> tags, <nav> tags, <footer> tags, funny names in classes, etc.
Fetch all the hyperlinks again, with the same criteria as above. I also discard any where the link text is fewer than 4 words. This helps filter out any links to categories, or authors, or tags, etc.

Now I have a "Content Links" list, which points to likely news articles, and the "All Links" list of everything. If a link appears in the Content Links list, I remove it from the "All Links" list. What remains is a list of links that are probably just navigation links.
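
Here is a minimal Python sketch of that split, again using only the standard library. It is a simplified illustration, not NewsWaffle's code. It assumes the "All Links" list already exists as a mapping of absolute URL to link text, it skips links inside <header>, <nav>, and <footer> rather than editing a DOM, and it does not attempt the class-name heuristics. The names ContentLinkCollector, split_links, NAV_TAGS, and MIN_WORDS are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin

NAV_TAGS = {"header", "nav", "footer"}  # assumed navigation containers
MIN_WORDS = 4  # link text shorter than this is treated as navigation


class ContentLinkCollector(HTMLParser):
    """Collects (href, text) for <a> elements outside navigation containers."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._nav_depth = 0  # > 0 while inside <header>, <nav>, or <footer>
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in NAV_TAGS:
            self._nav_depth += 1
        elif tag == "a" and self._nav_depth == 0:
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in NAV_TAGS and self._nav_depth > 0:
            self._nav_depth -= 1
        elif tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


def split_links(html, page_url, all_links):
    """Split all_links into (content links, navigation links)."""
    collector = ContentLinkCollector()
    collector.feed(html)
    content = {}
    for href, text in collector.links:
        url = urldefrag(urljoin(page_url, href)).url
        # Membership in all_links reapplies the earlier filtering criteria.
        if url in all_links and len(text.split()) >= MIN_WORDS:
            content[url] = all_links[url]
    # Whatever is left over is probably navigation.
    navigation = {u: t for u, t in all_links.items() if u not in content}
    return content, navigation
```

The word-count threshold does most of the work here: headlines are usually at least four words long, while category, author, and tag links rarely are.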

How do I know the page type, and whether to render a page as a "Link View" or as an "Article View"?

I use some other fuzzy logic to try and guess the page type. Whenever I render a webpage as a "Link View", I also give the user an option to switch it to "Article View" and back again. So even if I'm wrong, the user can quickly get to the right content.

Let me see the code!

Sure. You can access it via Gemini or HTTP:

NewsWaffle on GitHub

Why did you name it NewsWaffle?

Because waffles are delicious.