🧇 NewsWaffle: Read any news website, all via Gemini

2022-07-10 | #cgi #news | @Acidus

Today I'm releasing NewsWaffle, a Gemini gateway to (almost) any news website. It lets you get lists of current news articles, and read those articles, all from inside Gemini.

🧇 NewsWaffle

There are already a few news gateways in Gemini, like @sloum's great Geminews, or Jon's mirror of The Guardian.

Geminews, featuring NPR, CNN, and CSM

The Guardian mirror

While awesome, these only work with a few specific "text only" or "lite" versions of news sites.

NewsWaffle is different. It works with nearly any news site:

Wired, via NewsWaffle

The Verge, via NewsWaffle

The Jakarta Post, via NewsWaffle

My home-town paper, the Atlanta Journal-Constitution

In fact, you can supply the URL of any news site you want as well:

Provide a news website URL

Features:

Why build NewsWaffle?

I like to read news, specifically technical content. However, news websites are increasingly user-hostile, even with an ad-blocker installed:

I wanted to read news sites via Gemini. But I didn't want to have to hard code support for each site I liked. So I needed to write code that could convert arbitrary news websites.

Structure of a News Website

News sites primarily consist of 2 kinds of pages:

To be able to access any news website via Gemini I need to:

How NewsWaffle works

Embedded metadata like OpenGraph and Twitter Cards helps this process. News websites want their content to look good when shared on social media, so they tend to use these metadata standards (though not always correctly):
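As a rough illustration of leaning on that metadata, here is a minimal Python sketch that reads a page's OpenGraph og:type value and uses it to pick a view. This is not NewsWaffle's actual code: the names MetaCollector and guess_page_type are made up for this example, and the rule "og:type of article means Article View, anything else means Link View" is an assumption, far simpler than what a real gateway would need.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta property=...> and <meta name=...> values from a page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        # OpenGraph uses "property"; Twitter Cards and classic tags use "name".
        key = attrs.get("property") or attrs.get("name")
        content = attrs.get("content")
        if key and content:
            # Keep the first value seen for each key.
            self.meta.setdefault(key.lower(), content.strip())


def guess_page_type(html):
    """Return "article" or "links" based on og:type (hypothetical rule)."""
    collector = MetaCollector()
    collector.feed(html)
    og_type = collector.meta.get("og:type", "").lower()
    return "article" if og_type == "article" else "links"


sample = '<html><head><meta property="og:type" content="article"></head></html>'
print(guess_page_type(sample))
```

Because sites don't always fill in this metadata correctly, a guess like this is best treated as one signal among several rather than a final answer.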

I won't lie. Part of me smiles that I am able to use social media nonsense against them.

Once I know the page type, I can move forward. As I discussed in my last post, converting HTML to Gemtext has a lot of challenges, mainly stemming from the structure of modern HTML.

Building a better HTML-to-gemtext converter

Rendering article pages is pretty easy. I use Readability to extract the article's content. I parse any metadata, like OpenGraph, Twitter Cards, and old-school <meta> tags, to gather semantic information about the content. Then I run the HTML through a modified version of Gemipedia's HTML-to-gemtext converter.
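To give a feel for that last conversion step, here is a toy Python sketch that turns already-extracted article HTML into gemtext. It is not Gemipedia's converter or NewsWaffle's code, and it does no Readability-style extraction. It only handles headings, paragraphs, and links, and the names GemtextSketch and html_to_gemtext are invented for this example. Since gemtext links must sit on their own lines, inline links are collected and emitted after the paragraph they appeared in.

```python
from html.parser import HTMLParser

HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}


class GemtextSketch(HTMLParser):
    """Tiny HTML-to-gemtext converter for headings, paragraphs, and links."""

    def __init__(self):
        super().__init__()
        self.lines = []    # finished gemtext lines
        self.text = []     # text of the block being built
        self.prefix = ""   # line prefix for the current block
        self.links = []    # (href, label) pairs seen in the current block
        self.href = None   # href of the <a> we are inside, if any
        self.label = []    # text collected inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS or tag == "p":
            self.flush()
            self.prefix = HEADINGS.get(tag, "")
        elif tag == "a":
            self.href = dict(attrs).get("href")
            self.label = []

    def handle_endtag(self, tag):
        if tag in HEADINGS or tag == "p":
            self.flush()
        elif tag == "a" and self.href:
            label = " ".join("".join(self.label).split())
            self.links.append((self.href, label))
            self.href = None

    def handle_data(self, data):
        self.text.append(data)
        if self.href:
            self.label.append(data)

    def flush(self):
        """Emit the current block, then its links on their own lines."""
        body = " ".join("".join(self.text).split())
        if body:
            self.lines.append(self.prefix + body)
        for href, label in self.links:
            self.lines.append(f"=> {href} {label}".rstrip())
        if body or self.links:
            self.lines.append("")
        self.text, self.links, self.prefix = [], [], ""


def html_to_gemtext(html):
    parser = GemtextSketch()
    parser.feed(html)
    parser.close()
    parser.flush()
    return "\n".join(parser.lines).strip()


print(html_to_gemtext(
    '<h1>Waffles</h1><p>Read <a href="https://example.com/a">more</a> here.</p>'
))
```

Real article HTML is much messier than this (nested lists, images, tables, block quotes), which is where most of the conversion challenges come from.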

Link pages are a little trickier:

That gives me my "All Links" list. Now I want just the links that seem like they point to articles.

Now I have a "Content Links" list, which points to likely news articles, and a list of everything. If a link appears in the "Content Links" list, I remove it from the "All Links" list. What remains is a list of links that are probably just navigation links.
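The subtraction described above can be sketched in a few lines of Python. This is only an illustration, not NewsWaffle's code: the function name split_links and the (url, label) tuple shape are assumptions, and how the "Content Links" list gets chosen is deliberately left out.

```python
def split_links(all_links, content_links):
    """Return (content, navigation) lists, preserving page order.

    Both arguments are lists of (url, label) tuples. How content_links
    gets chosen is outside the scope of this sketch.
    """
    content_urls = {url for url, _ in content_links}
    # Anything not identified as content is treated as navigation.
    navigation = [link for link in all_links if link[0] not in content_urls]
    return list(content_links), navigation


all_links = [
    ("https://example.com/", "Home"),
    ("https://example.com/2022/07/waffles", "Why waffles are delicious"),
    ("https://example.com/about", "About"),
]
content = [("https://example.com/2022/07/waffles", "Why waffles are delicious")]

content_view, nav_view = split_links(all_links, content)
print(nav_view)
```

Matching on the URL rather than the whole tuple means a link that appears twice with different link text still gets removed from the navigation list.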

How do I know the page type, and whether to render a page as a "Link View" or as an "Article View"?

I use some other fuzzy logic to try to guess the page type. Whenever I render a webpage as a "Link View", I also give the user an option to switch it to "Article View" and back again. So even if I'm wrong, the user can quickly get to the right content.

Let me see the code!

Sure. You can access it via Gemini or HTTP.

NewsWaffle on GitHub

Why did you name it NewsWaffle?

Because waffles are delicious.