2022-07-10 | #cgi #news | @Acidus
Today I'm releasing NewsWaffle, a Gemini gateway to (almost) any news website, allowing you to get lists of current news articles, and read those articles, all from inside Gemini.
There are already a few news gateways in Gemini, like @sloum's great Geminews, or Jon's mirror of The Guardian:
Geminews, featuring NPR, CNN, and CSM
While awesome, these only work with a few specific "text only" or "lite" versions of news sites.
NewsWaffle is different. It works with nearly any news site:
The Jakarta Post, via NewsWaffle
My home-town paper, the Atlanta Journal-Constitution
In fact, you can supply the URL of any news site you want as well:
I like to read news. Specifically technical content. However, news websites are increasingly user-hostile, even with an ad-blocker installed:
I wanted to read news sites via Gemini. But I didn't want to have to hard code support for each site I liked. So I needed to write code that could convert arbitrary news websites.
News sites primarily consist of 2 kinds of pages:
To be able to access any news website via Gemini I need to:
Helping this process is embedded meta data like OpenGraph and Twitter Cards. News websites want to make their content look good when shared on social media, so they tend to use these meta data standards (though not always correctly):
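To make the metadata idea concrete, here is a minimal sketch of pulling OpenGraph and Twitter Card values out of a page using only Python's standard library. This is not NewsWaffle's actual code; the names MetaTagParser and extract_social_metadata are made up for illustration, and a real gateway would need to cope with far messier markup.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collect OpenGraph and Twitter Card values from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        # OpenGraph normally uses property="og:...", Twitter Cards use
        # name="twitter:...", but sites mix them up, so accept either.
        key = (attributes.get("property") or attributes.get("name") or "").lower()
        value = attributes.get("content")
        if value and key.startswith(("og:", "twitter:")):
            # Keep the first value seen for each key.
            self.meta.setdefault(key, value.strip())


def extract_social_metadata(html):
    """Return a dict such as {"og:type": "article", "og:title": "..."}."""
    parser = MetaTagParser()
    parser.feed(html)
    return parser.meta


if __name__ == "__main__":
    page = '<head><meta property="og:type" content="article"></head>'
    print(extract_social_metadata(page))
```

Accepting either attribute name is a deliberate choice here, since sites do not always use these standards correctly.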
I won't lie. Part of me smiles that I am able to use social media nonsense against them.
Once I know the page type, I can move forward. As I discussed in my last post, converting HTML to Gemtext has a lot of challenges, mainly stemming from the structure of modern HTML.
Building a better HTML-to-gemtext converter
Rendering article pages is pretty easy. I use Readability to extract out the article's content. I parse any meta data like OpenGraph, Twitter cards, and old-school <meta> tags, so I can gather semantic information about the content, and run the HTML all through a modified version of Gemipedia's HTML-to-gemtext converter.
Link pages are a little trickier:
That gives me my "All Links" list. Now I want just the links that seem like they point to articles.
Now I have a "Content Links" list, which points to likely news articles, and a list of everything. If a link appears in the Content Links list, I remove it from the "All Links" list. What remains is a list of links that are probably just navigation links.
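The subtraction step can be sketched roughly like this. It assumes each link is a (url, text) pair and that the Content Links have already been picked out somehow; the function name split_links is hypothetical and this is only an illustration, not the gateway's real code.

```python
def split_links(all_links, content_links):
    """Split links into (content, navigation), keeping page order.

    Both arguments are lists of (url, text) tuples. Anything in
    all_links whose URL also appears in content_links is treated as
    content; whatever is left over is assumed to be navigation.
    """
    content_urls = {url for url, _ in content_links}
    navigation = [
        (url, text) for url, text in all_links if url not in content_urls
    ]
    return list(content_links), navigation


if __name__ == "__main__":
    everything = [
        ("https://example.com/", "Home"),
        ("https://example.com/2022/07/some-story", "Some story headline"),
        ("https://example.com/sports", "Sports"),
    ]
    articles = [("https://example.com/2022/07/some-story", "Some story headline")]
    content, nav = split_links(everything, articles)
    print(nav)
```

Matching on the URL rather than the whole pair means a story linked twice with different link text is still removed from the navigation list.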
How do I know the page type, and whether to render a page as a "Link View" or as an "Article View"?
I use some other fuzzy logic to try and guess the page type. Whenever I render a webpage as a "Link View", I also give the user an option to switch it to "Article View" and back again. So even if I'm wrong, the user can quickly get to the right content.
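As a rough illustration of "guess, but let the user override", here is a small sketch. The only signal it uses is the OpenGraph og:type value, which is a stand-in for the real guessing logic (not described here), and the choose_view name and the "article"/"links" labels are invented for this example.

```python
def choose_view(metadata, override=None):
    """Pick "article" or "links" for a page.

    metadata is a dict of social meta tags, e.g. {"og:type": "article"}.
    override is the user's explicit choice, if they toggled the view.
    """
    # The user's choice always wins, so a wrong guess is one click away
    # from being fixed.
    if override in ("article", "links"):
        return override
    # Assumption: a page that declares itself an article probably is one.
    if metadata.get("og:type", "").strip().lower() == "article":
        return "article"
    # Front pages and section pages fall through to the link view.
    return "links"


if __name__ == "__main__":
    print(choose_view({"og:type": "article"}))
    print(choose_view({"og:type": "article"}, override="links"))
```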
Sure. You can access it via Gemini or HTTP
Because waffles are delicious.