Date: 2020-06-09
State: Done
Released a portal to NPR Text News at:
Source code available at:
I had been wanting to consume news through Gemini for some time now, and thought that NPR's text site was a good target because the HTML it returned was already fairly lean. The scraper itself is a CGI app, which makes upstream HTTP requests to NPR URLs and converts the response into text. I used Python for this scraper because I felt the library ecosystem was just that much better with Python over Tcl, and my familiarity with the language is a lot higher, since I've written a lot of Python (among other languages) in my job.
I've been writing a few scrapers recently (I've been trying to write Gopher scrapers so I can consume sites I regularly browse through Gopher on my local network and avoid incurring network hops), so I have some experience writing scrapers. One of the best tricks is to see if you can hook into aspects of the upstream data model. In this case, the NPR site relies on story IDs (and topic IDs, but I ended up not using those for anything) to uniquely address stories. This means that I can pass around story IDs internally and use those to fetch stories. In fact, my story CGI script takes an NPR story ID as input and returns a text rendered version as a response.
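As a rough sketch of the "story ID in, text out" shape (not the actual source), a Gemini CGI script can read the ID from the query string and answer with a Gemini status line. The numeric-ID pattern, the upstream URL shape, and the function names here are all assumptions made for illustration, and the fetch-and-convert step is left out.

```python
import os
import re
import sys

# Assumption: story IDs are plain ASCII digits; the real scraper may differ.
STORY_ID = re.compile(r"[0-9]+")


def story_url(story_id: str) -> str:
    """Build an upstream URL for a story ID (hypothetical URL shape)."""
    if not STORY_ID.fullmatch(story_id):
        raise ValueError(f"not a story ID: {story_id!r}")
    return f"https://text.npr.org/{story_id}"


def respond(query: str) -> str:
    """Return a complete Gemini response (header plus body) for a query."""
    if not query:
        # Status 10 asks the client to prompt the user for input.
        return "10 Enter an NPR story ID\r\n"
    try:
        url = story_url(query)
    except ValueError:
        return "59 Bad request\r\n"
    # A real script would fetch `url` and convert the HTML to text here.
    return f"20 text/gemini\r\n=> {url} Upstream story\n"


if __name__ == "__main__":
    # Assumes the Gemini server passes the query via QUERY_STRING, as in CGI.
    sys.stdout.write(respond(os.environ.get("QUERY_STRING", "")))
```

Validating the ID before building the URL also keeps arbitrary user input out of the upstream request.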
The codebase that we run at work is on Python 3.6 (there shouldn't be any major blockers moving it to Python 3.7), so I don't often get to play with the newest Python features. I am normally fairly lukewarm on the language. Python itself is a fun language, but every library seems to like to create its own fluent API, which means that as a developer, I need to learn an entirely new set of abstractions every time I learn a new library. I also hate Python's requests library with a passion (despite having used it here), because it's one of those libraries that is nice to use for the easy case, but feels like pulling teeth when you need to do something complicated with it. I'll save the full rant on Python for a future post, because this time I got to play around with some newer Python features, including optional typing and dataclasses.
When I had used mypy and its optional typing earlier, the project was still young, so a lot of stubs had to be written by hand to get guarantees out of anything. The feature has come a long way since then, and sure enough, as type systems tend to do, it helped me catch a lot of bugs without ever running my code. Dataclasses were my sleeper favorite, as they import a pattern from functional languages that I really enjoy and mimic a pattern I often use in my own code by subclassing NamedTuple. I only discovered dataclasses partway through writing the scraper, so I'll be converting some objects to them later on.
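To make the comparison concrete, here is a small sketch of the same record written both ways. The `Story` fields and the `to_gemtext` method are hypothetical and are not taken from the scraper's source.

```python
from dataclasses import dataclass, field
from typing import List, NamedTuple


class StoryTuple(NamedTuple):
    """The older pattern: an immutable record built by subclassing NamedTuple."""
    story_id: str
    title: str


@dataclass(frozen=True)
class Story:
    """The dataclass version: typed fields, defaults, and ordinary methods."""
    story_id: str
    title: str
    paragraphs: List[str] = field(default_factory=list)

    def to_gemtext(self) -> str:
        # Render a heading followed by blank-line separated paragraphs.
        body = "\n\n".join(self.paragraphs)
        return f"# {self.title}\n\n{body}\n"
```

Both forms carry annotations that mypy can check. The NamedTuple is still a tuple (it unpacks and compares equal to a plain tuple), while the dataclass is an ordinary class.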
Scraping always makes me feel a little "scummy", as I'm consuming a resource I feel I may not have been meant to consume, so I want to make sure I am as courteous about it as possible. I still need to add a rate limit to my app to make sure I don't make too many upstream requests, but I did put caching in place. This way, if any of the folks behind the 30-50 unique non-CGI requests I get per day decide to try and overwhelm my little box (please don't!) or saturate NPR through me, they'll just be receiving cached data.
I decided a while back that if I were to deploy scrapers on this box, I would cache their data in Redis. For CGI this poses a problem, because a new process is spawned on every request, and opening a TCP connection to Redis each time becomes quite expensive and dominates the resource consumption and time taken when spawning. Because Redis will only ever live on the box itself, I decided to talk to it over a Unix socket instead. This means that, while I still need to "connect" to this socket, the connection is about as expensive as opening a file (and maybe even less expensive).
On any remote HTTP call, I first attempt to grab the response from the cache. If that fails, I make an upstream HTTP request, process it as necessary, and store the result in a buffer (a Python StringIO, to be exact). The scraper then grabs a lock (it tries to set the value of a lock key, which fails if the key already exists), writes the response into a key, and deletes the lock key. Redis string values are limited to 512 MB, so I'm fairly confident that every response will fit in a single value.
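The cache-then-lock flow described above might look roughly like the following. This is a sketch under assumptions rather than the scraper's actual code: the socket path, key names, and TTL values are made up, and `client` is anything with redis-py style `get`, `set`, and `delete` methods.

```python
from typing import Callable

CACHE_TTL = 15 * 60  # seconds a cached response lives (assumed value)
LOCK_TTL = 30        # seconds before a stale lock expires (assumed value)


def connect(socket_path: str = "/var/run/redis/redis.sock"):
    """Connect to a local Redis over a Unix socket using redis-py."""
    import redis  # imported lazily so the module loads without redis-py
    return redis.Redis(unix_socket_path=socket_path, decode_responses=True)


def cached_fetch(client, key: str, fetch: Callable[[], str]) -> str:
    """Return the cached value for `key`, calling `fetch` only on a miss."""
    cached = client.get(key)
    if cached is not None:
        return cached
    body = fetch()
    lock = f"lock:{key}"
    # SET with nx=True only succeeds if the lock key does not exist yet.
    if client.set(lock, "1", nx=True, ex=LOCK_TTL):
        try:
            client.set(key, body, ex=CACHE_TTL)
        finally:
            client.delete(lock)
    return body
```

Giving the lock its own expiry is a design choice in this sketch: if a CGI process dies between taking the lock and deleting it, the lock clears on its own rather than blocking writes forever.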
I tried to write the scraper with Redis as a purely optional dependency; I use config to determine whether or not I will actually import or use redis. While in other situations I would reach for a TOML parser and parse the config from a TOML file, I realized for a technical audience that is at home making changes to DWM (that is, if there is even an audience for this scraper besides me and this capsule), keeping config inside a Python file should be fine. If more configuration options are added down the road, I will move the config into its own Python file, but I do not think it is worth the complexity to bring in a TOML parser in order to parse an external config file.
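One way to keep Redis optional with config living in a Python file is to guard the import behind a flag. The constant names and the socket path below are hypothetical; this only sketches the pattern.

```python
# Config kept as plain module-level constants (names are hypothetical).
USE_REDIS = False
REDIS_SOCKET = "/tmp/redis.sock"


def get_cache():
    """Return a Redis client if enabled and importable, otherwise None."""
    if not USE_REDIS:
        return None
    try:
        import redis  # only imported when the config asks for it
    except ImportError:
        return None
    return redis.Redis(unix_socket_path=REDIS_SOCKET)
```

Callers can then treat `None` as "no cache" and go straight to the upstream request.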
There are a bunch of small things I'd like to fix from an ergonomics perspective, especially around reading from the cache and handling hits and misses. I also need to write a better README so folks can understand which parts of the scraper do what. Right now I'm also doing some hand parsing with an Enum-based state machine, but I want to explore using PEGs to parse NPR stories instead of my bespoke parsing solution. Hopefully that can decrease my maintenance burden while also making the code faster.
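For a sense of what an Enum-based state machine over HTML can look like, here is a minimal sketch built on the standard library's `HTMLParser`. The states and the tags it watches (`h1` and `p`) are assumptions for illustration, and the scraper's real parser may work quite differently.

```python
from enum import Enum, auto
from html.parser import HTMLParser
from typing import List


class State(Enum):
    OUTSIDE = auto()
    IN_TITLE = auto()
    IN_PARAGRAPH = auto()


class StoryParser(HTMLParser):
    """Collect the title and paragraphs of a page as gemtext lines."""

    def __init__(self) -> None:
        super().__init__()
        self.state = State.OUTSIDE
        self.lines: List[str] = []
        self._buf: List[str] = []

    def handle_starttag(self, tag, attrs):
        # Only enter a new state from OUTSIDE; nested tags are ignored.
        if self.state is State.OUTSIDE:
            if tag == "h1":
                self.state = State.IN_TITLE
            elif tag == "p":
                self.state = State.IN_PARAGRAPH

    def handle_data(self, data):
        if self.state is not State.OUTSIDE:
            self._buf.append(data)

    def handle_endtag(self, tag):
        # Collapse runs of whitespace in the buffered text.
        text = " ".join("".join(self._buf).split())
        if self.state is State.IN_TITLE and tag == "h1":
            self.lines.append(f"# {text}")
        elif self.state is State.IN_PARAGRAPH and tag == "p":
            self.lines.append(text)
        else:
            return  # closing tag of a nested or ignored element
        self._buf.clear()
        self.state = State.OUTSIDE
```

Feeding it a page with `parser.feed(html)` leaves the rendered lines in `parser.lines`. Every new element the output should handle means another state and more branches, which is the kind of maintenance burden a grammar-based parser might reduce.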
Big news! Gemini space has a new spec now! I'm a big fan of the "11" status code for sensitive input, and will start using it for my weather CGI app once either Elpher or Darwaza supports it (betting on Elpher since I am the dev for Darwaza 😅). I'll be starting on a scraper/interface for Lobsters pretty soon now, so I'm a fan of the quoting characters added into the spec. I'm not sure I really agree with the stipulation that clients should not make network requests on behalf of a user. At this point in time, I feel like this is "overly moralizing", though I understand the concerns that went into this recommendation (and have been roughly following mailing list chatter). I don't, personally, intend for any clients I write to make network requests without asking the user or at least offering the user an opt-in setting, but I do like the flexibility to change the experience *if the user so desires*. Regardless, there's been a lot of ongoing discussion about these standards, and I feel a bit guilty sometimes that I'm not offering more input, despite writing some software for Gemini space. I console myself by saying that Gemini needs content as much as it needs standards, so that's what I'll work on for now.
Before writing any more apps, I need to make a better deployment strategy for my capsule other than just editing files over Emacs Tramp/SSH. I'm making backups of my content, but I really would like to have a staging/production workflow so I can stage a change to my capsule, test it out, then push it to production. As such, I may be exploring the Fossil SCM tool. Fossil has been on my list for a while now, and this might be my first chance.