2023-04-16 - CAPCOM changes

This month I am working on some substantial changes to the way the public CAPCOM instance hosted at gemini.circumlunar.space works. The "big picture" whereby each month CAPCOM selects 100 random feeds from all those it knows about to aggregate for that month will not change. Therefore most of the changes I'm about to describe will not become visible until May. However, there are some details worth knowing in advance because some people can act upon them now. In approximate order of excitingness:

CAPCOM will soon poll feeds three times daily, every eight hours (as opposed to four times daily, every six hours).

CAPCOM will now follow redirects and in particular will recognise permanent redirects (status code 31) and update its records accordingly so that the new URL is used directly for future requests.
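
The distinction can be sketched in a few lines. This is a hypothetical illustration, not CAPCOM's actual code: the function name and the dict used as a stand-in for the feed records are my own inventions.

```python
# Sketch of permanent vs. temporary redirect handling in Gemini.
# Status 30 is a temporary redirect, 31 a permanent one; only the
# latter should cause the stored URL to be rewritten.

def resolve_redirect(feed_url: str, status: str, meta: str, records: dict) -> str:
    """Return the URL to fetch next, updating the stored records on a
    permanent redirect so that future requests go direct."""
    if status == "31":
        # Permanent redirect: remember the new location for good.
        records[feed_url] = meta
        return meta
    if status == "30":
        # Temporary redirect: follow it this time, keep the old URL.
        return meta
    return feed_url

records = {}
new_url = resolve_redirect("gemini://old.example/feed.xml", "31",
                           "gemini://new.example/feed.xml", records)
```

After this call, `records` maps the old URL to the new one, so the next polling cycle skips the redirect entirely.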

The underlying CAPCOM software has been updated to support not only Atom feeds but also Gemini pages which are subscribable as per the companion spec. If the hassle of setting up an Atom feed has been keeping you off CAPCOM all these years, but you have a "gemsubbable" Gemlog, you can now submit your URL (right now, today, if you like!) and starting from May it will be eligible for inclusion in the 100 feeds.
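
For anyone unsure whether their index page is "gemsubbable": per the companion spec, link lines whose label begins with an ISO 8601 date are treated as feed entries. A minimal parser for such pages might look like this (an illustrative sketch, not CAPCOM's implementation):

```python
import re

# Per the "Subscribing to Gemini pages" companion spec, a link line of
# the form "=> URL YYYY-MM-DD Title" counts as a feed entry; any
# " - " separator after the date is optional.
ENTRY_RE = re.compile(r"^=>\s*(\S+)\s+(\d{4}-\d{2}-\d{2})\s*-?\s*(.*)$")

def parse_entries(gemtext: str):
    """Return (date, url, title) tuples for each entry link line."""
    entries = []
    for line in gemtext.splitlines():
        m = ENTRY_RE.match(line)
        if m:
            entries.append((m.group(2), m.group(1), m.group(3)))
    return entries

page = """# My Gemlog
=> posts/one.gmi 2023-04-16 - CAPCOM changes
=> posts/zero.gmi 2023-03-01 An older post
=> about.gmi About me
"""
entries = parse_entries(page)
```

Only the two dated links are picked up; the undated "About me" link is ignored, exactly as a subscribing client should.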

Most importantly, CAPCOM will become a lot smarter about "self-tending" the set of feeds it samples from. Instead of a simple flat text file full of URLs, CAPCOM now stores the feeds it knows about in a small SQLite database. Each feed is at any given point in time considered either "active" or "inactive". The 100 feeds for each month are randomly sampled from only the active subset. Feeds will be automatically moved back and forth between the two categories as follows.
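
The move from a flat text file to a database can be sketched roughly as follows. The table and column names here are guesses chosen for illustration, not CAPCOM's actual schema.

```python
import sqlite3

# Hypothetical feed database: one row per known feed, with an
# active/inactive flag and a running strike count.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE feeds (
    url     TEXT PRIMARY KEY,
    active  INTEGER NOT NULL DEFAULT 1,  -- 1 = active, 0 = inactive
    strikes INTEGER NOT NULL DEFAULT 0   -- consecutive failed fetches
)""")
db.executemany("INSERT INTO feeds (url, active) VALUES (?, ?)",
               [("gemini://a.example/feed.xml", 1),
                ("gemini://b.example/", 1),
                ("gemini://dead.example/feed.xml", 0)])

# Each month's 100 feeds are sampled at random from the active subset only.
sample = db.execute(
    "SELECT url FROM feeds WHERE active = 1 ORDER BY RANDOM() LIMIT 100"
).fetchall()
```

The inactive feed never appears in the sample, so dead URLs no longer dilute the monthly selection.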

Before each month's sampling, a maintenance script attempts to visit all active feeds. If the script cannot successfully load a feed for any reason - e.g. it gets a response with an error status code or the connection times out or the DNS lookup of the hostname fails - that is recorded in the database as a "strike". Once an active feed has received three strikes, i.e. when a feed has not been accessible on the first day of three consecutive calendar months, it is reclassified as inactive. If an active feed can successfully be fetched but the most recent update has a timestamp which is more than six months old, it is also reclassified as inactive.
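
The monthly rules above for active feeds can be condensed into one function. This is an illustrative sketch using a plain dict for feed state, not CAPCOM's internals; I am also assuming the strike count resets after a successful fetch, since the strikes must be consecutive.

```python
# Monthly maintenance rules for an *active* feed:
# - any fetch failure is a strike; three consecutive strikes -> inactive
# - fetchable but no post in over six months -> inactive
MAX_STRIKES = 3
MAX_AGE_MONTHS = 6

def maintain_active_feed(feed, fetched_ok, months_since_last_post):
    """Apply one month's check to an active feed, mutating it in place."""
    if not fetched_ok:
        # Error status, timeout, DNS failure... all count the same.
        feed["strikes"] += 1
        if feed["strikes"] >= MAX_STRIKES:
            feed["active"] = False
    else:
        feed["strikes"] = 0  # assumed: a success wipes the slate clean
        if months_since_last_post is not None and \
           months_since_last_post > MAX_AGE_MONTHS:
            feed["active"] = False  # alive but stale: retire it
    return feed

feed = {"strikes": 0, "active": True}
for _ in range(3):  # unreachable on the first day of three months running
    maintain_active_feed(feed, fetched_ok=False, months_since_last_post=None)
```

After the loop, the feed has its three strikes and is out.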

Being reclassified as inactive is not a one way trip. The maintenance script also checks inactive feeds for signs of new life, but does so according to an exponential back-off schedule. Once a feed first becomes inactive, the maintenance script will check it again after one month. If it can be successfully downloaded and there is a post less than six months old, the feed will be "resurrected" to active status and once again eligible for its place amongst the monthly 100 feeds. If the inactive feed still cannot be fetched due to network errors or contains no fresh content, it will next be checked after two months. Then after four. Then eight months, sixteen months, thirty-two months, and so on. There is no upper limit; CAPCOM never truly forgets a feed, but practically speaking it will eventually schedule its next checkup for further into the future than the feed's author is likely to still be alive, which is basically giving up on that feed.
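
The back-off schedule is a straightforward doubling. A one-liner makes the shape of it concrete (the function name is my own; this is a sketch, not CAPCOM's code):

```python
# Exponential back-off for inactive feeds: check again after 1 month,
# then 2, 4, 8, 16, 32... months, with no upper limit.

def months_until_next_check(failed_checks: int) -> int:
    """Gap before the next check of an inactive feed, given how many
    checks have already found it still inactive since it was retired."""
    return 2 ** failed_checks

# First check one month after going inactive, doubling thereafter.
schedule = [months_until_next_check(n) for n in range(6)]
```

By the sixth failed check the next one is scheduled 32 months out, and the intervals only grow from there.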

I hope that in this way CAPCOM will start to respond to the natural ebb and flow of activity in Geminispace in an organic way without any manual attention from either me or from capsule authors. If a feed is permanently abandoned, CAPCOM won't beat its head against that bad URL forever (as it no doubt has been for some dead feeds in recent years), but the feed will be migrated to the inactive list and stay there, so that it will not decrease the odds of actively updated feeds being sampled each month. Eventually, neither time nor bytes will be wasted checking on that feed. If somebody stops updating their capsule for somewhere between six and twelve months for whatever reason but then returns, CAPCOM will notice, set that feed aside for a while to give others more of a chance at exposure, but will then start regularly checking in on the newly active feed again after a modest delay. Even a hiatus of between one and two years will be recovered from with a delay which, while longer than some might consider ideal, is not longer than the hiatus itself. It is not a big problem that CAPCOM might take several months to realise that a previously inactive feed has come back to life; even if it noticed the return immediately, it could easily still take a few months before the feed happened to get sampled as one of the monthly 100. Naturally, feeds which maintain a regular and consistent posting schedule for years on end will remain in the active list for years on end and are likely to get sampled many times. Such a regular feed can still drop off the network for a month or so due to hardware failure or network outages without any lasting consequence.

CAPCOM looks quiet these days because many of the feeds it knows about were created during the explosion of interest in Gemini in 2020 but have since disappeared or fallen silent. When it chooses 100 feeds at random, it often scoops up only a fairly small number of active ones. This situation should change in the coming months. When the new maintenance script runs for the first time on May 1st, all those silent feeds which are still online but have not been updated recently will be immediately flagged as inactive. Those which have gone offline will fail to be fetched in May, June and July, then it's three strikes and they will be out too. If, over those same coming months, people who have capsules with subscribable index pages but no Atom feeds submit their URLs for inclusion, then by the second half of 2023 we should see a much rejuvenated CAPCOM with more activity and variety than it has shown for a long time.

Beyond simply being another bit of Project Gemini housekeeping, these changes are also an attempt at leading by example. Does your Gemini software distinguish between temporary and permanent redirects, updating its knowledge of where to find things after being told that something has relocated for good? Or does it just keep following the redirect forever, increasing latency for your end users by making two requests when one would do, forcing capsule authors to keep redirects configured forever and ever? Does your Gemini search engine or archiver unthinkingly fetch every resource it knows about on a fixed schedule with no regard for whether the content has changed? Gemini's fanatical simplicity and stubborn minimalism with regard to what kind of information is and is not included in requests and responses puts it at something of a disadvantage when it comes to minimising unnecessary traffic. We can't make conditional requests and we can't ask when a resource was last updated. But we are not helpless and we are certainly not forced to make bad decisions. We can still write smart software which learns from experience and adapts to changes. This matters more for some projects than others, and for some applications frequent polling might really be necessary, but it would make me very happy if Gemini became known as a place where the default approach to large, long-lived, public services based on automated requests was to think carefully, tread lightly and plan for the long term.