Tricks to Scaling Distributed Social Networks

As we try to move away from centralised services we run into problems of scaling message distribution. These problems are not new; RSS/Atom, Pingback, OStatus and pump.io were all partial solutions to them, and we can see how they preceded and evolved into ActivityPub (RSS/Atom is still around, and has a role to play outside of social networks too).

When disseminating messages to followers, the two basic schemes are Pull and Push.

Pull

Let's say I have a microblog (twtxt format, maybe?) and a bunch of people actually -- for some inexplicable reason -- want to read what I have to write. I publish a file on my server; in the twtxt case it's a plain text file, in more common cases it might be an Atom feed. Every now and then my followers fetch this file to see if anything new has happened. There's a delay between me publishing something and my followers receiving it, depending on how often they check said file. Unless I'm a very prolific writer, most of these fetches will yield a zero result: nothing new since the last fetch.
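To make that concrete, here is a minimal sketch of the follower side of the pull scheme for a twtxt feed. The URL and state file are made up for illustration; a real client would handle errors and timestamp quirks more carefully.

```python
# Pull scheme, follower side: fetch the whole feed, keep only what's new.
# FEED_URL and STATE_FILE are hypothetical.
import urllib.request

FEED_URL = "https://example.com/twtxt.txt"
STATE_FILE = "last_seen.txt"  # remembers the newest timestamp we've read

def fetch_new_posts():
    with urllib.request.urlopen(FEED_URL) as resp:
        lines = resp.read().decode("utf-8").splitlines()
    try:
        with open(STATE_FILE) as f:
            last_seen = f.read().strip()
    except FileNotFoundError:
        last_seen = ""
    # twtxt lines are "RFC3339-timestamp<TAB>text", so comparing the
    # timestamp prefix lexicographically finds the new entries.
    posts = [l for l in lines if l and not l.startswith("#")]
    new = [l for l in posts if l.split("\t")[0] > last_seen]
    if new:
        with open(STATE_FILE, "w") as f:
            f.write(new[-1].split("\t")[0])
    return new  # usually empty: the zero-result fetch
```

Run from a timer, that's the whole reader side; the delay complained about below is exactly that timer's interval.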

In the modern social networking world, the delay between publishing something and it reaching my followers is a lifetime, or at least a gigazillion news cycles. By the time it arrives, it's already old news.

This scales really well from the perspective of my followers. They can set how often they want to check for updates, and they can follow as many people as they want and tune their settings so as to not be overwhelmed.

From my perspective it doesn't scale at all. If I were to have 500,000 followers, my server would be absolutely hammered by requests, most of which are completely useless because I don't post that often. In this case I, as the publisher, am greatly helped if all my 500,000 followers are users on one server, because from my perspective the load is then similar to having just one follower.
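A standard partial mitigation on the HTTP side is the conditional request: it doesn't reduce the number of fetches, but it makes each zero-result fetch nearly free for both ends. A sketch, with a made-up URL:

```python
# Conditional fetch: send the ETag from last time; the server replies
# "304 Not Modified" with an empty body if nothing changed.
import urllib.error
import urllib.request

def fetch_if_changed(url, etag=None):
    req = urllib.request.Request(url)
    if etag:
        req.add_header("If-None-Match", etag)
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.read(), resp.headers.get("ETag")
    except urllib.error.HTTPError as e:
        if e.code == 304:  # zero-result fetch, almost no bytes moved
            return None, etag
        raise
```

Gemini has no equivalent of conditional requests, which matters for the aggregator discussion further down.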

Push

If I want to reach my followers more quickly I might opt for a push scheme. They tell me they want to follow me and then leave me alone. And when I publish something, it's sent to all of my followers in some fashion. No zero-result fetches hammering my server, and minimal delay between publishing and reaching my audience. Great!
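In its naive form, push is just a loop over follower inboxes. The payload shape and inbox URLs here are simplified stand-ins, not real ActivityPub objects:

```python
# Naive push, publisher side: one POST per follower inbox.
# A real ActivityPub server would also sign each request.
import json
import urllib.request

def push_to_followers(post, inbox_urls):
    body = json.dumps(post).encode("utf-8")
    for inbox in inbox_urls:
        req = urllib.request.Request(
            inbox,
            data=body,
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        urllib.request.urlopen(req)  # one request per follower, every time
```

Note that the loop doesn't care where the inboxes live; if most of them are on the same host, that host receives almost the whole burst.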

Except that this scales really badly if most of my followers are on the same server. This happened early on in the development of ActivityPub. One specific user had 500,000 followers, most of them spread across a handful of instances. Each time this user posted a toot on Mastodon, their server essentially subjected the biggest Mastodon instances out there to a Denial-of-Service attack. Oops.

This led to the sharedInbox solution that was shoehorned into the ActivityPub spec at the last minute, and arguably doesn't really fit there at all. A better approach, which I've seen discussed in the right circles as a possible way forward, would be for the publisher's server to post only once to each receiving server, explicitly listing all recipients on that server in the To: field. We'll see if that gains traction.
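A sketch of that grouping idea, assuming recipients are identified by https URLs and each server exposes a single inbox endpoint; the endpoint path and field name are illustrative, not from any spec:

```python
# Grouped push: one POST per receiving server, with every local
# recipient listed explicitly in "to".
import json
import urllib.request
from collections import defaultdict
from urllib.parse import urlparse

def push_grouped(post, follower_ids):
    by_host = defaultdict(list)
    for actor in follower_ids:  # e.g. "https://example.org/users/alice"
        by_host[urlparse(actor).netloc].append(actor)
    for host, recipients in by_host.items():
        envelope = dict(post, to=recipients)
        req = urllib.request.Request(
            f"https://{host}/inbox",  # hypothetical per-server endpoint
            data=json.dumps(envelope).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        urllib.request.urlopen(req)  # one request per server, not per follower
```

In the 500,000-followers-on-a-handful-of-instances case above, this turns hundreds of thousands of requests into a handful.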

Why Am I Thinking of This?

In my post about possibly making gemini social[1] I didn't really talk about this at all, even though it's been on my mind. I've also done some thinking about it with regards to feeds and aggregators lately. On the first topic: my proposal for making geminispace social is a CGI script, deployed by anyone on any server that supports one. That type of thinking follows the Actor model that ActivityPub (largely) adheres to, but it allows no sharedInbox solution, because no server would have a single point of contact, or even a record of how many of its users made use of CGI scripts like my proposed one. As such it would be vulnerable to the "half a million followers on one server" problem in the push scheme.

[1] "Making Geminispace Social?"

On the topic of aggregators, I think we all know that CAPCOM is no longer all of geminispace. This means that we as users and readers should set up our own way of following our favourites. I predict the number of feed readers and feed aggregators will increase. Most feed readers will probably fetch each feed a few times a day (4 to 12 times, maybe? I dunno). I don't publish more than once a day, which means that a majority of calls to my capsule index (which is a gemsub feed) or my atom.xml will yield a zero result. Those calls are for all intents and purposes wasted resources (small, but still wasted). On the other hand, if I post something the second after you fetched my feed and you won't fetch it again for 24 hours... that's a whole lotta news cycles right there.

As I'm writing this post I'm thinking about something I saw on the fediverse some time ago: "the trick to scaling is to not".

Many of us in geminispace want to get away from the constant distractions and notifications. Live slower and more intentionally. In that vein I'm planning to set up a feed aggregator for myself, and I'll set it to fetch feeds once every 24 hours. That way I may not feel compelled to check it several times a day for updates either, because I know there won't be any.
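A minimal version of that once-a-day aggregator could be a short script run from a daily cron job. The feed URLs are placeholders, and gemini:// feeds would need a gemini client library rather than urllib:

```python
# Once-a-day aggregator: run from cron, e.g.
#   0 6 * * * python3 aggregate.py
# Feed URLs are placeholders; gemini:// feeds need a gemini client.
import urllib.request

FEEDS = [
    "https://example.com/atom.xml",
    "https://example.org/twtxt.txt",
]

for url in FEEDS:
    with urllib.request.urlopen(url) as resp:
        payload = resp.read()
    # hypothetical storage: one file per feed, overwritten each day
    with open(url.split("//", 1)[1].replace("/", "_"), "wb") as f:
        f.write(payload)
```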

But do you know what I really want? I want to do a round every 24 hours to collect new gemlog posts, convert them to epub and upload them to my ebook reader in time for breakfast. I haven't quite figured that one out yet, though.
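For what it's worth, one possible shape for that pipeline, assuming the posts have already been fetched as Markdown or plain text and pandoc is installed; the paths and mount point are pure guesses:

```python
# Breakfast pipeline sketch: convert each fetched post to EPUB with
# pandoc, then copy it onto the (hypothetically mounted) ebook reader.
import subprocess
from pathlib import Path

POSTS = Path("fetched_posts")       # output of the aggregator above
READER = Path("/mnt/ebook-reader")  # hypothetical mount point

for post in POSTS.glob("*.md"):
    epub = post.with_suffix(".epub")
    subprocess.run(["pandoc", str(post), "-o", str(epub)], check=True)
    (READER / epub.name).write_bytes(epub.read_bytes())
```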

-- CC0 ew0k, 2021-01-15