Managing a fleet of NixOS Part 1 - Design choices

Author: Solène
Date: 02 September 2022
Tags: bento nixos nix

Introduction

I have a grand project in mind, and I need to think about it before starting any implementation. This blog is the right place for me to explain what I want to do and the different solutions.

It's related to NixOS. I would like to ease the management of a fleet of NixOS workstations that could be anywhere.

This could be useful for companies using NixOS for their employees, to manage all the workstations remotely, but also for people who may manage NixOS systems in various places (cloud, datacenter, house, family computers).

With this kind of central management, it makes sense for users not to have root access. They would call their technical support to ask for a change, and their system could be updated quickly to reflect the request. This can be super useful for remote family computers when they need an extra program that is not currently installed, and you took on the responsibility of handling their system...

With NixOS, this setup totally makes sense: you can potentially reproduce users' bugs as you have their configuration, stage new changes for testing, and users can roll back to a previous working state in case of a big regression.

Cachix made this possible before I figured out a solution. It's not too late to propose an open source alternative.

Cachix Deploy

Defining the project

The purpose of this project is to have a central management system on which you keep the configuration files for all your NixOS systems, and to allow the administrator to make the remote systems pick up a new configuration as soon as possible when required.

We can imagine three different implementations at the highest level:

a scheduled job on each machine looks for changes in the source. The source could be a git repository, a tarball or anything that could be used to carry the configuration.
NixOS systems connect to something like a pub/sub and wait for an event from the central management to trigger a rebuild. The event may or may not contain information / sources.
the central management system connects to the remote NixOS systems to trigger the build / push the build

These designs all have pros and cons. Let's look at them in more detail.

Solution 1 - Scheduled job

In this scenario, the NixOS system would use a cron job or a systemd timer to periodically check for changes and trigger the update.

Pros

low maintenance
could interactively ask the user when they want to upgrade if not now

Cons

may not run at all if the system is not up at the scheduled time, or could run late depending on the situation
can't force an update as soon as possible
not really bandwidth effective if you often poll
no feedback to the central management about which systems received/applied the update (except by adding a call to the server?)
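
As a rough illustration of the scheduled job, here is a minimal sketch in Python. It is not an actual implementation: the function names, the state file and the way the source is fetched are all assumptions made for the example. The job fetches the configuration, compares its hash with the last one applied, and only triggers a rebuild when it changed.

```python
import hashlib
from pathlib import Path
from typing import Callable


def fingerprint(data: bytes) -> str:
    """Return a stable identifier for a configuration payload."""
    return hashlib.sha256(data).hexdigest()


def check_for_update(
    fetch: Callable[[], bytes],
    state_file: Path,
    apply: Callable[[bytes], None],
) -> bool:
    """Run one polling cycle; return True if an update was applied.

    fetch      -- hypothetical callable returning the current configuration
                  (git checkout, tarball download, ...)
    state_file -- hypothetical file remembering the last applied fingerprint
    apply      -- hypothetical callable doing the rebuild, for instance by
                  running `nixos-rebuild switch`
    """
    data = fetch()
    new = fingerprint(data)
    old = state_file.read_text().strip() if state_file.exists() else ""
    if new == old:
        return False  # nothing changed since the last run
    apply(data)
    state_file.write_text(new)  # only recorded once the rebuild succeeded
    return True
```

A cron job or systemd timer would call check_for_update() at each tick. Note that the whole source is fetched on every run, which is the bandwidth issue listed in the cons above.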

Solution 2 - Remote systems are listening for changes (publisher / subscriber)

In this scenario, the NixOS system would always be connected to the central management, using some kind of protocol like MQTT, BOSH or similar.

Pros

you know which systems are up
events from the central management are delivered instantly, and the server can wait for an acknowledgment
updates should propagate very quickly
could interactively ask the user when they want to upgrade if not now

Cons

this can lead to privacy issues as you know when each host is connected
this adds complexity to the server
this adds complexity on each client
firewalls usually don't like long-lived connections; an HTTPS-based solution would help get through them
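
To make the publisher / subscriber idea more concrete, here is a small in-memory sketch in Python. It does not use MQTT or any real broker: the Broker class and its methods are invented for the illustration. It shows the two properties discussed above: the server knows which hosts are connected, and each published event gets an acknowledgment per host.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# A handler receives a revision identifier and returns True to acknowledge it.
Handler = Callable[[str], bool]


@dataclass
class Broker:
    """Hypothetical in-process stand-in for a real pub/sub server."""

    subscribers: Dict[str, Handler] = field(default_factory=dict)

    def subscribe(self, host: str, on_event: Handler) -> None:
        """Register a connected host and its event handler."""
        self.subscribers[host] = on_event

    def unsubscribe(self, host: str) -> None:
        """Forget a host that disconnected."""
        self.subscribers.pop(host, None)

    def online(self) -> List[str]:
        """Hosts currently connected, which is also the privacy concern."""
        return sorted(self.subscribers)

    def publish(self, revision: str) -> Dict[str, bool]:
        """Send an update event to every host and collect acknowledgments."""
        return {host: handler(revision) for host, handler in self.subscribers.items()}
```

In a real deployment, the handler on each client would trigger the rebuild, and the broker would be a network service, which is where the extra complexity on both sides comes from.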

Solution 3 - The central management pushes the updates to the remote systems

In this scenario, the NixOS system would be reachable over a protocol that allows running commands, such as SSH. The central management system would run a remote upgrade on it, or push the changes using tools like deploy-rs, colmena, morph or similar...

Awesome-nix list: deployment-tools

Pros

update is immediate
SSH could be exposed over Tor or I2P for maximum firewall bypassing capability

Cons

offline systems may be complicated to update, you would need to try to connect to them often until they are reachable
you can connect to the remote machine and potentially spy on the user. In the alternatives above, you can potentially achieve the same by reconfiguring the computer to allow this, but it would have to be done on purpose
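
The push workflow could look like the following Python sketch. The host names, the root login and the helper functions are assumptions made for the example. It relies on the --target-host option of nixos-rebuild to deploy over SSH, and it returns the hosts that could not be reached so they can be retried later, which is the main drawback listed above.

```python
import subprocess
from typing import Callable, Iterable, List


def rebuild_command(host: str) -> List[str]:
    """Build the command deploying to one host over SSH.

    Logging in as root is an assumption; a real setup would also need
    to select the right configuration for each host.
    """
    return ["nixos-rebuild", "switch", "--target-host", f"root@{host}"]


def run_command(cmd: List[str]) -> bool:
    """Run a command and report whether it succeeded."""
    return subprocess.run(cmd).returncode == 0


def push_updates(
    hosts: Iterable[str],
    run: Callable[[List[str]], bool] = run_command,
) -> List[str]:
    """Try to update every host; return the ones that still need a retry."""
    pending = []
    for host in hosts:
        if not run(rebuild_command(host)):
            pending.append(host)  # offline or failed, try again later
    return pending
```

The central management would call push_updates() again with the returned list until it is empty.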

Making a choice

I tried to state the pros and cons of each setup, but I can't see a clear winner. However, I'm not convinced by Solution 1 as you don't have any feedback from or direct control over the systems, so I prefer to abandon it.

Solutions 2 and 3 are still in the competition. We basically ended up with a choice between a PUSH and a PULL workflow.

Conclusion

In order to choose between 2 and 3, I will need to experiment with the Solution 2 technologies as I have never used them (MQTT, RabbitMQ, BOSH etc…).