Managing a fleet of NixOS Part 2 - A KISS design

Author: Solène
Date: 03 September 2022
Tags: bento nixos nix

Introduction

Let's continue my series about designing a fleet management system for NixOS.

Yesterday, I figured out 3 solutions:

1. periodic data checkout

2. pub/sub - event driven

3. push from central management to workstations

I retained only solutions 2 and 3 because they were the only ones providing instantaneous updates. However, I didn't want to give up on solution 1, which is the KISS one, and I realized we could have a hybrid setup.

In my opinion, the best we can create is a hybrid setup of 1 and 3.

A new solution

In this setup, all workstations will connect periodically to the central server to look for changes, and then trigger a rebuild. This simple mechanism can be greatly extended per-host to fit all our needs:

periodicity can be configured per-host
the rebuild service can be triggered manually by the user, by clicking a button on their computer
the rebuild service can be triggered manually by a remote sysadmin who has access to the system (using a VPN); this partially implements solution 3
the central server can act as a binary cache if configured per-host: it can build each configuration beforehand to avoid rebuilding on the workstations, which is one of Cachix Deploy's selling points
using ssh multiplexing, remote checks for the repository can reuse an existing connection, which reduces bandwidth usage
a log of the update can be sent to the central server, which the workstations reach over SFTP
the SFTP server can be used to check connectivity and to roll back to the previous state if the workstation can't reach it anymore (like "magic rollback" with deploy-rs)
the SFTP server is a de facto target for backups of the workstation using restic or duplicity

The mechanism is so simple, it could be adapted to many cases, like using GitHub or any data source instead of a central server. I will personally use this with my laptop as a central system to manage remote servers, which is funny as my goal is to use a server to manage workstations :-)
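
As a rough sketch of the "look for changes" step, here is how a workstation could decide whether a rebuild is needed. I assume the configuration has already been fetched from the central server into a local directory; the function names and the state file are made up for illustration, this is not the final implementation.

```python
import hashlib
from pathlib import Path


def tree_fingerprint(root: Path) -> str:
    """Return a SHA-256 digest covering every file path and content under root."""
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        # Hash the relative path too, so renaming a file counts as a change.
        digest.update(str(path.relative_to(root)).encode())
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()


def needs_rebuild(config_dir: Path, state_file: Path) -> bool:
    """Compare the fetched configuration with the one applied last time."""
    previous = state_file.read_text().strip() if state_file.exists() else ""
    return tree_fingerprint(config_dir) != previous


def record_rebuild(config_dir: Path, state_file: Path) -> None:
    """Remember the fingerprint of the configuration that was just applied."""
    state_file.write_text(tree_fingerprint(config_dir) + "\n")
```

The periodic job would call needs_rebuild() after each fetch, run nixos-rebuild only when it returns True, and call record_rebuild() once the rebuild succeeded. A manual trigger can simply skip the check.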

File access design

One important issue I didn't address in the previous article is how to distribute the configuration files:

each workstation should be restricted to its own configuration only
how to send secrets: we don't want them in the nix-store
should we use flakes or not? It's better to have the choice
the sysadmin on the central server should manage everything in a single git repository and be able to use common configuration files across the hosts

Addressing each of these requirements is hard, but in the end I've been able to design a solution that is simple and flexible:

Design pattern for managing users

The workflow is the following:

the sysadmin writes configuration files for each workstation in a dedicated directory
the sysadmin creates a symlink to a directory of common modules in each workstation directory
after a change, the sysadmin runs a program that copies each workstation configuration into its own directory in a chroot; symlinks have to be resolved during the copy
OPTIONAL: we can dry-build each host configuration to check if they work
OPTIONAL: we can build each host configuration to provide them as a binary cache
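
The copy step of this workflow could look like the sketch below. The directory layout (one sub-directory per host, each published under a chroot root) and the function name are assumptions made for the example, the real program may be organized differently.

```python
import shutil
from pathlib import Path


def publish_hosts(hosts_dir: Path, chroot_root: Path) -> list[str]:
    """Copy every host directory into chroot_root/<host>/, resolving symlinks."""
    published = []
    for host_dir in sorted(p for p in hosts_dir.iterdir() if p.is_dir()):
        target = chroot_root / host_dir.name
        # Start from a clean directory so files deleted in the repository
        # also disappear from the published copy.
        if target.exists():
            shutil.rmtree(target)
        # symlinks=False copies what a symlink points to instead of the link
        # itself, so the shared modules become regular files in each chroot.
        shutil.copytree(host_dir, target, symlinks=False)
        published.append(host_dir.name)
    return published
```

Resolving the symlinks matters because a chrooted client can't follow a link pointing outside of its own directory. The optional dry-build or build steps would run after this copy, once per published host.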

The directory holding a workstation's configuration is likely to contain a flake.nix file (which can be a symlink to something generic), a configuration file, a directory with a hierarchy of files to copy as-is into the system (for things like secrets or configuration files not managed by NixOS), and a symlink to a directory of nix files shared by all hosts.

Each NixOS client will connect to its dedicated user over ssh using its private key. This keeps the clients separated on the host system and restricts what each of them can access, thanks to the SFTP chroot feature.

A diagram of a real world case with 3 users would look like this:

Real world example with 3 users

Work required for the implementation

The setup is very easy and requires only a few components:

a program that translates the configuration repository into separate directories in the chroot
some NixOS configuration to create the SFTP chroots: we just need a nix file with a list of pairs containing "hostname" and "ssh-public-key" for each remote host, which will automate the creation of the ssh configuration file
a script on the user side that connects, looks for changes, and runs nixos-rebuild if something changed; maybe rclone could be used to "sync" over SFTP efficiently
a systemd timer for the user script
a systemd socket triggering the user script, so people can just open http://localhost:9999 to trigger the socket and force the update; a bookmark named "UPDATE MY MACHINE" can be created on the user system
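
To show how little data the SFTP chroot part needs, here is a sketch that turns the list of "hostname" and "ssh-public-key" pairs into sshd settings. In the real setup this would be a NixOS module; the Python below is only a stand-in to illustrate the data flow. The function names and the /home/chroot path are made up, and I assume each host gets a system user named after it.

```python
from pathlib import PurePosixPath


def render_sftp_chroots(hosts: list[tuple[str, str]], chroot_root: str) -> str:
    """Render one sshd_config Match block per (hostname, public key) pair."""
    blocks = []
    for hostname, _public_key in hosts:
        chroot = PurePosixPath(chroot_root) / hostname
        blocks.append(
            f"Match User {hostname}\n"
            f"    ChrootDirectory {chroot}\n"
            "    ForceCommand internal-sftp\n"
            "    AllowTcpForwarding no\n"
            "    X11Forwarding no\n"
        )
    return "\n".join(blocks)


def authorized_keys_by_user(hosts: list[tuple[str, str]]) -> dict[str, str]:
    """Map each dedicated user to the only public key allowed to log in."""
    return {hostname: public_key for hostname, public_key in hosts}
```

Adding a workstation to the fleet then means adding a single pair to the list, everything else is generated.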

Conclusion

I absolutely love this design: it's simple, and each piece can easily be replaced to fit one's needs. Now, I need to start writing all the bits to make it real, and offer it to the world 🎉.

I'm aware there is a NixOS module named autoUpgrade. While it's absolutely perfect for the average user workstation or server, it's not practical for managing a fleet of NixOS systems efficiently.