At the end of 2023, I realized that I was running out of space on my home [[Ceph]] cluster and it was time to add another node to it. While I had space for one more 3.5" drive in one of my servers, I was feeling a little adventurous and decided to get a DeskPi Super6C[1], a [[Raspberry Pi]] CM4, a large NVMe drive, and try to create a new node that way.
1: https://deskpi.com/collections/deskpi-super6c
Well, over the following few months, a lot of mistakes were made that are worthy of a dedicated post. But when most of those problems were resolved, I encountered another series of “adventures” that led me to switch out my home's Ceph cluster for a [[SeaweedFS]] one.
Around the time I was working on the Pi setup, my [[NixOS]] flake was unable to build the `ceph` packages. Part of this was because I was working off unstable, so the few weeks of being unable to build meant I couldn't get Ceph working on the new hardware. I even tried compiling it myself, which takes about six hours on my laptop and longer on the Pi, since I had to build remotely on the Pi itself because I have yet to figure out how to get [[Colmena]] to build `aarch64` packages on my laptop.
Also, I was dreading setting up Ceph since I remember how many manual steps I had to do to get the OSDs working on my machines. While researching it, I was surprised to see my blog post on it[2] was on the wiki page[3], which is kind of cool and a nice egoboo.
2: /blog/2022/12/10/ceph-and-nixos/
3: https://nixos.wiki/wiki/Ceph
There was a PR[4] on GitHub for using the Ceph-provided OSD setup, which would hopefully have alleviated that. It looked promising, so I was watching it with interest because I was right at the point of needing it.
4: https://github.com/NixOS/nixpkgs/pull/281924
Sadly, that PR ended up being abandoned for a “better” approach. Given that it takes me six hours to build Ceph, I couldn't really help with that approach, which meant I was stuck waiting unless I was willing to dedicate a month or so to figuring it all out. Given that the last time I tried to do that, my PR was abandoned for a different reason, I was preparing to keep my Ceph pinned until the next release and just have my Raspberry Pi setup sit there idle.
I was also being impatient and there was something new to try out.
Then I noticed a little thing at the top of the NixOS wiki page for Ceph:
> Another distributed filesystem alternative you may evaluate is SeaweedFS.
I vaguely remember looking at it when I first set up my 22 TB Ceph cluster, but I'd been dreaming about having a Ceph cluster for so long that I dismissed it because I really wanted to do Ceph.
Now the need was less pressing, so I thought I would give it a try. If anything, I still had a running Ceph cluster and I could run them side-by-side.
A big difference I noticed is that SeaweedFS is a single executable that provides everything. You can run it as an all-in-one process, but the three big services can also be run independently: the master (which coordinates everything), the volume servers (where things are stored), and the filer (which makes it look like a filesystem).
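To get a feel for those pieces before wiring them into NixOS, here is a rough sketch of running them by hand. The paths and hosts are just placeholders, and if I have a flag slightly wrong, `weed -h` and `weed <command> -h` list the real ones.

```
# All-in-one process, good for a quick first look.
weed server -dir=/tmp/seaweedfs

# Or run the three services separately, which is what the NixOS setup below does.
weed master -mdir=/var/lib/seaweedfs/master
weed volume -dir=/mnt/fs-001 -mserver=localhost:9333 -port=9334
weed filer -master=localhost:9333
```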
Also, Ceph likes to work at the block level whereas SeaweedFS wants to be pointed to plain directories. So the plan was to take the 1 TB drive for my Raspberry Pi and turn it into a little cluster to try it out.
The first thing I ran into was that SeaweedFS doesn't have any NixOS options. I couldn't find any flakes for it either, and my attempt to create one took me three days with little success. Instead, I ended up cheating and just grabbed the best-looking module I could find[5] and dumped it directly into my flake. It isn't even an override.
5: https://hg.sr.ht/~dermetfan/seaweedfs-nixos/browse/seaweedfs.nix?rev=tip
Yeah, I would love to have a flake for this, but I'm not skilled enough to create it myself.
With that, a little fumbling got a master† server up and running. You only need one of these, so pick a stable server and set it up.
```
# src/nodes/node-0.nix
inputs @ { config, pkgs, flakes, ... }: {
  imports = [
    ../../services/seaweedfs.nix # the file from dermetfan
  ];

  services.seaweedfs.clusters.default = {
    package = pkgs.seaweedfs;

    masters.main = {
      openFirewall = true;
      ip = "fs.home"; # This is what shows up in the links
      mdir = "/var/lib/seaweedfs/master/main";
      volumePreallocate = true;

      defaultReplication = {
        dataCenter = 0;
        rack = 0;
        server = 0;
      };
    };
  };
}
```
This is a really basic setup that doesn't do much on its own; the master server is pretty much a coordinator. What is nice is that it starts a web server at `fs.home:9333` that lets you see that it is up and running (sadly, no dark mode). That site will also let you get to all the other servers through web links.
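If you would rather poke it from a terminal than a browser, the master also exposes JSON status endpoints. These paths are from memory, so treat them as a starting point rather than gospel:

```
# Ask the master who the leader is and which peers it knows about.
curl http://fs.home:9333/cluster/status

# Dump the topology (data centers, racks, volume servers) as JSON.
curl http://fs.home:9333/dir/status
```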
Another important part is `defaultReplication`. I made it explicit here, but when messing around, setting all three to `0` means you don't get hung up the first time you try to write a file and it tries to replicate to a second node that isn't set up yet. All zeros is basically “treat the cluster as a single large disk.”
Later on, you can change that easily. I ended up setting `rack = 1;` in the above example because I treat each node as a “rack” since I don't really have a server rack.
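Under the hood, those three numbers are SeaweedFS's three-digit replication placement (data center, rack, server). I assume the module just concatenates them into the string that `weed master` takes directly:

```
# Replication placement is a three-digit string: <dataCenter><rack><server>.
# "000" = no extra copies; treat the cluster as one big disk.
# "010" = one extra copy on a different rack (a different node, the way I label things).
weed master -mdir=/var/lib/seaweedfs/master/main -defaultReplication=010
```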
†I don't like using “master” and prefer main, but that is the terminology that SeaweedFS uses.
Next up was configuring a volume server. I ended up doing one per server (I have four nodes in the cluster now) even though three of them have multiple partitions/directories on different physical drives. In each case, I created an `ext4` filesystem on the partition and mounted it at a directory like `/mnt/fs-001`. I could have used ZFS, but I know and trust `ext4` and had trouble with ZFS years ago. It doesn't really matter, though; just make a drive available as a directory.
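For reference, “just make a drive” amounted to something like the sketch below, with `/dev/sda1` standing in for whatever the actual device is; on NixOS the mount itself would normally be declared in `fileSystems` rather than run by hand.

```
# Format the partition and mount it where the volume server will look.
mkfs.ext4 -L fs-001 /dev/sda1
mkdir -p /mnt/fs-001
mount /dev/sda1 /mnt/fs-001
```

With a directory in place, the volume server itself is just more Nix: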
```
# src/nodes/node-0.nix
inputs @ { config, pkgs, flakes, ... }: {
  imports = [
    ../../services/seaweedfs.nix # the file from dermetfan
  ];

  services.seaweedfs.clusters.default = {
    package = pkgs.seaweedfs;

    volumes.${config.networking.hostName} = {
      openFirewall = true;
      dataCenter = "home";
      rack = config.networking.hostName;
      ip = "${config.networking.hostName}.home";
      dir = [ "/mnt/fs-001" ];
      disk = [ "hdd" ]; # Replication gets screwy if these don't match
      max = [ 0 ];
      port = 9334;
      mserver = [
        { ip = "fs.home"; port = 9333; }
      ];
    };
  };
}
```
Once started up, this runs a volume service at `http://node-001.home:9334` (the port from the config above), connects to the master, which will then show a link to it on its page, and basically says there is plenty of space.
The key parts I found were `disk` and `max`.
Replication is based on data center, rack, and server, but it only happens between volumes whose disk types agree. So `hdd` will only sync to other `hdd`, even if half of them are actually `ssd` or `nvme`. Because I have a mix of NVMe and HDD, I marked them all `hdd` because it works and I don't really care.
The value of `0` for `max` means use all the available space. Otherwise, it only grabs a small number of 30 GB blocks and stops. Since I was dedicating the entire drive over to the cluster, I wanted to use everything.
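Those Nix options map more or less directly onto `weed volume` flags. Roughly what ends up running for this node is something like the following; it is my best guess at the generated command, not copied from the actual systemd unit:

```
weed volume \
  -mserver=fs.home:9333 \
  -ip=node-0.home \
  -port=9334 \
  -dataCenter=home \
  -rack=node-0 \
  -dir=/mnt/fs-001 \
  -disk=hdd \
  -max=0
```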
The final service needed is a filer. This is basically the POSIX-ish layer that lets you mount the cluster in Linux and start to do fun things with it. Like the others, it just gets declared on a server. I only set up one filer and it seems to work, but others set up multiple ones; I just don't really understand why.
```
# src/nodes/node-0.nix
inputs @ { config, pkgs, flakes, ... }: {
  imports = [
    ../../services/seaweedfs.nix # the file from dermetfan
  ];

  services.seaweedfs.clusters.default = {
    package = pkgs.seaweedfs;

    filers.main = {
      openFirewall = true;
      dataCenter = "home";
      encryptVolumeData = false;
      ip = "fs.home";
      peers = [ ];
      master = [
        # this is actually in cluster.masters that I import in the real file
        { ip = "fs.home"; port = 9333; }
      ];
    };
  };
}
```
Like the others, this starts up a web service at `fs.home:8888` that lets you browse the file system, upload files, and do fun things. Once this is all deployed (by your system of choice; mine is Colmena), it should be up and running, which means you should be able to upload a folder through the port 8888 site.
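You can also talk to the filer directly over HTTP. As far as I can tell, a plain multipart POST is enough to drop a file in; the path and file below are just a made-up example:

```
# Upload a file into the cluster through the filer's HTTP API.
curl -F file=@photo.jpg "http://fs.home:8888/photos/"

# Read it back out.
curl -O "http://fs.home:8888/photos/photo.jpg"
```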
I found the error messages a little confusing at times, but the problems weren't too much trouble to track down. I just had to tail `journalctl` and then try to figure it out.
```
journalctl -f | grep seaweed
```
If you have multiple servers, debugging requires doing this on each of them.
Adding more volumes is pretty easy. I just add a Nix expression to each node to include drives.
```
services.seaweedfs.clusters.default = {
  package = pkgs.seaweedfs;

  volumes.main = {
    openFirewall = true;
    dataCenter = "main";
    rack = config.networking.hostName;
    mserver = cluster.masters; # I have this expanded out above
    ip = "${config.networking.hostName}.home";
    dir = [ "/mnt/fs-002" "/mnt/fs-007" ]; # These are two 6 TB red drives
    disk = [ "hdd" "hdd" ]; # Replication gets screwy if these don't match
    max = [ 0 0 ];
    port = 9334;
  };
};
```
As soon as they deploy, they hook up automatically and increase the size of the cluster.
Mounting… this gave me a lot of trouble. NixOS does not play well with auto-mounting SeaweedFS, so I had to jump through a few hoops. In the end, I created a `mount.nix` file that I include on any node that needs to mount the cluster, which always goes into `/mnt/cluster`.
```
# mount.nix
inputs @ { config, pkgs, ... }:
let
  mountDir = "/mnt/cluster";

  # A script to go directly to the shell.
  shellScript = (pkgs.writeShellScriptBin "weed-shell" ''
    weed shell -filer fs.home:8888 -master fs.home:9333 "$@"
  '');

  # A script to list the volumes.
  volumeListScript = (pkgs.writeShellScriptBin "weed-volume-list" ''
    echo "volume.list" | weed-shell
  '');

  # A script to allow the file system to be mounted using Nix services.
  mountScript = (pkgs.writeShellScriptBin "mount.seaweedfs" ''
    if ${pkgs.gnugrep}/bin/grep -q ${mountDir} /proc/self/mountinfo
    then
      echo "already mounted, unmounting"
      exit 0
    fi

    echo "mounting weed: ${pkgs.seaweedfs}/bin/weed" "$@"
    ${pkgs.seaweedfs}/bin/weed "$@"
    status=$?

    for i in 1 1 2 3 4 8 16
    do
      echo "checking if mounted yet: $i"

      if ${pkgs.gnugrep}/bin/grep -q ${mountDir} /proc/self/mountinfo
      then
        echo "mounted"
        exit 0
      fi

      ${pkgs.coreutils-full}/bin/sleep $i
    done

    echo "gave up: status=$status"
    exit $status
  '');
in {
  imports = [ ../../seaweedfs.nix ];

  # The `weed fuse` returns too fast and systemd doesn't think it has succeeded
  # so we have a little delay put in here to give the file system a chance to
  # finish mounting and populate /proc/self/mountinfo before returning.
  environment.systemPackages = [
    pkgs.seaweedfs
    shellScript
    volumeListScript
    mountScript
  ];

  systemd.mounts = [
    {
      type = "seaweedfs";
      what = "fuse";
      where = "${mountDir}";

      mountConfig = {
        Options = "filer=fs.home:8888";
      };
    }
  ];
}
```
So, let me break this into parts. SeaweedFS has a nice little interactive shell where you can query status, change replication, and do lots of little things. However, it requires a few parameters, so the first thing I did was create a shell script called `weed-shell` that provides those parameters so I don't have to type them.
```
$ weed-shell
```
The second thing I wanted while doing this was a list of all the volumes. SeaweedFS stores data in 30 GB volume files instead of thousands of little files, which makes a lot of things more efficient (replication is done on whole volumes).
```
$ weed-volume-list | head
.> Topology volumeSizeLimit:30000 MB hdd(volume:810/1046 active:808 free:236 remote:0)
  DataCenter main hdd(volume:810/1046 active:808 free:236 remote:0)
    Rack node-0 hdd(volume:276/371 active:275 free:95 remote:0)
      DataNode node-0.home:9334 hdd(volume:276/371 active:275 free:95 remote:0)
        Disk hdd(volume:276/371 active:275 free:95 remote:0)
          volume id:77618 size:31474091232 file_count:16345 replica_placement:10 version:3 modified_at_second:1708137673
          volume id:77620 size:31501725624 file_count:16342 delete_count:4 deleted_byte_count:7990733 replica_placement:10 version:3 modified_at_second:1708268248
          volume id:77591 size:31470805832 file_count:15095 replica_placement:10 version:3 modified_at_second:1708104961
          volume id:77439 size:31489572176 file_count:15067 replica_placement:10 version:3 modified_at_second:1708027468
          volume id:77480 size:31528095736 file_count:15118 delete_count:1 deleted_byte_count:1133 replica_placement:10 version:3 modified_at_second:1708093312
```
When doing things manually, that was all I needed to see that things were working and get the warm and fuzzy feeling.
The trouble with getting it to mount automatically (or even with `systemctl start mnt-cluster.mount`) is that the command to do so is `weed fuse /mnt/cluster -o "filer=fs.home:8888"`.
NixOS doesn't like that.
So my answer was to write a shell script that fakes a `mount.seaweedfs` and calls the right thing. Unfortunately, it rarely worked, and it took me a few days to figure out why. While `weed fuse` returns right away, I'm guessing network latency means that `/proc/self/mountinfo` doesn't update until a few seconds later. By then, `systemd` had already queried the `mountinfo` file, seen that it wasn't mounted, and declared the mount failed.
But, by the time I (as a slow human) looked at it, the `mountinfo` showed success.
The answer was to delay returning from `mount.seaweedfs` to give SeaweedFS a chance to finish, so `systemd` could see it was mounted and wouldn't fail the unit. Hence the loop, grep, and sleeping inside `mount.seaweedfs`. Figuring that out required a lot of reading code and puzzling through things, so hopefully it will help someone else.
After I did that, though, it has been working pretty smoothly, including recovering on reboot.
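With the unit in place, checking that everything came back after a reboot is just the usual poking around; the unit name comes from the `/mnt/cluster` path:

```
# systemd derives the mount unit name from the mount point.
systemctl status mnt-cluster.mount

# Confirm the FUSE mount is actually serving files.
df -h /mnt/cluster
ls /mnt/cluster
```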
As I mentioned above, once I was able to migrate the Ceph cluster, I changed replication to `rack = 1;` to create one extra copy across all four nodes. However, SeaweedFS doesn't automatically rebalance like Ceph does. Instead, you have to go into the shell and run some commands.
```
$ weed-shell
lock
volume.deleteEmpty -quietFor=24h -force
volume.balance -force
volume.fix.replication
unlock
exit
```
You can also set it up to run automatically, but I'm not entirely sure I've done that correctly, so I'm not going to show my attempt.
One of the biggest things I noticed is that Ceph does proactive maintenance on drives. It doesn't sound like much, but I feel more comfortable that Ceph would detect errors. It also means that the hard drives are always running in my basement; just the slow grind of physical hardware as Ceph scrubs and shuffles things around.
SeaweedFS is more passive in that regard. I don't trust it to catch a failing hard drive as quickly, but it still avoids the failure modes of RAID and lets me spread data across multiple servers and locations. There is also a feature for uploading to an S3 server if I wanted it, though I use a [[Restic]] service for my S3 uploads.
That passivity also means it hasn't been grinding my drives as much and I don't have to worry about the SSDs burning out too quickly.
Another minor thing is that while there are far fewer options with SeaweedFS, it took me about a third of the time to get the cluster up and running. There were a few error messages that threw me, but for the most part, I understood the errors and what SeaweedFS was looking for. That was not always the case with Ceph, where I had a few year-long warnings that I never figured out how to fix and was content to leave as-is.
I do not like the lack of dark mode on SeaweedFS's websites.
I continue to like Ceph, but I also like SeaweedFS. I would use either, depending on the expected load. If I were running Docker images or doing coding on the cluster, I would use a Ceph cluster. But in my case, I'm using it for long-term storage: video files, assets, and photo shoots, not to mention my dad's backups. So I don't need the interactive performance of Ceph along with its higher level of maintenance.
Also, it is a relatively simple Go project, doesn't take six hours to build, and uses concepts I already understand (`mkfs.ext4`), so I'm more comfortable with it.
It was also available at the point I wanted to play (though Ceph is building on NixOS unstable again, so that is a moot problem now; I was just being impatient and wanted to learn something new).
At the moment, SeaweedFS works out nicely for my use case, so I decided to switch my entire Ceph cluster over. I don't feel as safe with SeaweedFS, but I feel Safe Enough™.