💾 Archived View for d.moonfire.us › blog › 2022 › 12 › 10 › ceph-and-nixos captured on 2024-09-29 at 00:57:54. Gemini links have been rewritten to link to archived content
-=-=-=-=-=-=-
Recently, along the continuing chain of my entanglement[1], the media server died. We'd been ignoring the symptoms for a while: randomly crashing with hard drive failures, refusing to start up, and occasionally making a high-pitched noise. All signs that something terrible was going to happen, but we were in the middle of other problems and just pushed it off.
Then it died properly and wouldn't get back up.
Over the years, I've been looking at Ceph[2] ever since DreamHost[3] did a blog post about it. It seemed perfect for my previous experiences in crashing RAIDs and trying to find enough disk space to ram “just one more database” needed to get an analysis done. These same problems leaked into my media collection and keeping track of Partner's photo shoots.
When the drives crashed, I figured I'd sit down and do a Ceph cluster instead and see if it would be easier to bring more drives online as I run out without having to tear down servers and replace drives or find another spot to fit the server. Plus the whole idea of being able to have duplicates appealed to me.
After a bit of hemming and hawing, I picked up five 6 TB drives and waited half a month for them to show up (because I'm avoiding Amazon as much as possible, and NewEgg[4] is great except that I cannot run their application on my phone).
After putting in the first drive, I promptly set up the other four for Ceph. Because of various bugs, frustrations, and learning curve difficulties, it took me almost a week. I was expecting NixOS to be able to set up drives, but it couldn't. It has options for things, but there are a lot of little fiddly bits and dials to do manually and then turn them into the Nixian way after the fact.
Setting up an OSD on Nix seems pretty simple:
services.ceph.osd = {
  enable = true;
  daemons = ["0"];
};
But that was not meant to be. NixOS doesn't really set things up for you, so you have to do it manually. Also, Ceph plays well with `systemd` but it does not play with Nix's version of systemd.
ceph-volume lvm create --data /dev/sdb --no-systemd
The above command will create the proper entries in `/var/lib/ceph/osd/osd-0` and things will appear to be working fine. But I found this is a lie. When the system restarts, it will helpfully wipe out the entire contents of `/var/lib/ceph/osd/osd-0`.
To fix that, after I call `ceph-volume` above, I had to do this to get the state to survive a restart.
cd /var/lib/ceph/osd
tar -cjf ~/osd-0.tar.bz2 osd-0
systemctl restart ceph-osd-0.service
tar -xjf ~/osd-0.tar.bz2
chown -R ceph:disk osd-0
systemctl restart ceph-osd-0.service
After that, `systemctl restart ceph-osd.target` didn't blow away all my files. The wiping seems to come from some tmpfiles rules in the systemd configuration for OSD that don't seem to be in the mon, mds, or mgr entries.
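A quick way to confirm the restore actually stuck is to check the OSD directory after each restart. This is only a sketch: the function name `osd_dir_intact` is made up for illustration, and the list of file names is an assumption about what a typical BlueStore OSD directory contains, so adjust it to whatever your own directory holds.

```shell
# osd_dir_intact DIR
# Succeeds only if DIR still contains the files we expect an OSD to need.
# ASSUMPTION: the file names below are typical for a BlueStore OSD directory;
# they are not taken from the original post.
osd_dir_intact() {
    dir=$1
    for name in keyring whoami fsid type; do
        if [ ! -e "$dir/$name" ]; then
            # Report the first missing file on stderr and fail.
            echo "missing: $dir/$name" >&2
            return 1
        fi
    done
    return 0
}
```

Running something like `osd_dir_intact /var/lib/ceph/osd/osd-0` right after `systemctl restart ceph-osd.target` would tell me whether the directory got emptied again.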
It took another week of copying as much as I could off the existing corrupted drives. Thankfully, I could mount them with `ntfs-3g`, which meant `rsync` could grind through them, spending about an hour on every unrecoverable file before moving on.
Over the years, I've had to recover bad drives thrice now. One time it took me almost two weeks to recover what I could off of Partner's laptop drives. This was one of the big reasons why I decided to go with Ceph, to handle the cases when we have photoshoots and large files and then lose the hard drive they are stored on.
I managed to get it up, only to find out my first mistake: Ceph needs 1 GB of RAM for every 1 TB of disk and I had 24 TB on a 16 GB machine. I also only had an 8 GB swap partition for the machine, which meant everything was just grinding away.
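The arithmetic is simple enough to check before buying drives. Here is a minimal sketch using the 1 GB per 1 TB rule of thumb from above; the function name `ram_shortfall_gb` is hypothetical, and it ignores whatever the OS and other daemons need.

```shell
# ram_shortfall_gb DISK_TB RAM_GB
# Prints how many GB of RAM the host is short, using the rule of thumb
# of 1 GB of RAM per 1 TB of OSD storage. Prints 0 if there is enough.
ram_shortfall_gb() {
    disk_tb=$1
    ram_gb=$2
    needed_gb=$disk_tb   # 1 GB per TB, so the numbers are the same
    if [ "$needed_gb" -gt "$ram_gb" ]; then
        echo $((needed_gb - ram_gb))
    else
        echo 0
    fi
}
```

For my case, `ram_shortfall_gb 24 16` prints `8`, which is the whole 8 GB swap partition spoken for before anything else runs.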
To compensate, I could have removed the drives, but I had just spent a week copying files over and the old drives were basically toast. So I spent a few hundred dollars and picked up two cheap Dell business towers from NewEgg instead. When they showed up, I started the process of moving one of the drives, which involves marking the OSD (the disk) “out” and then letting Ceph gracefully move the files off that drive so it can be safely moved.
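The “out” and wait steps can be sketched as a small function. This is not the exact sequence I typed, just an illustration under assumptions: `drain_osd` is a made-up name, the polling interval is arbitrary, and it leans on `ceph osd out` and `ceph osd safe-to-destroy`, the latter being what the Ceph documentation suggests polling before pulling an OSD.

```shell
# drain_osd OSD_ID [INTERVAL_SECONDS]
# Marks the OSD "out" so Ceph starts moving data off it, then polls until
# Ceph reports that the OSD can be removed without losing data.
# ASSUMPTION: the default interval of 60 seconds is arbitrary.
drain_osd() {
    id=$1
    interval=${2:-60}
    # Stop here if marking the OSD out fails.
    ceph osd out "$id" || return 1
    # safe-to-destroy exits non-zero while data still lives on the OSD.
    until ceph osd safe-to-destroy "osd.$id"; do
        sleep "$interval"
    done
}
```

With this, `drain_osd 0` would block until osd.0 is empty, which on a 6 TB drive can take a long while.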
The second mistake was a minor one: the new computers didn't have drive rails, so I needed to order a few more ports while I waited for the first disk to be cleared off so I could move it to the second machine. I decided to put the 3.5" drives into the 5.25" bay because I didn't need the DVD and it was easier to run the wires.
The third mistake was probably a big one. I picked the wrong drive to “out” and moved a live drive into the second machine. I learned that Ceph is very tolerant of moving said drives, but it gets very cranky. The old PC (my first Ceph server) was struggling, and the rubber for the mounting screws had cracked, so I decided to take the slow approach: “in” the drive I thought I was going to remove, “out” the one I was actually moving, and use a Sharpie to label said drives so I don't make that mistake again.
When I had three systems, I decided I needed to bring up a monitor on all three to get some balancing. Not to mention that, with only one monitor, having it go down means the entire system goes down. However, this took me a few tries because of how NixOS handles systems.
export MID=$(hostname)
export MIP=$(host $MID.local | cut -f 4 -d ' ')
cd /var/lib/ceph/mon
mkdir ceph-$MID
cd ceph-$MID
mkdir /tmp/add-ceph-mon
ceph auth get mon. -o /tmp/add-ceph-mon/keyring
ceph mon getmap -o /tmp/add-ceph-mon/map
ceph-mon -i $MID --mkfs --monmap /tmp/add-ceph-mon/map --keyring /tmp/add-ceph-mon/keyring
ceph-mon -i $MID --public-addr $MIP
rm -rf /tmp/add-ceph-mon
Starting up the new monitor is fine as long as you do not touch the NixOS configuration files. The first part of the script uses `hostname` and `host` to make the monitor ID equal to the server name and to look up its address. When the monitor comes up, it will be running in daemon mode, which means NixOS will forget it on restart:
root 1792 1.0 0.3 575324 59284 ? Ssl 17:00 0:00 ceph-mon -i notil --public-addr 192.168.2.36
In this case, I kill that process, either with `kill 1792` or `killall ceph-mon`.
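One fragile spot in the script earlier is the `MIP` line: `cut -f 4 -d ' '` only gives the address when `host` prints a line shaped like `notil.local has address 192.168.2.36`. If the first line is an IPv6 answer, field 4 is the word “address” instead. A slightly more defensive sketch, with `ip_from_host_output` as a hypothetical helper name:

```shell
# ip_from_host_output
# Reads `host` output on stdin and prints the IPv4 address from the first
# "NAME has address IP" line. IPv6 lines ("has IPv6 address") are skipped.
ip_from_host_output() {
    grep ' has address ' | head -n 1 | cut -f 4 -d ' '
}
```

The export would then become `export MIP=$(host $MID.local | ip_from_host_output)`.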
Then, I configure NixOS to start up the monitor and use `colmena` to push out the changes.
services.ceph = {
  mon.enable = true;
  mon.daemons = ["paruk"];
};
Once I make sure everything is finally up, then I add them to the initial hosts.
services.ceph = {
  global = {
    monInitialMembers = "muliq,notil,paruk";
    monHost = "muliq.local,notil.local,paruk.local";
  };
};
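The two settings are the same list twice, once as bare monitor IDs and once as host names, so they are easy to get out of step when adding a machine. A small sketch that prints both lines from one list of names; `mon_global_settings` is a made-up helper, and the `.local` suffix is an assumption that matches my network.

```shell
# mon_global_settings HOST...
# Prints the two Nix settings for the given monitor names.
# ASSUMPTION: every monitor is reachable as NAME.local.
mon_global_settings() {
    members=""
    hosts=""
    for h in "$@"; do
        # ${var:+...} adds the comma only once the list is non-empty.
        members="${members:+$members,}$h"
        hosts="${hosts:+$hosts,}$h.local"
    done
    printf 'monInitialMembers = "%s";\n' "$members"
    printf 'monHost = "%s";\n' "$hosts"
}
```

Calling `mon_global_settings muliq notil paruk` prints the two lines used in the `global` block above.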
Overall, I'm still happy with this. There is a huge learning curve when it comes to Ceph and NixOS, mainly in how they interact with each other. I assume there will be more difficulties but I seem to be heading in the right direction.