- 2023-07-29
This week I ran into a proper disk failure, which was an interesting experience. This was actually the first time I've had to deal with a failing disk in a RAID1 array, and I don't recall encountering any other disk failures with BTRFS before this one either.
But yeah, I keep a RAID1 array of 2 hard drives for storing my livestream VODs. It started out as my general video production array back when I was still doing YouTube, but these days it pretty much just stores livestreams I've done over the years. The only real "video production" I do on it is re-encoding my VODs from H.264 to AV1 and then uploading them to my livestream VOD site:
My VOD site in case you are curious
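For the curious, the re-encode step is essentially a one-liner; a sketch using SVT-AV1 and Opus, assuming an ffmpeg build that includes them (the quality settings here are illustrative placeholders, not my actual ones):

```
# Re-encode an H.264 VOD to AV1 with Opus audio; CRF and preset
# values are illustrative placeholders.
ffmpeg -i vod.mp4 -c:v libsvtav1 -crf 35 -preset 6 -c:a libopus -b:a 128k vod-av1.mkv
```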
This array was originally just a single 1 TB Seagate Barracuda (I've gone through many Barracudas over the years), but when I started moving my stuff to BTRFS, I figured I would throw together a RAID1 out of two HDDs (1 TB at the time), mostly as a fun exercise.
2020-11-28: BTRFS is consuming my computer
Anyway, I've grown that array over time, and eventually, when I moved basically everything to SSDs, I was left with a 2 TB Seagate FireCuda drive. I built the array around that drive and got a second 2 TB drive to act as the mirror.
Those FireCudas are kind of interesting. Seagate doesn't make them anymore because cheap SSDs ate their lunch, but they were basically hybrid drives that were supposed to be a middle ground between a small-but-fast SSD and a big-but-slow HDD. Basically an HDD with an SSD cache to speed things up. They are also quite unreliable. I've had two of them over the years and the first one died pretty quickly. This second one survived to 3 years and 4 months of power-on time.
But yeah, eventually all drives must die, and the FireCuda started acting up. I probably wouldn't have even noticed it otherwise, but I had set up a monthly scrub and an alert service that would email my local mailbox if anything reported errors. And it just so happens that the scrub caught a lot of errors. Over 13k errors, in fact.
The array was able to cope with it because the other drive was functional, so all of those errors were corrected. However, 13k errors appearing suddenly was a big indicator that things were going wrong, and the FireCuda's SMART stats reported 2500 bad sectors. So, I knew the drive was going bad.
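The checks involved boil down to something like this (a sketch; the mount point, device name and mail setup are placeholders, not my exact configuration):

```
# Monthly scrub: read and verify every copy of every block,
# repairing from the good mirror where checksums fail.
btrfs scrub start -B /mnt/vods

# Per-device error counters; --check makes the command return
# non-zero if any counter is above zero, which is easy to alert on.
if ! btrfs device stats --check /mnt/vods > /tmp/btrfs-stats.txt; then
    mail -s "BTRFS errors on /mnt/vods" root < /tmp/btrfs-stats.txt
fi

# SMART attributes for the suspect drive; Reallocated_Sector_Ct is
# where the "bad sectors" figure comes from.
smartctl -A /dev/sdb
```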
A day later a second scrub resulted in another few thousand errors, and the bad sector count had increased to 3000. I had by this point already ordered two 4 TB drives as replacements. The next day I began the replacement procedure, and the bad sector count was already in the 6000s. At this point the drive was pretty much in free fall and ready to keel over at any moment.
The replacement went mostly well, although I made a couple of mistakes.
First of all, when I began the replacement, I figured I'd just fail the array by unplugging the bad drive, plugging a replacement in and booting the array as degraded. Except systemd didn't like a degraded /etc/fstab mount and refused to boot. I couldn't even get myself an emergency shell. So, I popped the bad drive back in and booted up, figuring I'd just remove the drive from the array in the OS.
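For reference, a degraded BTRFS mount has to be asked for explicitly, and a couple of fstab options can at least stop a missing data array from blocking the whole boot. A rough sketch, with placeholder device, UUID and mount point:

```
# Mounting the surviving half of the mirror by hand, e.g. from a
# live USB or emergency shell (placeholder device and mount point):
mount -o degraded /dev/sdc /mnt/vods

# An fstab entry that won't hold the boot hostage if the array is
# missing; "nofail" plus a short device timeout do the work here:
# UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  /mnt/vods  btrfs  defaults,nofail,x-systemd.device-timeout=30s  0 0
```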
Now, this is where I messed up a bit, and it's because I was being stupid. I tried to remove the drive via 'btrfs device remove', but since it's a RAID1 of two drives, the tool didn't let me drop down to a single drive. Which makes sense. I then figured that I could just switch the data to the "single" profile and then drop the drive.
This is a _really_ bad move and I should have realized it instantly.
Switching a multi-device array in BTRFS to "single" means keeping a single copy of all data. It's basically RAID0 without the striping: you just put each drive after the other, and BTRFS writes to the drives such that each has about the same amount of free space at any given time.
This meant that I unintentionally ordered BTRFS to start rewriting data that had been mirrored as unmirrored, with some of it going back onto the bad drive. The process made it about 500 MB in before I realized my error and cancelled the operation. I then attempted to convert that data back to RAID1, but by this point it was already too late: the drive was so far gone that any data written to it was instantly corrupted.
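In command terms, the blunder and the attempted fix look roughly like this (a sketch with a placeholder mount point, not a transcript of what I actually typed):

```
# The bad idea: rewrite data chunks from raid1 to single, spreading
# unmirrored data across both drives, including the failing one.
btrfs balance start -dconvert=single /mnt/vods

# Aborting the balance once the mistake sinks in.
btrfs balance cancel /mnt/vods

# Converting the data back to raid1; too late once anything written
# to the bad drive comes back corrupted.
btrfs balance start -dconvert=raid1 /mnt/vods
```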
The best course of action at this point was to keep the old drives in and then plug the replacement in as a third drive. I had to take my games SSDs out of my /etc/fstab and rip one of them out of the system to have enough SATA ports for this, but it wasn't too bad.
I then began the process of replacing the old device with the new one, which took a couple of hours. The sketchiest moment was when the replace hung at the end, but after checking 'journalctl' it became clear to me that one file could not be read. This was the 500 MB that I had written back onto the bad disk. I copied the parts of it that were still readable to an external drive and deleted the original, after which the replace concluded successfully.
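The replace itself boils down to something like this (placeholder device names, devid and mount point; the resize step matters because the new drive is bigger than the old one):

```
# Replace the failing device with the new drive that was plugged in
# as a third disk (not yet part of the array).
btrfs replace start /dev/sdb /dev/sdd /mnt/vods

# Watch progress; a stall near the end is where an unreadable file
# shows up, with the details in the kernel log via journalctl.
btrfs replace status /mnt/vods

# Afterwards, grow the filesystem on the new, larger device. The
# devid (2 here) can be looked up with 'btrfs filesystem show'.
btrfs filesystem resize 2:max /mnt/vods
```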
So, I ran into a bad drive failure and walked away from it with only one VOD file damaged. I would have had a perfect recovery had I not been stupid and asked the filesystem to convert a readable mirror into single data that was vulnerable to corruption. May this serve as a lesson for me in the future.
The biggest annoyance here was the fact that my system just wouldn't boot with a degraded array or even drop me into an emergency shell. Had it done so, I could have saved myself some fiddling with cables and additional boots.
It would have also been good if 'btrfs balance' required '--force' to downgrade the data profile to a less safe one. It already does that for metadata, so I think it ought to do so for data as well. This would probably have stopped me long enough to make me realize what I was about to do was stupid, although that's of course not fully guaranteed.
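For context, the existing behaviour looks roughly like this (sketch, placeholder mount point): reducing metadata redundancy is refused until you pass --force, while the same downgrade for data just goes ahead.

```
# Refused: balance bails out and tells you to pass --force when a
# conversion would reduce metadata redundancy.
btrfs balance start -mconvert=single /mnt/vods

# Goes through only with the explicit override.
btrfs balance start --force -mconvert=single /mnt/vods

# Allowed without any prompt: the equivalent downgrade for data.
btrfs balance start -dconvert=single /mnt/vods
```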
But yeah, overall a solid showing I think. Now my RAID1 array has been grown to 4 TB and ought to serve me decently well for a while. And the next time drives start failing on me, hopefully I'll be even better prepared to tackle the problem.