BTRFS is a Linux file system that uses a Copy On Write (COW) model. It provides many features such as on-the-fly compression, volume management, snapshots, clones, etc.
Wikipedia page about Copy on write
However, BTRFS doesn't natively support deduplication, a feature that looks at the chunks of data within files to find chunks shared by other files; when a match is found, a single copy of the chunk can be used by both files. In some scenarios, this can drastically reduce disk space usage.
This is where we can use "bees", a program that does offline deduplication for BTRFS file systems. In this context, offline means deduplication is done after the fact by a separate program scanning the data, as opposed to live/on-the-fly deduplication applied as data is written. The HAMMER file system from DragonFly BSD does offline deduplication, while ZFS does it live. There are pros and cons to both models: the ZFS documentation recommends 1 GB of memory per terabyte of disk when deduplication is enabled, because it needs to keep all the chunk hashes in memory.
Bees is a service you need to install and start on your system. It has some documented limitations and caveats, but it should work for most users.
You can define the BTRFS file systems on which you want deduplication, and a load average target. Bees works silently while your system load stays below the threshold, and stops when the load exceeds it; this is a simple mechanism to prevent bees from eating all your system resources whenever freshly modified or created files need to be scanned.
The first time you run bees on a file system that is not empty, it may take a while to scan everything. After that, it is really quiet except when you do heavy I/O operations like downloading big files, and it does a good job of staying in the background.
Add this code to /etc/nixos/configuration.nix and run "nixos-rebuild switch" to apply the changes.
services.beesd.filesystems = {
  root = {
    spec = "LABEL=nixos";
    hashTableSizeMB = 256;
    verbosity = "crit";
    extraOptions = [ "--loadavg-target" "2.0" ];
  };
};
This code assumes your root partition is labelled "nixos", that you want a 256 MB hash table (the memory bees uses to track chunks), and that you don't want bees to run when the system load average is above 2.0.
You may want to tune these values, mostly the hash table size, depending on your file system size. Bees is designed for terabyte-scale file systems, but this doesn't mean you can't use it on an average user's disk.
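As an illustration, here is a hedged sketch of what the configuration could look like for a larger, hypothetical data volume; the "data" name, the "LABEL=data" spec and the 1024 MB hash table are assumptions for the example, not values from my setup.

services.beesd.filesystems = {
  data = {
    # Hypothetical multi-terabyte BTRFS volume labelled "data" (assumption).
    spec = "LABEL=data";
    # A larger hash table lets bees remember more chunk hashes, so fewer
    # duplicate chunks are missed; it costs that much resident memory.
    hashTableSizeMB = 1024;
    verbosity = "crit";
    # Same load average threshold as the root file system example above.
    extraOptions = [ "--loadavg-target" "2.0" ];
  };
};

The hash table size is a trade-off: too small and bees forgets chunk hashes before it can match duplicates, too large and it wastes memory.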
I tried it on my workstation, which holds a lot of build artifacts and git repositories: bees reduced the disk usage from 160 GB to 124 GB, so it's a huge win there.
Later, I tried again on some Steam games with a few Proton versions installed; it didn't save much on the games themselves, but saved a lot on the Proton installations.
On my local cache server, it saved nothing, but that was to be expected.
BTRFS is a solid alternative to ZFS: it requires less memory while providing volumes, snapshots, and compression. The only thing missing for me was deduplication, and I'm glad it's done offline, so it doesn't use too much memory.