💾 Archived View for thrig.me › tech › var-is-full.gmi captured on 2024-02-05 at 10:49:18. Gemini links have been rewritten to link to archived content
This is a standard job interview question I've used a lot for prospective unix sysadmins. One advantage is that it can be stopped early should they struggle with it. The trick is that there is no file (or files) that is (or are) taking up all the space, yet the partition is totally full. "Deer in the headlights" can be common here, or some have tried rebooting the server, which does fix the problem of /var being 100% full (temporarily, maybe). If I have to nudge them towards commands like df and du the question isn't going very well.
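One rough way to see the mismatch the question hinges on is to compare what the filesystem says is in use (the df view) with what a walk of the directory tree can account for (the du view). The sketch below does that in Python. The names du_bytes and df_used_bytes are made up for illustration, and it assumes st_blocks counts 512-byte units, which is usual but not guaranteed everywhere.

```python
import os
import shutil


def du_bytes(root):
    """Sum the allocated bytes reachable by walking root, roughly like du -x."""
    root_dev = os.lstat(root).st_dev
    seen = set()   # (device, inode) pairs, so hard links count only once
    total = 0
    for dirpath, dirnames, filenames in os.walk(root):
        try:
            if os.lstat(dirpath).st_dev != root_dev:
                dirnames[:] = []   # another filesystem is mounted here, skip it
                continue
        except OSError:
            continue               # directory vanished while we were looking
        for path in [dirpath] + [os.path.join(dirpath, f) for f in filenames]:
            try:
                st = os.lstat(path)
            except OSError:
                continue           # file vanished while we were looking
            key = (st.st_dev, st.st_ino)
            if key not in seen:
                seen.add(key)
                total += st.st_blocks * 512   # assumption: 512-byte units
    return total


def df_used_bytes(path):
    """Bytes the filesystem holding path reports as used, like df."""
    return shutil.disk_usage(path).used


if __name__ == "__main__":
    # Only meaningful when the path is the root of its own partition.
    path = "/var"
    print("df says used:  ", df_used_bytes(path))
    print("du can explain:", du_bytes(path))
```

If the first number is much larger than the second (and the walk ran as a user able to read everything), the space is being held by something that no longer has a name in the directory tree.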
On the other hand, this question can go pretty deep and may veer off into discussions about why a filesystem can be 105% full or how some filesystems can run out of inodes, hard links versus soft links, etc., if the candidate is conversant about unix filesystems. Or, they might ask how the issue happened and what could be done to prevent it from happening, or better monitored for in the future. Was the issue due to poor log rotation, maybe? Why didn't monitoring notice before the partition was completely full? There's monitoring of disk space usage, right?
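For the inode angle, a monitoring check can look at inode counts as well as block counts. This is a minimal sketch using os.statvfs; the function names are hypothetical, and some filesystems report zero total inodes because they allocate them on demand, so that case is treated as "not applicable" rather than as full.

```python
import os


def inode_percent_used(total, free):
    """Percent of inodes in use, or None when the filesystem reports no fixed total."""
    if total == 0:
        return None
    return 100.0 * (total - free) / total


def inode_report(path):
    """Return (total inodes, free inodes, percent used) for the filesystem holding path."""
    st = os.statvfs(path)
    return st.f_files, st.f_ffree, inode_percent_used(st.f_files, st.f_ffree)


if __name__ == "__main__":
    total, free, pct = inode_report("/var")
    print("inodes total=%d free=%d used=%s" % (total, free, pct))
```

A partition can have plenty of free blocks and still refuse new files once the free inode count reaches zero, which is why df -i exists.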
Wait, a filesystem can be more than 100% full? Yes, on BSD. Probably not linux. Details vary. What is going on here is that there is some amount of space (five percent, say) reserved for root, and if a filesystem fills up, and then root-written files eat into the root-reserved space, the output from df can show more than 100% usage. Poorly written monitoring scripts may fail at this point if they assume the number cannot be larger than 100%, and then you may not be notified of the problem because your monitoring is buggy.
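The arithmetic behind the 105% can be sketched as follows. This assumes the BSD-style calculation as I understand it, where the percentage is used blocks divided by used blocks plus the blocks available to non-root users, and where that available count goes negative once root writes into the reserve. The function names and the numbers are invented for illustration.

```python
def df_capacity_percent(blocks, bfree, bavail):
    """Percent full, BSD df style: used / (used + available to non-root)."""
    used = blocks - bfree
    denominator = used + bavail
    if denominator <= 0:
        return 100.0
    return 100.0 * used / denominator


def needs_alert(percent, threshold=90.0):
    """Alert on anything at or over the threshold; never assume 100 is the ceiling."""
    return percent >= threshold


if __name__ == "__main__":
    blocks = 1000    # hypothetical filesystem size in blocks
    reserve = 50     # five percent held back for root

    # Non-root users have filled everything they are allowed to use.
    bfree = 50
    print(df_capacity_percent(blocks, bfree, bfree - reserve))

    # Root has since written 40 more blocks into the reserve.
    bfree = 10
    print(df_capacity_percent(blocks, bfree, bfree - reserve))
```

The second case works out to 990 / 950, or a little over 104%. A check written as "percent == 100" or one that rejects values above 100 as invalid would miss exactly the situation it was meant to catch.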
Why is there space reserved for root? The assumption is, or was, that if things are screwed up, root needs to be able to write to the filesystem to fix the problem, and if there's no space for them to be able to do that, then they'd need to be able to copy something somewhere else first. This is less of a problem in these days of ample and cheap storage, but could have been a huge problem if you only had one disk and did you ever test the backup tapes? No? Well…
Funny story time! A user had run their laptop disk 100% full for months. The Windows admin (this was a Windows laptop) had warned them not to do that, but did the user listen? Nope. Anyhoo the user eventually came back because the system, somehow, had gotten pretty unusable. Because the disk was 100% full, and had been, for months. So the Windows admin tried to do a backup (the user hadn't been doing any of those) which filled up the backup drive. Not daunted by using all the space on a local USB drive (which had more space available than the laptop drive being backed up), the Windows admin tried to back the disk up to the network share (very much larger than the laptop drive). This completely filled the network share, and caused a cascade of new problems. (I may have warned the Windows admin not to try that, having heard what had happened to the USB drive.) What had probably happened is that the filesystem had gotten itself looped, so therefore robocopy saw a really, really big endless graph to transfer. Wow! More files! Wow!
Being able to tell a directed acyclic graph apart from a cyclic graph might be handy. Like, how would you detect whether you were looping forever through a graph that might be pretty huge if there's ~500G of who knows how many files?
      --- ...
     /
    |
  +---+   +------+
  | / |---| /bin |--- ...
  +---+   +------+
    |
     \    +------+
      ----| /etc |--- ...
          +------+

      --- ...
     /
    /  -------------------------------\
    | /                               |
    | |                               | hard link
  +---+   +------+                    |
  | / |---| /bin |--- ...             |
  +---+   +------+                    |
    |                                 |
     \    +------+   +-----------+    |
      ----| /etc |---| /etc/ohno |---/
          +------+   +-----------+
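One answer to the looping question is to remember the (device, inode) pair of every directory on the path from the root down to wherever the walk currently is; meeting one of those again means a cycle. Here is a minimal sketch of that idea. The function name find_cycles is made up, and since ordinary users usually cannot hard link directories, a symlink that points back up the tree stands in for the looped filesystem in the story.

```python
import os


def find_cycles(root):
    """Return the paths at which the walk arrived back at one of its own ancestors."""
    on_path = set()   # directories between root and the current position
    done = set()      # directories already fully explored
    cycles = []

    def visit(path):
        st = os.stat(path)              # follows symlinks on purpose
        key = (st.st_dev, st.st_ino)
        if key in on_path:
            cycles.append(path)         # an ancestor: this is a real loop
            return
        if key in done:
            return                      # shared subtree, not a loop
        on_path.add(key)
        with os.scandir(path) as entries:
            for entry in entries:
                if entry.is_dir():      # follows symlinks by default
                    visit(entry.path)
        on_path.remove(key)
        done.add(key)

    visit(root)
    return cycles
```

Keeping the ancestors separate from the already-explored set is what tells a cycle apart from a directory that is merely reachable by two routes. The recursion is fine for a test partition but would want to be made iterative for a very deep tree, and the sets grow with the number of directories, which matters at ~500G of who knows how many files.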
Make a test virtual system, or at least a very small test partition, ideally both in the event you screw up and delete everything. Then fill the partition up. Then try different tools to see what is reported. Is there a difference if root fills up the partition, as opposed to some other user? What happens if you create lots and lots of empty files? What happens if you have a process writing to a file, perhaps to a log file under /var, and then you unlink (rm) that file while the process is still running?
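The last experiment can be tried from a script as well as by hand. The sketch below writes to a file, unlinks it while the descriptor is still open, and then asks the kernel about the descriptor. The function name and the default size are arbitrary choices for illustration.

```python
import os
import tempfile


def unlink_while_open(directory, nbytes=1 << 20):
    """Write nbytes to a new file, unlink it, and report what the open descriptor still sees."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        os.write(fd, b"\0" * nbytes)
        os.unlink(path)                 # the name is gone from the directory
        st = os.fstat(fd)               # but the inode is still alive
        return os.path.exists(path), st.st_nlink, st.st_size
    finally:
        os.close(fd)                    # only now can the space be released


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        print(unlink_while_open(scratch))
```

On the systems I would expect this to be run on, the path no longer exists, the link count is zero, and the size is still whatever was written. Between the unlink and the close, du has nothing to find while df still counts the blocks.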
Practice is easier than learning how to do this in production, during an outage, while going on probably too little sleep.
http://man.openbsd.org/man1/df.1
http://man.openbsd.org/man1/du.1
http://man.openbsd.org/man1/find.1
http://man.openbsd.org/man1/fuser.1
http://man.openbsd.org/man1/yes.1
http://man.openbsd.org/man4/zero.4
On some systems fuser may need to be replaced with lsof or ss. The unix rosetta stone might help, or check the fine documentation that comes with the system in question.
Don't run filesystems too full is probably a moral to be had. Hence asking questions about it to prospective sysadmins. Other questions include how to monitor systems for problems (one answer: they ran top in a terminal).
tags #unix