Boundaries

OpenSMTPD has a feature whereby if the partition has less than 5% space available, mail is refused,

    #define MINSPACE                5
    ...
        if (100 - used < MINSPACE) {
                log_warnx("warn: not enough disk space: %llu%% left",
                    (unsigned long long) 100 - used);
                log_warnx("warn: temporarily rejecting messages");
                return 0;
        }

by way of /usr/src/usr.sbin/smtpd/queue_fs.c on OpenBSD 7.5. This may be surprising if one is prone to run the disk more than 95% full and wants messages to still be delivered. Presumably the worry is that if the disk is too full then the SMTP server might be in the awkward position of having accepted a message but not having the disk space to process it, which SMTP frowns on muchly. Hence, probably, the reason for this check: refuse the mail unless there is "adequate" space available.
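For illustration, a check of this sort might be sketched with POSIX statvfs(3) roughly as follows. This is not the OpenSMTPD code: the names percent_available, enough_space, and MIN_FREE_PERCENT are made up for the example, and computing the percentage the way df(1) reports it is an assumption about what "available" ought to mean.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/statvfs.h>

/* Hypothetical name; mirrors the 5% figure discussed above. */
#define MIN_FREE_PERCENT 5

/*
 * Percentage of space still available to unprivileged users, in the
 * style of df(1): avail / (used + avail). Integer math truncates, so
 * 4.9% reports as 4. A zero-sized filesystem is reported as 100.
 */
unsigned int
percent_available(uint64_t used, uint64_t avail)
{
	if (used + avail == 0)
		return 100;
	return (unsigned int)((avail * 100) / (used + avail));
}

/*
 * Returns 1 if the filesystem holding path has at least
 * MIN_FREE_PERCENT available, 0 if it does not, -1 if statvfs fails.
 */
int
enough_space(const char *path)
{
	struct statvfs sv;
	uint64_t used;

	assert(path != NULL);
	if (statvfs(path, &sv) == -1)
		return -1;
	/* Blocks in use, including those reserved for root. */
	used = (uint64_t)sv.f_blocks - (uint64_t)sv.f_bfree;
	return percent_available(used, (uint64_t)sv.f_bavail) >= MIN_FREE_PERCENT;
}
```

A caller would presumably treat 0 as "temporarily reject" and would need to decide for itself whether -1 (the check could not be made) should accept or refuse.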

Various percentages may be high these days (the root reserved space also comes to mind) given how much disk space has grown, though some have shrunk if you're running a small SSD with a full install. On the other hand a disk might fill up rather quickly when something has gone awry.

A major question here is who should be doing what checks; some may feel that the site administrator should have monitoring, warnings and then alerts as disk space dwindles, with the alarm levels set low enough to provide ample time for notification, investigation, talking with stakeholders, and various rectifications. Not all sites may be that organized, so some could argue that OpenSMTPD should ensure the system is sane enough for its needs. Hence the boundaries title: where do you put the boundaries on who is responsible for what?

There is precedent for programs checking on resources, notably for CPU usage, e.g. some ancient version of rogue(6) that checked whether the game was taking up resources that might be better used for more productive things:

    void
    checkout(int p)
    {
        static char *msgs[] = {
            "The load is too high to be playing.  Please leave in %d minutes",
            "Please save your game.  You have %d minutes",
            "Last warning.  You have %d minutes to leave",
        };
        int checktime;
    #ifdef SIGALRM
        signal(SIGALRM, checkout);
    #endif
        if (too_much())
        {
            if (num_checks == 3)
                fatal("Sorry.  You took to long.  You are dead\n");

Sendmail also had some knobs to refuse delivery if the CPU load was "too high", though this really depends on the system and how the load is calculated, and it risks refusing mail delivery should the load be climbing because of a wedged NFS mount that may not involve email delivery in any way. On the other hand, I've seen a Linux system do terrible things (corrupt payment transactions) at a load of 5000 where there were no checks for "is the system too busy?" nor really much if any monitoring given the nature of the system.

The Sendmail (or rogue) checks could also be done by, say, a cron job that would kill rogue or touch a file to make Sendmail start refusing mail. Such external checks allow a site more control, but may not work if the system is too busy to run a new cron job, or if the "refuse mail delivery" flag file cannot be written because the filesystem is full. (Hopefully you don't need a Kerberos ticket written to that full filesystem to be able to log in!) The process itself is already running, so it may be able to check for sanity without relying on an external process that may not be able to run, or cannot run quickly enough because that fast SSD burned through the last 105% disk space that was available.
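An in-process load check could be as small as the following sketch, which uses getloadavg(3) as found on the BSDs and in glibc. The function name system_too_busy and the choice to "fail open" (carry on as normal if the load cannot be read) are assumptions for the example, not how Sendmail or rogue actually did it.

```c
#define _DEFAULT_SOURCE	/* glibc hides getloadavg() under strict -std= modes */
#include <assert.h>
#include <stdlib.h>

/*
 * Hypothetical helper: returns 1 if the 1-minute load average is above
 * limit, 0 otherwise. If the load average cannot be obtained this
 * returns 0, i.e. the caller carries on as if the system were fine.
 */
int
system_too_busy(double limit)
{
	double avg[1];

	assert(limit > 0.0);
	if (getloadavg(avg, 1) != 1)
		return 0;
	return avg[0] > limit;
}
```

What a sensible limit is depends entirely on the system and on how that system calculates load, which is the same objection raised above; the sketch only shows that the check itself needs no external process, file, or free disk space.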

Also you probably should not run a filesystem "too full" as the filesystem or OS may not have enough free space to... manage the filesystem. A similar problem can exist for a database that does not have enough free disk space to perform various database operations, whoops! Fancy things like logical volume managers can help here, assuming there's more disk space that can be thrown at the problem and you're not trying to back up a horribly looped NTFS filesystem that the user ran at 100% full for months, after which the disk could not be backed up because robocopy tried to copy the loops over and over and over. There probably has not been enough testing at various boundaries and edge conditions where the disk is nearly or completely full, the CPU load too high, the memory use too intensive, etc. Maybe there should be, to see what breaks? Or maybe things should be better monitored and administered so the boundaries are never reached? Oh, well, we'll probably continue to muddle along as ever.

Threshold, Velocity

The usual means of checking disk space is to draw a line somewhere and notify when it is crossed. This is simple, and does not require any state. The line might need to be drawn pretty low (at an increased risk of false positives) if the storage used can grow quickly, as one may want sufficient time to notice and respond to an issue. A more complicated approach is to record a few data points (RRDtool might be good here), fit a line to the disk usage, and warn when the usage is projected to hit 100% or some other threshold in under some amount of time. This may allow an earlier warning of growth over time, and may be especially handy if there are users who need to clean up temporary files or whatnot, as coordinating all those users when disk space is critically short can be challenging. With a trend one might notify them three weeks in advance and recommend action be taken (and you can reference the original email when they ignore your advice and do run the disk critically short).
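The line fitting could be sketched as an ordinary least-squares fit over (time, percent used) samples. The function name seconds_until_limit, the units (seconds and percent), and the convention of returning -1 when no prediction is possible are all assumptions made for the example; a real setup would more likely lean on RRDtool or whatever the monitoring system provides.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Fit a straight line to n samples of (t[i], used[i]) by least squares
 * and return how many seconds after the last sample the line reaches
 * limit. Samples are assumed to be in time order, t in seconds, used
 * and limit in percent.
 *
 * Returns -1.0 if no prediction can be made (fewer than two samples,
 * all samples at the same time, or usage flat or shrinking) and 0.0 if
 * the fitted line is already at or past the limit.
 */
double
seconds_until_limit(const double *t, const double *used, size_t n,
    double limit)
{
	double tm = 0.0, um = 0.0, sxx = 0.0, sxy = 0.0, eta;
	size_t i;

	assert(t != NULL && used != NULL);
	if (n < 2)
		return -1.0;

	/* Means first, so that large epoch timestamps do not lose precision. */
	for (i = 0; i < n; i++) {
		tm += t[i];
		um += used[i];
	}
	tm /= (double)n;
	um /= (double)n;

	for (i = 0; i < n; i++) {
		sxx += (t[i] - tm) * (t[i] - tm);
		sxy += (t[i] - tm) * (used[i] - um);
	}
	if (sxx == 0.0 || sxy <= 0.0)
		return -1.0;

	/* The line passes through (tm, um) with slope sxy / sxx. */
	eta = (tm - t[n - 1]) + (limit - um) * sxx / sxy;
	return eta > 0.0 ? eta : 0.0;
}
```

With daily samples of 70%, 71%, and 72% the fit gives one percent per day, so a limit of 100% comes out at about 28 days after the last sample; a warning rule might then be "notify if the result is non-negative and under three weeks".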