Hi. My name is Agent Conner. This is my partner, Agent Grosberg. [1] This is our story. (insert Dragnet Theme here)
At 2130 we received the call from John the paper millionaire of a dotcom. He has a problem. A computer problem. A major computer problem and he's calling in the experts. That's us.
It seems that there is a problem with his RAID system (Mark, I need details on the RAID system). Upon investigation it seems that the hardware is fine. It's the software that is a problem. Or rather, the operating system has a problem that leads to a corrupt file system.
Rule 1. Just because you have RAID doesn't mean your data can't get lost or corrupted.
The operating system in question is Microsoft Windows NT 4.0 Service Pack 3. There's a reason he's at Service Pack 3—it works with his RAID system, and that was hard enough to get running. His entire dotcom runs under NT. All his data, his critical data, relies upon Microsoft Windows NT to be stable.
Rule 2. No Fortune 100 Company uses Microsoft Windows NT for financial or critical applications. None.
Corollary 2: Microsoft is a Fortune 100 Company.
From our investigation we were able to ascertain that Microsoft Windows NT has a problem with filesystems that contain over four million files. John the paper millionaire of a dotcom has a filesystem with over four million files. John's data is slowly being corrupted.
Rule 3. See Rule 2.
John the paper millionaire of a dotcom now knows the difficulty of using Microsoft Windows NT for a critical application. But that still doesn't help him.
Any attempt to delete, copy, move or rename a corrupt file fails with a modal dialog box popping up informing the user that the operating system cannot delete, copy, move or rename said file. You have to click “OK” to make it go away.
Rule 4. Any software that requires user intervention can't be used in a server capacity.
The backup program John uses has failed multiple times in the face of said files. Therefore it is proving difficult to get a reliable backup of the four million plus files that John needs to run his business. Microsoft does have a patch available for said bug, but the time frame required to run CHKDSK is unacceptable, possibly taking up to four days to run.
Rule 5. Any backup software that cannot run in the face of errors (even if told to ignore said file and carry on) should not be used in a server capacity.
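For the record, the "ignore said file and carry on" behavior Rule 5 asks for can be sketched in a few lines of Python. This is an illustration only, not anything that was run on John's server. The names `backup_tree` and `opener` are hypothetical, and the sketch assumes each file fits in memory, so that a read that dies halfway never leaves a half-written member in the archive.

```python
import io
import os
import tarfile


def backup_tree(src_dir, archive_path, opener=open):
    """Archive every readable file under src_dir; skip the ones that fail.

    `opener` is a hypothetical hook (defaults to the built-in open) so that
    read failures can be simulated. Returns a list of (path, error) pairs
    for the files that were skipped.
    """
    skipped = []
    with tarfile.open(archive_path, "w") as tar:
        for root, _dirs, files in os.walk(src_dir):
            for name in files:
                path = os.path.join(root, name)
                try:
                    # Read the whole file BEFORE touching the archive, so an
                    # error partway through cannot corrupt the tar stream.
                    mtime = int(os.stat(path).st_mtime)
                    with opener(path, "rb") as fh:
                        data = fh.read()
                except OSError as exc:
                    # Note the failure and carry on: no dialog box, no abort.
                    skipped.append((path, str(exc)))
                    continue
                arcname = os.path.relpath(path, src_dir).replace(os.sep, "/")
                info = tarfile.TarInfo(name=arcname)
                info.size = len(data)
                info.mtime = mtime
                tar.addfile(info, io.BytesIO(data))
    return skipped
```

Calling `backup_tree(source, archive)` returns the list of skipped files, which is the part a human should read in the morning instead of clicking “OK” all night.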
We did manage to test the GNU tar program under Microsoft Windows NT and it carried on, ignoring the corrupt files. But there doesn't seem to be a way to actually reference the tape backup unit from the command line, and there is not enough free space to back up onto disk. And the number of corrupt files seems to be relatively few, about a hundred.
But since you can't delete, move, copy or rename the files, it's hard to work around them. Another method would be to put the RAID system into read-only mode and make a backup of it: swap drives in and out of the hot-swappable RAID system to build a backup set of drives with the data on it, set up a separate system with said RAID backup, and go from there. But we have to see what John's bosses say to that (John became a paper millionaire of a dotcom by having his dotcom bought out).
The case is still open …