Subject: RISKS DIGEST 10.33
REPLY-TO: risks@csl.sri.com

RISKS-LIST: RISKS-FORUM Digest  Friday 7 September 1990  Volume 10 : Issue 33

        FORUM ON RISKS TO THE PUBLIC IN COMPUTERS AND RELATED SYSTEMS
   ACM Committee on Computers and Public Policy, Peter G. Neumann, moderator

Contents:
  Critical military computer systems (Clifford Johnson)
  Complexity, reliability, and meaningless arguments (Nancy Leveson)
  Re: "Wild Failure Modes" in Analog Systems (Jan Wolitzky)
  Analog vs Digital Controls (Martin Ewing)
  Chaos (Peter da Silva)
  Re: Software bugs "stay fixed"? (Bruce Hamilton, K. M. Sandberg,
    Andrew Koenig, Michael Tanner)
  Boot camping (Timothy VanFosson)

The RISKS Forum is moderated.  Contributions should be relevant, sound, in
good taste, objective, coherent, concise, and nonrepetitious.  Diversity is
welcome.  CONTRIBUTIONS to RISKS@CSL.SRI.COM, with a relevant, substantive
"Subject:" line (otherwise they may be ignored).  REQUESTS to
RISKS-Request@CSL.SRI.COM.
TO FTP VOL i ISSUE j: ftp CRVAX.sri.com; login anonymous; give any non-null
password; cd sys$user2:[risks]; GET RISKS-i.j (j is TWO digits).  Vol
summaries are in risks-i.00 (j=0); "dir risks-*.*" gives a directory listing
of back issues.
ALL CONTRIBUTIONS ARE CONSIDERED AS PERSONAL COMMENTS; USUAL DISCLAIMERS APPLY.
The most relevant contributions may appear in the RISKS section of regular
issues of ACM SIGSOFT's SOFTWARE ENGINEERING NOTES, unless you state otherwise.

----------------------------------------------------------------------

Date: Fri, 7 Sep 90 10:19:42 PDT
From: "Clifford Johnson"
Subject: Critical military computer systems

The herein-debated list of critical computer applications in which reliance
on computers is to be avoided includes, re defense, mere early warning
systems.  Presumably, Space Command's rate of false alerts, and the Vincennes
shootdown, contribute to this opinion.  But an important nuance is neglected
in challenging the warning systems -- early warning is clearly beneficial;
problems arise only when an immediate ("use-or-lose") decision to retaliate
is contingent upon it.  Thus, it is really the de facto computerization of
decision-to-shoot procedures that is at fault, not the neutral
computerization of warning information.  And so I would not avoid early
warning systems, which can greatly assist in taking evasive or preparatory
actions, but would squarely challenge the computerization of command and
control systems.

The leading example of such damnably dangerous computerization is the
under-development, half-billion-dollar Rapid Execution And Combat Targeting
(REACT) system, which will enable virtually instantaneous launch of U.S.
ICBMs, within a couple of minutes, at all times.  This includes the
introduction of PCs into launch silos, which will automate launch-code
verification and provide some sort of direct electronic interface with the
missiles.  Besides actualizing launch-on-warning and sudden first-strike
capabilities, the implementation of REACT would seem to add to the risk of an
accidental launch, even without a flimsy attack warning.  (If launch codes
are received at the silos, standing orders require their immediate
execution...)

------------------------------

Date: Thu, 06 Sep 90 19:50:38 -0700
From: Nancy Leveson
Subject: Complexity, reliability, and meaningless arguments

To save my having to mail this information individually to the many people
who have asked: the next meeting of SC 167 (the RTCA committee rewriting
DO-178A) will be November 6-9 in Herndon, VA (outside of D.C.).
You can get on the mailing list for notification of meetings by calling the
RTCA (Radio Technical Commission for Aeronautics) at (202) 682-0266.

With regard to the complexity discussion, does the question of whether one
generic type of system is more complex or more reliable than another even
make sense?  The same function can be implemented in a simple or a "complex"
way using any generic type of component.

Consider Rube Goldberg's design for a "simplified" pencil sharpener.  It
starts with a string attached to a kite flying outside a window.  When the
window is opened, the string lifts the door on a cage filled with moths,
allowing them to escape and eat a red flannel shirt hanging above the cage.
As the weight of the shirt decreases, a shoe (attached to the top of the
shirt via a string through a pulley) becomes heavier than the shirt and
starts to move downward, flipping a power switch on.  When the power goes on,
an iron on top of some pants on an ironing board burns a hole in the pants,
creating smoke, which enters a hole in a tree trunk next to the ironing
board, smoking out an opossum, which jumps into a basket from a higher hole
in the tree, pulling a rope that lifts a cage door, allowing a woodpecker to
chew the wood from the pencil and expose the lead.  There is also an
emergency knife, which is always handy in case the opossum or the woodpecker
gets sick and can't work.

One could argue that Goldberg's simplified design has a larger number of
failure modes with a high probability of occurring, and therefore will be
less reliable than more traditionally designed pencil sharpeners.  However,
his design, although it may fail more often, has the backup knife, which
gives it a higher probability of leaving you with SOME way to get your pencil
sharpened (even if a cat comes in through the open window and distracts the
opossum and the woodpecker) than a traditional pencil sharpener without the
knife.  So it is not only the number and probability of the failure modes
that counts, but also the ways you have provided for coping with component
failure.

Consider also that a knife alone would be much more reliable than even a
regular pencil sharpener (especially one of the Ginzu knives that the TV
spokespeople tell me never get dull).  But it is definitely less safe in
terms of its potential for drawing blood.  So if safety rather than
reliability is your higher-priority goal ...
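To make the backup-knife arithmetic concrete, here is a minimal sketch in C.
The failure probabilities are invented purely for illustration, and the
calculation assumes the knife fails independently of the sharpener:

    #include <stdio.h>

    int main(void)
    {
        /* Invented illustrative numbers, not measurements. */
        double p_goldberg_fails = 0.30;  /* unreliable primary (Goldberg design)  */
        double p_standard_fails = 0.05;  /* reliable primary (ordinary sharpener) */
        double p_knife_fails    = 0.02;  /* backup knife                          */

        /* Probability that the pencil still gets sharpened. */
        double goldberg_with_knife = 1.0 - p_goldberg_fails * p_knife_fails;
        double standard_alone      = 1.0 - p_standard_fails;

        printf("Goldberg design + knife : %.4f\n", goldberg_with_knife); /* 0.9940 */
        printf("Ordinary sharpener alone: %.4f\n", standard_alone);      /* 0.9500 */
        return 0;
    }

Under these made-up numbers the much less reliable Goldberg design, backed by
the knife, leaves you able to sharpen the pencil more often than the better
sharpener with no backup at all -- provision for coping with failure can
outweigh raw component reliability.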
When comparing the reliability and safety of mechanical/analog systems and
digital systems, you need to consider:

1) Confidence, and the ability to measure or assess reliability and safety in
   our systems, may be more important than other factors.  I would prefer to
   design critical systems with components having known failure modes and
   failure rates than with components that MIGHT have lower failure rates but
   might also have higher ones, with NO way for me to determine this with
   high confidence.

2) Analog and mechanical designs are often reused and perfected over long
   periods of time.  Not only does this tend to eliminate design errors, it
   also allows for high confidence in the failure rates and the projected
   failure modes.  Do failure modes occasionally pop up that were not
   expected?  Sure, but the alternatives are worse.

3) Wearout failures are much easier to detect and protect against (e.g.,
   simple redundancy usually provides adequate protection) than design errors
   that result in erroneous answers.

4) Tools and methods for building systems reliably and safely may be as
   important as other factors.  For example, system safety engineers have
   many time-honed procedures for assessing and enhancing safety in
   analog/mechanical systems, but few of these have been extended to digital
   systems.  The same applies to mechanical engineers, and they tend to be
   trained in using these procedures.

5) Because it is (seemingly) easy to provide a great deal of functionality
   with little increase in cost or trouble, digital components tend to have
   greater functionality demanded of them (that is the usual argument for
   replacing mechanical/analog devices).  This increases the probability of
   design errors.

6) ... [lots of other complicating factors]

------------------------------

Date: Fri, 7 Sep 90 10:02 EDT
From: wolit@mhuxd.att.com
Subject: Re: "Wild Failure Modes" in Analog Systems (Hoover, RISKS-10.31)

Might as well carry this nit-picking one level further.  As long as your
computer's transistors, capacitors, or whatever rely on electrons, photons,
or other quantum-mechanical wave/particles with discrete states, you are
justified in considering them to be digital.  But this is all silly -- the
implementation is irrelevant.  If you can treat the computer as a black box
that behaves digitally, why not label it as such?

Jan Wolitzky, AT&T Bell Labs, Murray Hill, NJ; 201 582-2998; att!mhuxd!wolit
(Affiliation given for identification purposes only)

------------------------------

Date: Thu, 6 Sep 90 22:27 EDT
From: Martin Ewing
Subject: Analog vs Digital Controls

Analog controls are not really the opposite of digital.  The main difference
is that digital logic often uses saturated transistors and obscure data
coding as a representation, or analog, of a physical parameter.  Digital
systems do tend to use an enormous number of transistors for even the
simplest operations, but they are integrated into a manageable number of
chips.

Analog systems are plagued by poor gain calibration, temperature drift,
nonlinearities, and noise.  Nonlinearities can result in saturation and
"latch-up" behavior.  AC systems suffer from crosstalk, parasitic
oscillations, and lots of other ills.  A component failure can easily produce
as drastic a change in output as a digital failure might.

The "advantage" of analog systems is that they don't have software.  However,
they do have all the troubles listed above, which tend to limit
functionality.  They also have circuit designers instead of programmers.

The safest control systems are passive ones, which use no analogs at all:
reactors that become less reactive at high temperatures, and aircraft that
fly themselves with no control forces.

Martin Ewing, 203-432-4243, Ewing@Yale.Edu
Yale University Science & Engineering Computing Facility

------------------------------

Date: Thu Sep  6 22:33:54 1990
From: peter@ficc.ferranti.com (Peter da Silva)
Subject: Chaos

> Thus the possible range of catastrophic effects are inherently greater in
> digital as opposed to analog systems.

Like the Tacoma Narrows bridge?

Peter da Silva.  +1 713 274 5180.  peter@ferranti.com

------------------------------

Date: Thu, 6 Sep 1990 19:09:43 PDT
From: Bruce_Hamilton.OSBU_South@Xerox.com
Subject: Re: Software bugs "stay fixed"? (RISKS-10.31, Parnas RISKS-10.32)

Re: "My perception is that they stay fixed, if they were actually fixed."

A nontrivial portion of the bugs we encounter in building and testing our
large systems are INTEGRATION (system-building) errors, in which the wrong
version of some piece of software was included.  Coding errors are only HALF
the reason for regression testing.
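For illustration, a minimal regression test that pins a previously fixed bug
might look like the following sketch in C (the leap-year routine and its
history are hypothetical).  Either cause of recurrence -- the fix being
edited away, or a stale version of the routine being linked back into the
build -- then shows up as a test failure at build time rather than in the
field:

    #include <assert.h>
    #include <stdio.h>

    /* Hypothetical routine whose leap-year handling was once broken and
       then fixed; the tests below pin the fixed behavior. */
    static int days_in_february(int year)
    {
        int leap = (year % 4 == 0 && year % 100 != 0) || (year % 400 == 0);
        return leap ? 29 : 28;
    }

    int main(void)
    {
        assert(days_in_february(1900) == 28); /* the old bug: 1900 is NOT a leap year */
        assert(days_in_february(2000) == 29);
        assert(days_in_february(1987) == 28);
        printf("regression tests passed\n");
        return 0;
    }

Run as part of every system build, such tests catch both coding regressions
and integration (wrong-version) regressions.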
Bruce       BHamilton.osbuSouth@Xerox.COM       213/333-8075

------------------------------

Date: 7 Sep 90 11:37:16 GMT
From: sandberg@ipla01.hac.com (K. M. Sandberg)
Subject: Re: Software bugs "stay fixed"? (Parnas, RISKS-10.31)

One problem is that sometimes the source code is not managed properly, and
code that contains the bug is reintroduced when fixing another bug.  It is
also possible that the buggy code was "shared" and used in other programs or
subroutines, or that the logic that caused the bug is still in the
programmer's head.  Major updates to the code could also lead to the
reintroduction of the bug for several reasons, including someone removing the
fix because it seems not to be needed (lack of comments?).

In other words, there are many things that could cause a bug to reappear even
when it really was fixed.  This is the real world, where anything is possible
(remember Murphy's Law).

Kemasa.

------------------------------

Date: Fri, 7 Sep 90 09:29:12 EDT
From: ark@research.att.com
Subject: Re: Software bugs "stay fixed"? (RISKS-10.31)

I have had more experiences than I care to think about in which bugs were
fixed, and fixed correctly, but then somehow the wrong version of the program
was sent to the user.

My `debugging rule number 0' is: before you go looking for a bug, make sure
the program you're looking at is the one you're running.  You'd be amazed how
many bugs have disappeared that way.
						--Andrew Koenig

------------------------------

Date: Fri, 7 Sep 90 09:35:41 -0400
From: mtanner@gmuvax2.gmu.edu (Michael Tanner)
Subject: Re: Software bugs "stay fixed"? (Parnas, RISKS-10.32)

In practice the following occurs:

1. Programmer A fixes a bug.  Some time later, programmer B is given the same
   software to fix a different bug or otherwise make changes.  He sees some
   extraneous code he doesn't understand, doesn't see how it could work, or
   whatever, and in an attempt to clean up the program deletes or changes it.
   This turns out to be programmer A's bug fix, and the old bug is
   reintroduced.  Or,

2. Large systems get rebuilt occasionally, sometimes with old versions of
   some routines, thus reintroducing old "fixed" bugs.

Users are accustomed to seeing old bugs resurface, and programmers often find
the above scenarios to be the cause.  Maybe good software practice would
prevent it, but it does happen.
						-- mike

Michael C. Tanner, Dept. of Computer Science, George Mason University

------------------------------

Date: Fri, 7 Sep 90 14:49:11 GMT
From: Timothy VanFosson
Subject: Boot camping (Ultrix, Wortman, RISKS-10.32)

I too had a similar problem, because of *my* fit of tidiness.  Although my
machine (a VS3100) would boot, certain login ids would be required to go
through the login process (Xprompter) two or three times before they would
actually work.  I know this is true because my id was one of them.

I guess an added risk of the situation is that you may go crazy trying to
remember your last three months' worth of passwords before you figure out
that it is an OS problem :-).

Timothy VanFosson, Senior Systems Analyst, University of Iowa
CAD-Research, 228 ERF, Iowa City, Iowa 52242     Phone: (319) 335-5728

------------------------------

End of RISKS-FORUM Digest 10.33
************************