Subject: RISKS DIGEST 10.33
REPLY-TO: risks@csl.sri.com

RISKS-LIST: RISKS-FORUM Digest  Friday 7 September 1990  Volume 10 : Issue 33

        FORUM ON RISKS TO THE PUBLIC IN COMPUTERS AND RELATED SYSTEMS
   ACM Committee on Computers and Public Policy, Peter G. Neumann, moderator

Contents:
  Critical military computer systems (Clifford Johnson)
  Complexity, reliability, and meaningless arguments (Nancy Leveson)
  Re: "Wild Failure Modes" in Analog Systems (Jan Wolitzky)
  Analog vs Digital Controls (Martin Ewing)
  Chaos (Peter da Silva)
  Re: Software bugs "stay fixed"? (Bruce Hamilton, K. M. Sandberg,
    Andrew Koenig, Michael Tanner)
  Boot camping (Timothy VanFosson)

The RISKS Forum is moderated.  Contributions should be relevant, sound, in
good taste, objective, coherent, concise, and nonrepetitious.  Diversity is
welcome.  CONTRIBUTIONS to RISKS@CSL.SRI.COM, with a relevant, substantive
"Subject:" line (otherwise they may be ignored).  REQUESTS to
RISKS-Request@CSL.SRI.COM.
TO FTP VOL i ISSUE j: ftp CRVAX.sri.com; login anonymous; give any non-null
password; cd sys$user2:[risks]; GET RISKS-i.j (j is TWO digits).  Vol
summaries are in risks-i.00 (j=0); "dir risks-*.*" gives a directory listing
of back issues.
ALL CONTRIBUTIONS ARE CONSIDERED AS PERSONAL COMMENTS; USUAL DISCLAIMERS APPLY.
The most relevant contributions may appear in the RISKS section of regular
issues of ACM SIGSOFT's SOFTWARE ENGINEERING NOTES, unless you state otherwise.

----------------------------------------------------------------------

Date: Fri, 7 Sep 90 10:19:42 PDT
From: "Clifford Johnson"
Subject: Critical military computer systems

The herein-debated list of critical computer applications in which reliance
on computers is to be avoided includes, re defense, mere early warning
systems.  Presumably, Space Command's rate of false alerts, and the Vincennes
shootdown, contribute to this opinion.  But an important nuance is neglected
in challenging the warning systems -- early warning is clearly beneficial;
problems arise only when an immediate ("use-or-lose") decision to retaliate
is contingent upon it.  Thus, it is really the de facto computerization of
decision-to-shoot procedures that is at fault, not the neutral
computerization of warning information.  And so I would not avoid early
warning systems, which can greatly assist in taking evasive or preparatory
actions, but would squarely challenge the computerization of command and
control systems.

The leading example of such damnably dangerous computerization is the
under-development, half-billion-dollar Rapid Execution And Combat Targeting
(REACT) system, which will enable virtually instantaneous launch of U.S.
ICBMs, within a couple of minutes, at all times.  This includes the
introduction of PCs into launch silos, which will automate launch-code
verification and provide some sort of direct electronic interface with the
missiles.  Besides actualizing launch-on-warning and sudden first-strike
capabilities, the implementation of REACT would seem to add to the risk of an
accidental launch, even without a flimsy attack warning.  (If launch codes
are received at the silos, standing orders require their immediate
execution...)

------------------------------

Date: Thu, 06 Sep 90 19:50:38 -0700
From: Nancy Leveson
Subject: Complexity, reliability, and meaningless arguments

To save my having to mail this information individually to the many people
who have asked: the next meeting of SC 167 (the RTCA committee rewriting
DO-178A) will be November 6-9 in Herndon, VA (outside of D.C.).
You can get on the mailing list for notification of meetings by calling the
RTCA (Radio Technical Commission for Aeronautics) at (202) 682-0266.

With regard to the complexity discussion, does the question of whether one
generic type of system is more complex or more reliable than another even
make sense?  The same function can be implemented in a simple or a "complex"
way using any generic type of component.

Consider Rube Goldberg's design for a "simplified" pencil sharpener.  It
starts with a string attached to a kite flying outside a window.  When the
window is opened, the string lifts the door on a cage filled with moths,
allowing them to escape and eat a red flannel shirt hanging above the cage.
As the weight of the shirt decreases, a shoe (attached to the top of the
shirt via a string through a pulley) becomes heavier than the shirt and
starts to move downward, flipping a power switch on.  When the power goes on,
an iron on top of some pants on an ironing board burns a hole in the pants,
creating smoke, which enters a hole in a tree trunk next to the ironing
board, smoking out an opossum, which jumps into a basket from a higher hole
in the tree, pulling a rope that lifts a cage door, allowing a woodpecker to
chew the wood from the pencil and expose the lead.  There is also an
emergency knife, which is always handy in case the opossum or the woodpecker
gets sick and can't work.

One could argue that Goldberg's simplified design has a larger number of
failure modes with a high probability of occurring, and therefore will be
less reliable than more traditionally designed pencil sharpeners.  However,
his design, although it may fail more often, has the backup knife, which
gives it a higher probability of leaving you with SOME way to get your pencil
sharpened (even if a cat comes in through the open window and distracts the
opossum and the woodpecker) than a traditional pencil sharpener without the
knife.  So it is not only the number and probability of the failure modes
that counts, but also the ways you have provided for coping with component
failure.

Consider also that a knife alone would be much more reliable than even a
regular pencil sharpener (especially one of the Ginzu knives that the TV
spokespeople tell me never get dull).  But it is definitely less safe in
terms of its potential for drawing blood.  So if safety rather than
reliability is your higher-priority goal ...
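To make the backup-knife arithmetic concrete, here is a minimal sketch in C.
The failure probabilities are invented purely for illustration, and the
calculation assumes the knife fails independently of the sharpener:

    #include <stdio.h>

    int main(void)
    {
        /* Invented illustrative numbers, not measurements. */
        double p_goldberg_fails = 0.30;  /* unreliable primary (Goldberg design)  */
        double p_standard_fails = 0.05;  /* reliable primary (ordinary sharpener) */
        double p_knife_fails    = 0.02;  /* backup knife                          */

        /* Probability that the pencil still gets sharpened. */
        double goldberg_with_knife = 1.0 - p_goldberg_fails * p_knife_fails;
        double standard_alone      = 1.0 - p_standard_fails;

        printf("Goldberg design + knife : %.4f\n", goldberg_with_knife); /* 0.9940 */
        printf("Ordinary sharpener alone: %.4f\n", standard_alone);      /* 0.9500 */
        return 0;
    }

Under these made-up numbers the much less reliable Goldberg design, backed by
the knife, leaves you able to sharpen the pencil more often than the better
sharpener with no backup at all -- provision for coping with failure can
outweigh raw component reliability.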
When comparing the reliability and safety of mechanical/analog systems and
digital systems, you need to consider:

1) Confidence, and the ability to measure or assess reliability and safety in
   our systems, may be more important than other factors.  I would prefer to
   design critical systems with components having known failure modes and
   failure rates than with components that MIGHT have lower failure rates but
   might also have higher ones, with NO way for me to determine this with
   high confidence.

2) Analog and mechanical designs are often reused and perfected over long
   periods of time.  Not only does this tend to eliminate design errors, it
   also allows for high confidence in the failure rates and the projected
   failure modes.  Do failure modes occasionally pop up that were not
   expected?  Sure, but the alternatives are worse.

3) Wearout failures are much easier to detect and protect against (e.g.,
   simple redundancy usually provides adequate protection) than design errors
   that result in erroneous answers.

4) Tools and methods for building systems reliably and safely may be as
   important as other factors.  For example, system safety engineers have
   many time-honed procedures for assessing and enhancing safety in
   analog/mechanical systems, but few of these have been extended to digital
   systems.  The same applies to mechanical engineers, and they tend to be
   trained in using these procedures.

5) Because it is (seemingly) easy to provide a great deal of functionality
   with little increase in cost or trouble, digital components tend to have
   greater functionality demanded of them (that is the usual argument for
   replacing mechanical/analog devices).  This increases the probability of
   design errors.

6) ... [lots of other complicating factors]

------------------------------

Date: Fri, 7 Sep 90 10:02 EDT
From: wolit@mhuxd.att.com
Subject: Re: "Wild Failure Modes" in Analog Systems (Hoover, RISKS-10.31)

Might as well carry this nit-picking one level further.  As long as your
computer's transistors, capacitors, or whatever rely on electrons, photons,
or other quantum-mechanical wave/particles with discrete states, you are
justified in considering them to be digital.  But this is all silly -- the
implementation is irrelevant.  If you can treat the computer as a black box
that behaves digitally, why not label it as such?

Jan Wolitzky, AT&T Bell Labs, Murray Hill, NJ; 201 582-2998; att!mhuxd!wolit
(Affiliation given for identification purposes only)

------------------------------

Date: Thu, 6 Sep 90 22:27 EDT
From: Martin Ewing
Subject: Analog vs Digital Controls

Analog controls are not really the opposite of digital.  The main difference
is that digital logic often uses saturated transistors and obscure data
coding as a representation, or analog, of a physical parameter.  Digital
systems do tend to use an enormous number of transistors for even the
simplest operations, but they are integrated into a manageable number of
chips.

Analog systems are plagued by poor gain calibration, temperature drift,
nonlinearities, and noise.  Nonlinearities can result in saturation and
"latch-up" behavior.  AC systems suffer from crosstalk, parasitic
oscillations, and lots of other ills.  A component failure can easily produce
as drastic a change in output as a digital failure might.

The "advantage" of analog systems is that they don't have software.  However,
they do have all the troubles listed above, which tend to limit
functionality.  They also have circuit designers instead of programmers.

The safest control systems are passive ones, which use no analogs at all:
reactors that become less reactive at high temperatures, and aircraft that
fly themselves with no control forces.

Martin Ewing, 203-432-4243, Ewing@Yale.Edu
Yale University Science & Engineering Computing Facility

------------------------------

Date: Thu Sep  6 22:33:54 1990
From: peter@ficc.ferranti.com (Peter da Silva)
Subject: Chaos

> Thus the possible range of catastrophic effects are inherently greater in
> digital as opposed to analog systems.

Like the Tacoma Narrows bridge?

Peter da Silva.  +1 713 274 5180.  peter@ferranti.com

------------------------------

Date: Thu, 6 Sep 1990 19:09:43 PDT
From: Bruce_Hamilton.OSBU_South@Xerox.com
Subject: Re: Software bugs "stay fixed"? (RISKS-10.31, Parnas RISKS-10.32)

Re: "My perception is that they stay fixed, if they were actually fixed."

A nontrivial portion of the bugs we encounter in building and testing our
large systems are INTEGRATION (system-building) errors, in which the wrong
version of some piece of software was included.  Coding errors are only HALF
the reason for regression testing.
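For illustration, a minimal regression test that pins a previously fixed bug
might look like the following sketch in C (the leap-year routine and its
history are hypothetical).  Either cause of recurrence -- the fix being
edited away, or a stale version of the routine being linked back into the
build -- then shows up as a test failure at build time rather than in the
field:

    #include <assert.h>
    #include <stdio.h>

    /* Hypothetical routine whose leap-year handling was once broken and
       then fixed; the tests below pin the fixed behavior. */
    static int days_in_february(int year)
    {
        int leap = (year % 4 == 0 && year % 100 != 0) || (year % 400 == 0);
        return leap ? 29 : 28;
    }

    int main(void)
    {
        assert(days_in_february(1900) == 28); /* the old bug: 1900 is NOT a leap year */
        assert(days_in_february(2000) == 29);
        assert(days_in_february(1987) == 28);
        printf("regression tests passed\n");
        return 0;
    }

Run as part of every system build, such tests catch both coding regressions
and integration (wrong-version) regressions.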
Bruce       BHamilton.osbuSouth@Xerox.COM       213/333-8075

------------------------------

Date: 7 Sep 90 11:37:16 GMT
From: sandberg@ipla01.hac.com (K. M. Sandberg)
Subject: Re: Software bugs "stay fixed"? (Parnas, RISKS-10.31)

One problem is that sometimes the source code is not managed properly, and
code that contains the bug is reintroduced when fixing another bug.  It is
also possible that the buggy code was "shared" and used in other programs or
subroutines, or that the logic that caused the bug is still in the
programmer's head.  Major updates to the code could also lead to the
reintroduction of the bug for several reasons, including someone removing the
fix because it seems not to be needed (lack of comments?).

In other words, there are many things that could cause a bug to reappear even
when it really was fixed.  This is the real world, where anything is possible
(remember Murphy's Law).

Kemasa.

------------------------------

Date: Fri, 7 Sep 90 09:29:12 EDT
From: ark@research.att.com
Subject: Re: Software bugs "stay fixed"? (RISKS-10.31)

I have had more experiences than I care to think about in which bugs were
fixed, and fixed correctly, but then somehow the wrong version of the program
was sent to the user.

My `debugging rule number 0' is: before you go looking for a bug, make sure
the program you're looking at is the one you're running.  You'd be amazed how
many bugs have disappeared that way.
						--Andrew Koenig

------------------------------

Date: Fri, 7 Sep 90 09:35:41 -0400
From: mtanner@gmuvax2.gmu.edu (Michael Tanner)
Subject: Re: Software bugs "stay fixed"? (Parnas, RISKS-10.32)

In practice the following occurs:

1. Programmer A fixes a bug.  Some time later, programmer B is given the same
   software to fix a different bug or otherwise make changes.  He sees some
   extraneous code he doesn't understand, doesn't see how it could work, or
   whatever, and in an attempt to clean up the program deletes or changes it.
   This turns out to be programmer A's bug fix, and the old bug is
   reintroduced.  Or,

2. Large systems get rebuilt occasionally, sometimes with old versions of
   some routines, thus reintroducing old "fixed" bugs.

Users are accustomed to seeing old bugs resurface, and programmers often find
the above scenarios to be the cause.  Maybe good software practice would
prevent it, but it does happen.
						-- mike

Michael C. Tanner, Dept. of Computer Science, George Mason University

------------------------------

Date: Fri, 7 Sep 90 14:49:11 GMT
From: Timothy VanFosson
Subject: Boot camping (Ultrix, Wortman, RISKS-10.32)

I too had a similar problem, because of *my* fit of tidiness.  Although my
machine (a VS3100) would boot, certain login ids would be required to go
through the login process (Xprompter) two or three times before they would
actually work.  I know this is true because my id was one of them.

I guess an added risk of the situation is that you may go crazy trying to
remember your last three months' worth of passwords before you figure out
that it is an OS problem :-).

Timothy VanFosson, Senior Systems Analyst, University of Iowa
CAD-Research, 228 ERF, Iowa City, Iowa 52242     Phone: (319) 335-5728

------------------------------

End of RISKS-FORUM Digest 10.33
************************