Subject: RISKS DIGEST 10.31
REPLY-TO: risks@csl.sri.com

RISKS-LIST: RISKS-FORUM Digest  Wednesday 5 September 1990  Volume 10 : Issue 31

        FORUM ON RISKS TO THE PUBLIC IN COMPUTERS AND RELATED SYSTEMS
   ACM Committee on Computers and Public Policy, Peter G. Neumann, moderator

Contents:
  March 1989 British Rail Train Crash (Brian Randell)
  Complexity, safety and computers (Martyn Thomas)
  Software bugs "stay fixed"? (Martyn Thomas)
  Re: Stonefish mine (Mark Lomas, Bill Davidsen, Bill Ricker)
  Reply to "Computer Unreliability" Stars vs Selves (Dave Davis)
  "Wild Failure Modes" in Analog Systems (Jim Hoover, Richard D. Dean, Will Martin, Pete Mellor)

The RISKS Forum is moderated.  Contributions should be relevant, sound, in good taste, objective, coherent, concise, and nonrepetitious.  Diversity is welcome.  CONTRIBUTIONS to RISKS@CSL.SRI.COM, with relevant, substantive "Subject:" line (otherwise they may be ignored).  REQUESTS to RISKS-Request@CSL.SRI.COM.
TO FTP VOL i ISSUE j: ftp CRVAX.sri.com, login anonymous, password AnyNonNullPW, cd sys$user2:[risks], GET RISKS-i.j; j is TWO digits.  Vol summaries in risks-i.00 (j=0); "dir risks-*.*" gives directory listing of back issues.
ALL CONTRIBUTIONS ARE CONSIDERED AS PERSONAL COMMENTS; USUAL DISCLAIMERS APPLY.  The most relevant contributions may appear in the RISKS section of regular issues of ACM SIGSOFT's SOFTWARE ENGINEERING NOTES, unless you state otherwise.

----------------------------------------------------------------------

Date: Wed, 5 Sep 90 13:50:34 BST
From: Brian Randell
Subject: March 1989 British Rail Train Crash

[Yesterday's Independent carried a number of articles related to the ending of the trial of a British Rail train driver in connection with a major train accident that occurred on 4 March 1989 in south London.  There was a fairly lengthy front-page article, and two further articles taking up a complete half-page.  I have selected from these some paragraphs which should be of interest to RISKS readers.
Brian Randell, Computing Laboratory, University of Newcastle upon Tyne, UK   PHONE = +44 91 222 7923    [Further excerpted by PGN]]

A train driver who admitted passing through a red light and causing the Purley rail crash, in which five people died and 87 were injured, was jailed yesterday.  Robert Morgan, 47, was sentenced to 18 months' imprisonment, 12 months of which was suspended, after pleading guilty to two charges of manslaughter.  The sentence drew strong criticism from the rail unions.  The Old Bailey was told Morgan received two commendations in a previously exemplary 23 years as a train driver. [...]

Julian Bevan, for the prosecution, told the Old Bailey that Morgan was in hospital with face and neck injuries a few hours after the crash when he said he had jumped the red light at about 70mph.  The track limit was 90mph.  Describing the safety system thrown into question by the crash, Mr Bevan said drivers were given two amber warning signals before coming to a red light.  Each is accompanied by a klaxon sounding in the cab.  If the driver fails to switch it off, the brakes are applied automatically within three seconds, he told the court.  Morgan, a single man, of Ferring, West Sussex, admitted that he must have switched off the klaxon each time, but the memory loss he suffered prevented him from being more precise. [...]

Robert Morgan, a driver since 1966, had been well warned by the signalling system that there was going to be a train ahead of him.  What went wrong?
For every signal a driver passes, the system provides him with an instantly recognisable set of acknowledgements.  Every time a signal is passed at green, a bell rings in the cab indicating the line ahead is clear.  If the signal is at double amber, or amber, or red, a klaxon sounds and has to be acknowledged by the driver.  He has three seconds to do this and if he does not press the button, the brakes start to apply and are fully operational in five seconds.  However, he can override this system.  It is a weakness in the system that BR has now recognised.

The Purley crash came just three months after the disaster at Clapham where 35 people were killed when a signal failure resulted in a commuter express ramming another stationary rush hour train.  The real cause, the public inquiry said, was too much repetitive, painstaking work, not enough time off - lack of supervision and improper testing procedures among technicians completing resignalling work in the area.  Since Purley, and acting on the recommendations of the Hidden Report into the Clapham disaster, BR has been overhauling its approach to safety.

BR admits that prior to the Clapham crash its approach to safety was equipment-based.  It reasoned that if the equipment and the rules designed to protect it worked, then the safety of staff and passengers was assured.  What happened at Clapham and to Robert Morgan at Purley showed that approach to be inadequate.  In coming to terms with human error, BR has introduced its new Safety Management Programme.

Potential train drivers already undergo extensive psychological as well as practical testing, to ensure they are suited to working in a highly disciplined atmosphere.  However, after work with Professor James Reason of the University of Manchester, a specialist in risk analysis, BR recognises that regardless of personality, all human behaviour is inherently quirky in increasingly repetitive circumstances.  It understands that drivers can get into a "mind set" where they believe they have completed a task, or recognised a signal, when they have not.  In that mental state, a driver could cancel a warning horn, not realise he had done so and plough on to disaster.

BR has admitted that the chances of an equipment failure being the sole cause of an accident have been all but engineered out of the safety equation, and that one of the biggest risks to passengers is drivers passing signals at danger.  It happens on average between 20 and 30 times a year, and each incident is investigated. [...]

------------------------------

Date: Wed, 5 Sep 90 10:57:39 BST
From: Martyn Thomas
Subject: Complexity, safety and computers

In RISKS (10.29), David Gillespie writes:

: I think one point that a lot of people have been glossing over is that in a
: very real sense, computers themselves are *not* the danger in large,
: safety-critical systems.  The danger is in the complexity of the system itself

I agree.  Often, people talk about "the software reliability problem" when actually the problem is the difficulty of getting complex designs right, and the impossibility of guaranteeing that any residual errors will cause the design to fail less frequently than (some very low probability of failure).

There is, of course, the related problem of what we mean by "getting the design right" and "failure".  In general, these can only be defined with hindsight - we recognise that the system has entered a state which we wish it hadn't, and we define that as failure.
We cannot (usually) guarantee that we have defined all safe states, or all hazardous states, in advance.

This is seen as a "software problem" because we *choose* to put most of the system complexity into the software, as a sensible design decision.  Recently, I have started to wonder if some of our difficulties are exacerbated by this decision.

Software is digital (at the moment, at least).  Yet many safety-critical systems involve monitoring analog signals and driving actuators which cause analog activity in the controlled system (for example, monitoring airspeed and driving the elevators of an aircraft).  At some point in the system, the analog signal is digitised - generally before any computation is performed on it.  Then the digital outputs are reconverted to analog.

The question I would ask is: are we making our systems significantly more complex by converting to digital too soon (or at all)?  Would the system complexity be reduced if, instead of converting to digital so that we can use a commercial microprocessor, we processed the signals as analog signals, using an application-specific integrated circuit (ASIC) and only converted to digital where there is a clear reduction in complexity from doing so?

This is a serious question: latest technology allows mixed analog-digital ASICs, and the cost and time to produce an ASIC is competitive with the cost and time to produce the software and circuit board for a microprocessor system - and the technology is moving so that economics increasingly favour the use of ASICs.  You can have (some of) your favourite microprocessors on-chip, too.

To summarise: the issue is system complexity - safety is related (probably exponentially) to the inverse of complexity (if only we could measure it) - so reducing complexity is the key to increasing safety; can we make progress by exploiting analog techniques?

--
Martyn Thomas, Praxis plc, 20 Manvers Street, Bath BA1 1PX UK.
Tel: +44-225-444700.   Email: mct@praxis.co.uk

------------------------------

Date: Wed, 5 Sep 90 13:44:05 +0100
From: Martyn Thomas
Subject: software bugs "stay fixed"?

In RISKS 10.30, Robert L. Smith writes:

... "... the reliability advantage software has over hardware and people system components, which is that once a software bug is truly fixed, it stays fixed!  In contrast consider the many times you repair hardware only to see it fail again from the same cause ..."

I don't know how you define "truly fixed" (unless it means that the bug doesn't recur - in which case the claim that it stays fixed is tautological!).  In my experience, software bugs are often reintroduced (which is why regression testing is important).  This source of problems is probably only surpassed by the number of *new* errors introduced while fixing old ones.

The problem of re-assuring software after "maintenance" is as hard as the problem of assuring it in the first place - while the industry practices are probably worse, and the regulatory control is certainly worse.  Experience with "software rot" in past systems suggests that we may well see accidents caused by "faulty maintenance" in growing numbers over the next few years.  I predict that the individual staff will be blamed, rather than the whole regulatory structure (whereas a major accident caused by an ab initio design error would raise the question of how the error managed to get through the certification process).  Somehow, "maintenance errors" sound less threatening, possibly because they sound as though they only apply to a single system.
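[As a purely illustrative sketch of the regression-testing point (the function, its history, and the numbers are invented for this example and are not taken from any system discussed in this issue), a bug that has been "truly fixed" stays fixed only for as long as a test keeps pinning the corrected behaviour down after every subsequent maintenance change:

    # regression_sketch.py -- hypothetical example: a fix survives later
    # "maintenance" only if a regression test keeps checking for it.

    def acknowledged_in_time(seconds_since_warning, limit=3.0):
        """Return True if a warning was acknowledged within the limit.

        The original (hypothetical) fault used '<' instead of '<=', so an
        acknowledgement at exactly the limit was wrongly rejected.  The fix
        uses '<='; the test below pins that behaviour so a later edit cannot
        silently reintroduce the old fault.
        """
        return seconds_since_warning <= limit

    def test_acknowledgement_at_exact_limit():
        # Regression test for the hypothetical boundary fault described above.
        assert acknowledged_in_time(3.0)
        assert not acknowledged_in_time(3.1)

    if __name__ == "__main__":
        test_acknowledgement_at_exact_limit()
        print("regression test passed")

Re-running such tests after every change is what catches a reintroduced bug before it reaches the field.]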
------------------------------

Date: Wed, 5 Sep 90 12:19:30 +0100
From: tmal@computer-lab.cambridge.ac.uk
Subject: Re: Stonefish mine

In RISKS DIGEST 10.30 Chaz Heritage wrote in reply to a message from Pete Mellor:

> > 4. Does Stonefish rely on some sort of sonar transponder
> > to distinguish friend from foe?
>
> I imagine not, since if the mine were to transmit sonar in order to trigger
> transponders located on friendly ships then it would render the mine very
> susceptible to detection and countermeasures.

I don't know whether Stonefish is able to trigger transponders on detected ships but let us assume that it can.  We already know that Stonefish performs pattern recognition on passing ships to distinguish friend from foe.  There are also some types of hostile ships that should not be attacked; for instance, we would like the mine to remain undetected as a minesweeper passes.

If the mine has already decided that a ship should not be attacked, because it has been deemed friendly or a hostile minesweeper, then it need not trigger the transponder.  Only if it has already decided to attack a ship would it need to confirm its decision and so try to trigger the transponder.  If there is no response then the mine intends to explode and so will almost certainly be detected very shortly afterwards.  The decrease in risk to friendly shipping may make such behaviour worthwhile; the additional warning that a foe would receive would be of the order of the round-trip time for the message pair.

Mark Lomas (tmal@cl.cam.ac.uk)

------------------------------

Date: Wed, 5 Sep 90 10:51:03 EDT
From: davidsen@crdos1.crd.ge.com (bill davidsen)
Subject: Re: Stonefish mine

| From: chaz heritage:wgc1:RX
| Date: 3-September-90 (Monday) 4:34:36 PDT
|
| > 6. The sophistication of Stonefish's recognition system argues for some kind
| of artificial intelligence.  If it's that smart, would it know who was
| winning and change sides accordingly?<
|
| Personally I wouldn't consider Stonefish to be an AI.  I don't think the
| problem posed is much of a risk to Stonefish operators....
| If, on the other hand, Carlos Cardoen is telling fibs (which would not
| perhaps be entirely out of character) then it's possible that he's sold the
| Iraqis a few Stonefish already.  If so, it seems unlikely to me that they'll
| work properly without reprogramming for the target signatures of US and UK
| shipping.

Here's a real risk of software... after the mines are reprogrammed, how would you like to be the first one to run a ship over one to verify that they are ignoring "friendlies"?  Since Iraq doesn't have enough ships to worry about this, they don't have the problem, but if they blew the bottom out of a tanker they might really shut off the flow of oil.

I believe the mines huddle on the bottom and wait until they detect a target close enough to be damaged, then pop to the surface.  Somewhat like a "Bouncing Betty" mine, for those of us old enough to remember.

bill davidsen (davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
    VMS is a text-only adventure game.  If you win you can use unix.

------------------------------

Date: Wed, 5 Sep 90 18:06:16 GMT
From: wdr@wang.com (William Ricker)
Subject: S-W controlled mine Risks to Aircraft carriers (Re: Stonefish)

I enjoyed the speculation "From Channel 4 news last night (Tue. 28th Aug)" about a software-controlled mine.
However, after recent discussions in the sci.military / military-request@att.att.com list/group (initiated after I repeated something I heard a US Admiral say on CBC, repeated over US NPR), I must quickly comment on:

In comp.risks, p.mellor@uk.ac.city (JANET) writes
>It is reported that Iraq may be deploying some of the Royal Navy's latest
>high-tech weaponry.  Apparently this is causing US commanders to be reluctant
>to send aircraft carriers into the northern area of the Gulf.

Carriers are not in the Gulf because it is too small to maintain normal flight operations in -- the standard exclusion zones around a carrier task group would include oil platforms, Saudi Arabia, and Iran; and if operating in the north Gulf, Iraq.  They also can't steam east or west for wind-across-deck very long either, but that is less of a concern.

Stonefish may deter smaller ships and the Battleships (the forgotten class of capital ship) from approaching for bombardment, but they're irrelevant in that overgrown estuary for carriers.

/bill ricker/ wdr@wang.com a/k/a wricker@northeastern.edu

------------------------------

Date: Wed, 05 Sep 90 11:04:23 EDT
From: Dave Davis
Subject: Reply to "Computer Unreliability" Stars vs Selves

In response to Peter Mellor's challenge in RISKS 10.28, 31 Aug 90, let me offer the following.

On the surface, it would seem that the authors of the _Futures_ article have a fresh point of view about the risks of using computers in areas where the cost of failure is high: avionics, automated medical devices, nuclear reactors, etc.  Systems based on large quantities of software do have large numbers of states, and therefore, large numbers of failure points.  In addition, such a system may have previously unknown (to the developers) states caused by errors or outside factors, such as the EMI-caused failures of the Blackhawk helicopters.

However, the arguments the authors present (as summarized by Mr. Mellor) are somewhat similar to previous objections to utilizing relatively immature technologies.  That is, "we don't understand it well enough, so let's not trust it" is the underlying point.  Almost any significant new technology fits that argument.  Historically, it has been through applying a technology that motivation toward better theoretical understanding is created.  For example, we didn't understand thermodynamics and statistical mechanics while we applied steam power for several generations.

In addition, it is significant that the authors object to the use of statistics as a measurement technique.  One wonders if this is an attempt to play on the commonly held bias against the use of statistics.  Statistics are routinely used by all large manufacturing companies to identify production problems.

In a broader sense, the authors misunderstand how broad the implementation of an information-intensive system can be.  It is not necessarily just silicon and software.  One is reminded of the complexity of the mechanical rail switching systems described so well in previous RISKS.

The argument that discrete-state machines have inherently wilder failure modes than analog systems isn't so.  Any system that has feedback, intentional or unintentional, may behave wildly if a component fails, or it is operated outside design limits.  (In the early 70s some airliners were thought to have crashed due to the pilots over-extending their controls.)  Moreover, returning to the era of electromechanical devices that wear out and have their own idiosyncrasies is not a path toward increased reliability.
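[To make the feedback point concrete, here is a minimal, purely illustrative sketch (the gains, the disturbance, and the iteration count are invented for the example and do not describe any real controller): the same one-line loop that settles quietly inside its design limits grows without bound once a fault pushes the loop gain outside them.

    # feedback_sketch.py -- illustrative only: a trivial feedback loop that is
    # well behaved inside its design limits and "wild" outside them.

    def run_loop(gain, steps=20, disturbance=1.0):
        """Iterate x <- gain * x + disturbance and return the final value."""
        x = 0.0
        for _ in range(steps):
            x = gain * x + disturbance
        return x

    if __name__ == "__main__":
        # Within design limits (|gain| < 1) the loop converges towards
        # disturbance / (1 - gain); outside them it grows without bound.
        print("gain 0.5:", run_loop(0.5))   # settles near 2.0
        print("gain 1.1:", run_loop(1.1))   # roughly 57 after 20 steps, still climbing

The point is not the numbers but the shape: any loop, continuous or discrete, operated outside its design envelope behaves wildly.]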
Dave Davis, MITRE Corporation, McLean, VA

------------------------------

Date: Tue, 4 Sep 90 21:45:06 MDT
From: hoover@cs.ualberta.ca (Jim Hoover)
Subject: "Wild Failure Modes" in Analog Systems

Hmm, last time I taught a hardware course I emphasized that the digital computer was just a fiction invented by us theory types.  All the implementations I know of use analog devices.  Thus we already comply with the suggested legislation.

Jim Hoover, Dept. of Computing Science, University of Alberta, Edmonton, Canada T6G 2H1 | 403 492 5401 | FAX 403 492 1071 | hoover@cs.ualberta.ca

------------------------------

Date: Wed, 5 Sep 90 11:23:01 -0400 (EDT)
From: "Richard D. Dean"
Subject: Re: Wild failure modes in analog systems

>From: Pete Mellor
>Synopsis of: Forester, T., & Morrison, P. Computer Unreliability and
>Social Vulnerability, Futures, June 1990, pages 462-474.

>In contrast [to digital computers], although analogue devices
>have infinitely many states, most of their behaviour is
>*continuous*, so that there are few situations in which they
>will jump from working perfectly to failing totally.

Although analog behavior is continuous, what about resonance?  While the output may still be a continuous function of some inputs, it's certainly very non-linear in some places....  Watch the voltage (or current) on an RLC circuit go very high given the right (or wrong) frequency.

Drew Dean   rd0k+@andrew.cmu.edu

------------------------------

Date: Wed, 5 Sep 90 14:49:04 CDT
From: Will Martin
Subject: Re: "wild failure modes" in analog systems

>it is now well known that analogue devices can
>also (through design infelicities or just the perverseness of the universe) do
>inherently "wild" state switches.  The classic example is the simple dribble of
>water from a faucet, which, in the absence of analogue catastrophes, would be a
>steady stream, or an equally spaced series of droplets, but is instead a series
>of droplets whose size and spacing is unpredictable except statistically.

While this is indeed true, I think that you have to look at the "level" of the possible state change to see the analog/digital difference.  In the example cited, while each individual droplet is of unpredictable size and falls at (generally) unpredictable intervals, stepping back from the action and looking at the entire system (water pouring from the faucet) gives a predictable result -- over a period of time, a certain amount of water will flow out of that faucet.  There is not likely to be any sudden change in the rate of flow, nor is the flow likely to suddenly stop (assuming nobody is messing with the controls and there are no foreign objects in the water supply to clog the outlet).  So while the individual elements (droplets) of the flow follow chaotic paths, the flow, as a whole, follows a predictable route.

In a digital system without adequate limiting controls, each succeeding digital number could vary wildly from the preceding one.  A high-order bit could be turned on, for example, causing an effect that just could not happen in an analog system, simply because it takes time for a change to occur; analog variables can "ramp" up or down but each instance will depend, to some extent, on those preceding.  Each digital sample, though, can stand alone and enormous swings can occur in the interval of milliseconds or nanoseconds between samples.  Thus the possible range of catastrophic effects is inherently greater in digital as opposed to analog systems.
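[A trivial, purely illustrative sketch of the high-order-bit point (the 16-bit reading and the slew figure are invented for the example): a single flipped bit moves a digital sample by half the representable range in one step, a jump that a continuously ramping analog quantity could not make between two successive samples.

    # bitflip_sketch.py -- illustrative only: one flipped high-order bit in a
    # 16-bit sample versus the bounded step-to-step change of an analog ramp.

    def flip_bit(sample, bit):
        """Return the sample with one bit inverted (a single-bit fault)."""
        return sample ^ (1 << bit)

    if __name__ == "__main__":
        reading = 1000                    # hypothetical 16-bit flow reading
        faulty = flip_bit(reading, 15)    # high-order bit turned on
        print("good reading:", reading)   # 1000
        print("after bit 15:", faulty)    # 33768 -- an instantaneous huge jump

        # An analog quantity with a bounded rate of change can only move a
        # little between two successive samples, however fast things go wrong.
        max_change_per_sample = 5
        print("worst analog step:", reading + max_change_per_sample)

The caveat that follows - limit checking and sample verification - is exactly the defence that catches such a jump.]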
(Of course, well-designed digital systems with limit checking and sample verification can avoid such ill effects.)

This doesn't mean that analog systems can't suffer similar catastrophes.  In the example given, a lump of something in the water supply could clog the valve or nozzle in an instant.  So the flow could drop to zero in a shorter-than-normal time.  But that is about all that could go wrong.  The flow couldn't change from 1 liter/minute to 1 billion liters/minute in an instant, or switch to a reverse-direction flow.  A digital equivalent would be subject to such possibilities.

Will   wmartin@st-louis-emh2.army.mil OR wmartin@stl-06sima.army.mil

------------------------------

Date: Wed, 5 Sep 90 22:02:11 PDT
From: Pete Mellor
Subject: Re: "wild failure modes" in analog systems

Kent Paul Dolan in RISKS-10.30 writes about the "wild" (or "catastrophic" in Forester and Morrison's original terms) failure modes of analogue systems.  He states:

> Unless my understanding from readings in Chaos Theory is entirely flawed, the
> second sentence is simply false; it is now well known that analogue devices can
> also (through design infelicities or just the perverseness of the universe) do
> inherently "wild" state switches.

The "second sentence" here is:

>>In contrast [to digital computers], although analogue devices
>>have infinitely many states, most of their behaviour is
>>*continuous*, so that there are few situations in which they
>>will jump from working perfectly to failing totally.

First, let me say that I *almost* entirely agree with Kent.  After all, chaotic phenomena were originally demonstrated on analogue systems.  In that synopsis, I was trying to present the authors' view without prejudice.  I did not pick that particular bone with them in my subsequent criticism of their paper since I had plenty of other points to raise.

Kent goes on to say:

> So, if the original authors' intent in demeaning our increasing
> reliance on (possibly "un-failure-proofable") digital systems is
> to promote a return to the halcyon days of analogue controls,
> this is probably misdirected by the time the controls approach
> the order of complexity of operation of the current digital ones.

I agree again, *but*, we would never attempt to build systems of the complexity of our current digital systems if we had only analogue engineering to rely on.  It would not be possible.  Reliability requires simplicity.  Analogue systems would be expected to be more reliable than digital because they are forced to be simpler.  It is the complexity of the software in a digital system which leads to its unreliability.

> We may just have to continue to live with the fact, true throughout
> recorded history, that our artifacts are sometimes flawed and cause
> us to die in novel and unexpected ways, and that we can only do our
> human best to minimize the problems.

Of course!  No human endeavour is free of risk.  However we do have a choice: a) to restrict the complexity of life-critical systems, in the hope of retaining some kind of intellectual mastery of their modes of failure, and b) to stop kidding ourselves that software failure makes an insignificant contribution to the unreliability of digital systems (and we *do* - see below [1]).
Returning to chaos (as properly defined: non-linear behaviour of a system, whose basic laws are well-understood, such that second-order effects predominate and the future states of the system become unpredictable at the detailed level since arbitrarily close points in the state space can diverge along widely differing circuits): how does this differ from digital system behaviour?

When Christopher Zeeman gave a lecture on Chaos at City, I asked him what he thought was the relevance of chaos theory to digital systems.  To my surprise, he (and, I would guess, 99.99 per cent of other chaotists) had never given the problem a single thought!

Hardly surprising, if you think about it.  There *is* no physical theory of digital behaviour, and no distinction between 1st and 2nd order effects (or, if you like, everything is at least 2nd order: the slightest perturbation from a point in the state space can lead to *anywhere* arbitrarily quickly).  Has anyone out there thought about what the state space diagram of a modest digital device would look like?  The closest I could get was a billion-dimensional discontinuous space of 0's and 1's (i.e. the Cartesian product, a billion times over, of {0, 1} with itself).  Yuk!

A serious attempt *has* been made (by John Knight - reference not to hand) to examine the shapes of bugs in programs, i.e. the topological properties of those subsets of the input space which activate program faults.  Chaotists will be pleased to learn that they were fractal.

So here I side with Forester and Morrison.  Although I agree with Kent that analogue systems can behave chaotically, digital systems are far, far more chaotic than chaos!  Just an observation in passing.

[1] By the way, the reference above to belief in the perfection of software is based on what a representative of Airbus Industrie said when interviewed on the last Equinox programme on fly-by-wire (see RISKS passim).  UK viewers (and some elsewhere in Europe) should tune into Channel 4 on Sunday 30th September, when an updated version of this programme will be transmitted.  Approximately 50 per cent of the material is new, including some *very* interesting stuff on the Mulhouse-Habsheim disaster.

Peter Mellor, Centre for Software Reliability, City University, Northampton Sq., London EC1V 0HB   +44(0)71-253-4399 Ext. 4162/3/1   p.mellor@uk.ac.city (JANET)

------------------------------

End of RISKS-FORUM Digest 10.31
************************