RISKS-LIST: RISKS-FORUM Digest  Thursday 13 July 1989  Volume 9 : Issue 4

        FORUM ON RISKS TO THE PUBLIC IN COMPUTERS AND RELATED SYSTEMS
   ACM Committee on Computers and Public Policy, Peter G. Neumann, moderator

Contents:
  Air Traffic Computer Fails 104 Times in a Day (Rodney Hoffman)
  A320/MD-11 F-B-W differ on pilot authority (Mark Seecof, Rodney Hoffman)
  UK MoD S/W Std -- "Crystal Clock" Architecture (Bob Munck)
  A Biological Virus Risk (Frank Houston)
  Software engineering models -- an apology (Rich D'Ippolito)

  The RISKS Forum is moderated.  Contributions should be relevant, sound,
  in good taste, objective, coherent, concise, and nonrepetitious.
  Diversity is welcome.
  * RISKS MOVES SOON TO csl.sri.com.  FTPable ARCHIVES WILL REMAIN ON KL.sri.com.
  CONTRIBUTIONS to RISKS@CSL.SRI.COM, with relevant, substantive "Subject:"
  line (otherwise they may be ignored).  REQUESTS to RISKS-Request@CSL.SRI.COM.
  FOR VOL i ISSUE j, ftp KL.sri.com[CR]login anonymous (ANY NONNULL PASSWORD)[CR]
  get stripe:risks-i.j ... (OR TRY cd stripe:[CR]get risks-i.j)
  Vol summaries (i.j)=(1.46),(2.57),(3.92),(4.97),(5.85),(6.95),(7.99),(8.88).

----------------------------------------------------------------------

Date: 12 Jul 89 07:59:01 PDT (Wednesday)
From: Rodney Hoffman
Subject: Air Traffic Computer Fails 104 Times in a Day

Summary of an article by Jeffrey A. Perlman in the `Los Angeles Times',
12 July 1989, headlined AIR TRAFFIC COMPUTER FAILS 104 TIMES IN A DAY:

Unspecified hardware problems caused 104 brief failures in a new
multimillion-dollar computer system throughout the day Sunday at the Coast
Terminal Radar Approach Control, known as Coast TRACON, in El Toro.  The
failures, up to five minutes in length, wiped altitude, speed, aircraft
identification, and radio frequency data off controllers' screens, leaving
only aircraft blips or targets.

The new system has no backups, unlike the aging system it replaced 2.5 months
ago.  Furthermore, a decoder, which could have provided some of the lost
information from aircraft transponders, is inoperative because parts are on
back order.  During the failures, controllers had no way of knowing an
aircraft's altitude or identity except through voice communications.

Controllers said the failures endangered air safety, although the FAA
minimized the hazards.  Although there were no "near misses," many aircraft
departing from LA International airport did encounter short delays because
controllers were swamped.

An emergency technical crew was flown in from New Jersey and worked all night
Monday to correct the problem.  A controller spokesman said there were still
at least three more failures as of Tuesday afternoon, although the FAA's
assistant regional manager said he was unaware of any further computer
failures after Monday night's repairs.

The National Air Traffic Controllers Association has previously filed
complaints with the FAA about the El Toro system and about TRACON facilities
elsewhere that have received the same system.

  [MORE SUBSEQUENTLY FROM RODNEY:]

The previously unspecified hardware problems are now spelled out: According to
Jim Panter, manager of the facility, "no new computer system was involved.  It
was strictly old equipment that needed to be replaced."  The series of
computer outages was traced to four brittle, broken wires inside aging memory
storage units.  Four of the Univac computer's 11 memory modules, in use for at
least 12 years, went out on Sunday because of the wires.
A heat build-up aggravated the situation, with the computer system shutting
itself down and restarting, catching up with new data, as it tried on its own
to locate the memory storage problem.  A special crew from the FAA's technical
center in Atlantic City, NJ, worked on successive nights this week to solve
the problem, but because of a parts shortage some outages still occurred
Monday and Tuesday, Panter said.  There were no outages Wednesday, he said.

The computer system involved has been in use at least since 1972, although it
has been upgraded several times with new software.  The software was not
involved in Sunday's failures, although it has been involved in problems at
other TRACON facilities.  "Just like any program, the new software has had its
glitches," said FAA spokesman John Leyden in Washington.

  [This item was also noted -- briefly -- by Dave Curry.  PGN]

------------------------------

Date: Wed, 12 Jul 89 11:23:44 PDT
From: lcc.marks@SEAS.UCLA.EDU
Subject: A320/MD-11 F-B-W differ on pilot authority

These excerpts are from "Flying the Electric Skies," M. Mitchell Waldrop,
Science (30 June 1989, p.1532ff, v.244).  Elisions... and [bracketed
paraphrasing] are mine as part of condensation.  Comments follow.

[...] McDonnell Douglas... thinks [pure fly-by-wire isn't quite mature, so
it's]... taking a different tack with its new MD-11, a three-engine widebody
scheduled to start service in late 1990.  The MD-11 is even more highly
automated than the A320 in many ways. ... But the MD-11's computerized
controls most emphatically will have mechanical backups.  ``The pilots have to
have control in all situations, not just the normal ones,'' says Douglas'
Miller.  ``A lot of what ended up in the Airbus got done because it was
neat,'' claims Joel Ornelas, manager of the MD-11 design effort.  ``Engineers
love it.  But the pilots...?''

[...] Airbus [doesn't agree.  They think their stuff is great and saves weight
and can't fail; but]... the A320 does have a partial backup system: a set of
cables running back to the tail and rudder.  In an absolute power failure
those cables are supposed to let the crew keep the airplane under control
until they establish emergency power -- or if need be, longer.  ``In test
flights we've demonstrated {operation with the backups} from cruise... to
landing... which is more than they were designed to do.''

[... However, the pilot of an A320] is... going to have to live with the
A320's preprogrammed restrictions on what it will do.  This guardian angel
behavior, known more formally as the Flight Envelope Protection System, is
widely considered by aviation professionals to be... far more revolutionary...
than fly-by-wire per se.  In effect, the A320's designers have decreed that
their judgement about the aircraft's limits will always take precedence over
the pilot's judgement.  And that is not a constraint that any pilot can take
lightly.  ``*Nothing* must have the authority to forbid the pilot to take the
actions he needs to,'' says McDonnell Douglas' Miller.  The problem with
giving away that authority to a computer, he says, is that ``a computer is
totally fearless -- it doesn't know that it's about to hit something.''

Imagine, for example, that an A320 pilot makes a sharp turn to avoid an
imminent collision, or else dives and then has to pull out to avoid the
ground.  No matter how desperately he or she hauls back on that sidestick, the
envelope protection system will not let the airplane respond beyond a certain
rate: namely, the rate that limits stress on the airframe to... 2.5G.
...this seems like plenty.  [It would] require quite a violent maneuver [to
pull 2.5G]. ...But consider... China Airlines Flight 006 on 19 February 1985.
While cruising at 41,000 feet some 300 miles NW of San Francisco... the 747
suffered [power loss and cockpit confusion leading to a] near-vertical dive.
Over the next three minutes it plunged nearly 6 miles, until the captain was
able to maneuver it back into level flight at only 9500 feet.  He measurably
warped the wings.  He caused several million dollars worth of other structural
damage.  But he saved the airplane and passengers.  And to do it he had to
pull an estimated 5.5G, or more than twice the A320 limits.

Airbus, not surprisingly, responds that an A320 never would have [entered the
dive] in the first place. ...Be that as it may, notes [747 pilot] Waldrip,
many pilots do like to feel that they can bend or break the aircraft [if they
must].

[McDonnell Douglas agrees.]  ``Our strategy... is that the pilot must have
overriding authority,'' says Miller, ``and he must be able to exercise that
control by the normal method'' -- the same eye-hand-brain control loop that a
pilot develops through... experience.

For example, imagine a pilot caught by wind shear, a sudden and very dangerous
burst of cold air falling out of a rain cloud.  The best thing to do in such a
situation is to pull the nose up and put the power to the wall.  Never mind if
the engines have to be rebuilt later; you're just trying to keep from hitting
the ground.  However, this is hardly the time to be thinking about things like
special override switches, says Miller.  ``You want your normal... actions to
always have the expected effects.''  So on an MD-11, the pilot would push the
throttle all the way forward until it stopped to get the full rated power of
the engine, with the built-in limits providing the same kind of security as
the A320's envelope protection system.  But then by doing the instinctive
thing -- pushing very, very hard -- he could break through to a regime the
A320 absolutely forbids: extreme thrust, well beyond the rated power.  And so
it goes throughout the MD-11, says Miller.  The limits are there for safety,
``but to override you just have to pull hard or push hard.''  [...]

The Mac-Dac engineers seem to have worked hard on the ``human-factors'' stuff.
I'm struck by the thought that `push hard to override' is rather like
"pounding on things," which has sort-of appeared in office computer
interfaces; for example... giving a likely-to-be-an-error command to the vi
text editor will provoke a message the first time you try it, but if you
repeat the command immediately vi will shrug and do what you want.  It seems
to me that risks from faulty computer-human interaction probably can be
reduced by this sort of thing.  When time is critical, it's handy if "more
force" gets you what you want without fumbling around for some obscure
override control.

The issue of "who's in charge here, anyway?" is more interesting.  The A320
approach assumes that pilot error is more likely than proper-though-unusual
control motions.  Since pilots are still more versatile than the computers in
their planes, they should probably have the last word.  Also, pilots can learn
new tricks faster than computers can receive certified software upgrades.
Post-aircrash review (and recently, simulator recreation/experimentation)
often produces "how-to" bulletins for pilots to deal with unusual situations.
It might be a long time before that stuff could be translated into revised FCS
software.
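The contrast between the two limit philosophies can be caricatured in a few
lines of code.  The sketch below is purely illustrative -- it is not drawn
from either manufacturer's flight-control software, and the constants (the
2.5G nominal limit, a "breakout" stick force, a 5.5G absolute ceiling) and
function names are assumptions invented for this example.

    /* Illustrative sketch only -- not actual A320 or MD-11 code.  All
     * constants and names are invented for this example.               */
    #include <stdio.h>

    #define NORMAL_G_LIMIT   2.5   /* nominal structural limit (load factor) */
    #define ABSOLUTE_G_LIMIT 5.5   /* hypothetical "bend the airframe" ceiling */
    #define BREAKOUT_FORCE   50.0  /* stick force (lbf) treated as "pulling hard" */

    /* A320-style hard envelope protection: the commanded load factor is
     * clamped no matter how hard the pilot pulls.                       */
    double hard_limit(double requested_g)
    {
        return (requested_g > NORMAL_G_LIMIT) ? NORMAL_G_LIMIT : requested_g;
    }

    /* MD-11-style soft limit: the same clamp applies under ordinary stick
     * forces, but an unmistakably hard pull lets the pilot trade airframe
     * life for survival, up to an absolute ceiling.                      */
    double soft_limit(double requested_g, double stick_force)
    {
        if (stick_force < BREAKOUT_FORCE)
            return hard_limit(requested_g);
        return (requested_g > ABSOLUTE_G_LIMIT) ? ABSOLUTE_G_LIMIT : requested_g;
    }

    int main(void)
    {
        /* A desperate 5.0G pull-out, as in the China Airlines recovery:  */
        printf("hard limit: %.1f g\n", hard_limit(5.0));        /* 2.5 g  */
        printf("soft limit: %.1f g\n", soft_limit(5.0, 80.0));  /* 5.0 g  */
        return 0;
    }

The vi behavior mentioned above is the same pattern in miniature: an input
repeated, or applied with unusual emphasis, is treated as deliberate rather
than erroneous.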
Mark Seecof, Locus Computing Corp., Los Angeles (213-337-5218)
My opinions only, of course...

------------------------------

Date: 12 Jul 89 14:52:10 PDT (Wednesday)
From: Rodney Hoffman
Subject: Fly-by-wire overview

  [More excerpts from the same 'Science' article as in the foregoing... PGN]

The stories briefly survey the philosophy, state of the art, and controversy
of fly-by-wire aircraft and (overly-?) protective new extensions.  Designers
on both sides of the argument, as well as pilots, are given their say.

[...] Airbus engineering test pilot Udo Guenzel: ``Suppose you suddenly find
yourself staring at a Cessna that has wandered into your airspace.  So you
swerve.  Now, in a standard airliner, you would probably hold back from
maneuvering as hard as you could for fear of tumbling out of control, or
worse....  But in the A320, you could just slam the controller all the way to
the side and instantly get out of there as fast as the plane will take you.''

The A320 has five separate computers, "redundant software obtained from two
different vendors to minimize the possibility that the same bug will appear
simultaneously," heavy shielding on its data cables to keep out
electromagnetic interference, "and, yes, the A320 does have a partial
[mechanical] backup system."

The sidebar reviews more of the history of fly-by-wire and flight automation
philosophy:  "We started out with cockpit automation backwards," says
Northwest Airlines 747 pilot Kenneth Waldrip.  In the 1970s and early 1980s,
he says, "the idea was that the computers would fly the plane and the pilot
would monitor them in case anything went wrong...."  There was only one
problem with that scenario, Waldrip says: humans are absolutely terrible at
passive monitoring....  People get bored.  Their attention flags.  They start
missing things.  Worse, a passive pilot would often have to tackle an
emergency cold....

By the mid-1980s, aircraft designers, pilot trainers, and the aviation
community generally had gone through a 180-degree turn in their concept of
what automation should do.  The new philosophy, which often goes under the
name of "human-centered" automation, was illustrated in 1980 in a seminal
paper by human factors researchers Earl Wiener (Univ. of Miami) and Renwick
Curry (NASA Ames Research Center).  They used the image of an "Electric
Cocoon" [similar to the Flight Envelope Protection System of today's A320].
Instead of having people watch the machines, human-centered automation means
having the machines watch the people.... and it means putting automation back
on the right track: as an assistant to the pilots.

------------------------------

Date: Thu, 13 Jul 89 14:31:52 EDT
From: Bob Munck
Subject: UK MoD S/W Std -- "Crystal Clock" Architecture

I can't let go unchallenged Norm Finn's and Nancy Leveson's statements
(RISKS-9.2, .3) on what are variously called "polling," "time-slot," or
"crystal clock" system architectures:

  "Polling is absolutely the safest and most-used method for programming
  embedded systems for safety-critical applications."

  "The trivial scheduling algorithm makes the interaction between the
  routines vastly easier to write and verify than is the case in a system
  that can switch tasks at random times."

  "Software with these characteristics is safer because it is deterministic
  rather than non-deterministic -- the number of states may often be reduced
  to a small enough number to perform extensive (and sometimes exhaustive)
  testing and analysis."
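For concreteness, here is a minimal sketch of the kind of time-slot ("cyclic
executive") loop these statements describe.  All routine names, frame counts,
and rates below are invented for illustration; a real system hangs many more
routines off each slot and loops forever instead of stopping after two major
frames.

    /* Minimal sketch of a time-slot (cyclic executive) architecture.
     * Everything here is illustrative; names and rates are invented.   */
    #include <stdio.h>

    #define MINOR_PER_MAJOR 4   /* e.g., four 25-ms minor frames per 100-ms major frame */

    static void wait_for_tick(void) { /* real system: block until the crystal-clock interrupt */ }
    static void read_sensors(void)  { printf("  read_sensors\n");  }  /* every minor frame     */
    static void control_laws(void)  { printf("  control_laws\n");  }  /* every minor frame     */
    static void navigation(void)    { printf("  navigation\n");    }  /* minor frames 0 and 2  */
    static void telemetry(void)     { printf("  telemetry\n");     }  /* once per major frame  */

    int main(void)
    {
        /* Demonstrate two major frames; a real executive never exits.   */
        for (unsigned frame = 0; frame < 2 * MINOR_PER_MAJOR; frame++) {
            unsigned minor = frame % MINOR_PER_MAJOR;
            printf("minor frame %u:\n", minor);
            wait_for_tick();           /* everything below must finish within the slot */
            read_sensors();
            control_laws();
            if (minor % 2 == 0) navigation();
            if (minor == 0)     telemetry();
        }
        return 0;
    }

The claimed safety benefit of the statements quoted above comes from the fixed
schedule: the interleaving of routines is identical on every major frame, so
the set of reachable states stays small enough to test.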
I don't want to comment on the _theoretical_ truth of these statements; that
argument may never be resolved.  However, in my personal experience with major
DoD systems over many years, this approach to system architecture has led to
many more *absolute abominations* than any other.

The major problem, I believe, is that it does not work well with normal design
and coding team organization and management.  The approach AS PRACTICED seems
to grind together the pieces of any original functional organization, break
them into small pieces, and then "shotgun" them all over the time-slot
hierarchy.  Traceability from requirements to code, never our strong suit, is
lost.  "Invisible dependencies" become the order of the day: code in major
cycle XXYa34/minor cycle Ybbb_CAL assumes that variable IXXm3_AP was set
correctly by a side-effect of code in XXZdd3/PQ88_SET/ECP9347q; when XXZdd3 is
reorganized because it wasn't fitting in its time-slot, the assumption becomes
incorrect.  This is first noticed when the $100 million satellite attempts to
orbit at -800 miles altitude.

Because our teams are large, scattered, ill-managed, and leavened with the
mediocre, the possible advantages quoted above are overwhelmed.  I feel about
these dueling approaches as I do about HOLs: a good programmer/team/company
can write a good system in any language, and a bad one can write a bad system
in any language.  The time-slot systems I've seen might have been even worse
had they been designed to be multi-tasking; the choice of architecture was not
the root cause of failure.  It may be a symptom.

                       -- Bob Munck, MITRE Corporation, McLean

------------------------------

Date: Wed, 12 Jul 89 12:54:43 -0400
From: houston@itd.nrl.navy.mil (Frank Houston)
Subject: A Biological Virus Risk

Over the past six years, FDA has documented an increase in medical device
problem reports of, and voluntary recalls for, system design errors and
computer programming errors.  Some of the reported errors have compromised
patient safety.  Included are blood bank computer systems used in controlling
the release of whole blood and blood products for treating patients.  Some of
the errors could have allowed institutions to release infected blood products.
Such inappropriate release is particularly hazardous to the public health
because blood products infected with HIV (the AIDS virus) and hepatitis might
become available for transfusion.  Fortunately, so far, no cases of HIV
infection have been traced directly to computer errors, but agency experts and
field investigators believe this result is only because the institutions have
not depended heavily on software controls to authorize release of blood
products.

One particularly troubling aspect of the problem, uncovered during an
inspection, was a lack of security on serologic data entries.  The system
allowed more than one operator to edit the same record simultaneously, with
the result that the last one to close the record had the final word on its
contents.  That is to say, one operator could enter a test result of positive
for hepatitis and exit, while at the same time a second operator entered some
other result (the default hepatitis result being negative); so when the second
operator exited, the default negative for hepatitis became the permanent
entry.  This seems to violate well-known principles of data integrity, but it
happened.
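One standard defense against this "last writer wins" failure is to stamp each
record with a revision counter and refuse any save whose edit began against a
stale revision.  The sketch below is illustrative only -- the record layout,
field names, and functions are invented, and no real blood-bank system is
being described.

    /* Illustrative sketch of a revision-counter check against lost updates. */
    #include <stdio.h>

    struct donor_record {
        char id[16];
        int  hepatitis_positive;   /* 0 = negative (default), 1 = positive */
        int  revision;             /* bumped on every successful save      */
    };

    /* Save succeeds only if the stored record is still at the revision the
     * operator originally read; otherwise the operator must re-read.      */
    int save_record(struct donor_record *stored,
                    const struct donor_record *edited, int revision_read)
    {
        if (stored->revision != revision_read)
            return -1;                      /* concurrent edit detected: refuse */
        *stored = *edited;
        stored->revision = revision_read + 1;
        return 0;
    }

    int main(void)
    {
        struct donor_record db = { "D-001", 0, 1 };

        /* Operators A and B both read the record at revision 1.            */
        struct donor_record a = db, b = db;
        a.hepatitis_positive = 1;   /* A enters a positive hepatitis result  */
        /* B never touches the hepatitis field; it stays at the default 0.   */

        printf("A saves: %s\n", save_record(&db, &a, 1) == 0 ? "ok" : "rejected");
        printf("B saves: %s\n", save_record(&db, &b, 1) == 0 ? "ok" : "rejected");
        printf("final hepatitis flag: %d\n", db.hepatitis_positive);  /* stays 1 */
        return 0;
    }

With such a check, the second operator's save is rejected, and a positive
result cannot silently revert to the default negative.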
The software in question is no longer used, but the problem remains: how does
one assure that dangerous bugs have been eliminated from commercial software
when one has neither real-time information about, nor control over, the way it
is developed?

Frank Houston, FDA/CDRH

------------------------------

Date: Wed, 12 Jul 89 17:25:50 EDT
From: rsd@SEI.CMU.EDU
Subject: Software engineering models -- an apology

  [The lateness of this response and to my e-mail is due to a city water-main
  break that flooded the basement of the building with the cooling equipment
  in it, shutting down our building for 6 days -- as noted in RISKS-9.1.]

First, let me apologize to Stan Shebs and the readers of RISKS for the tone
and content of my recent article.  It was not intended to be a
shoot-the-messenger reply, and I'm sorry that it appears to be.  I most likely
read the original article as a series of general unsupported charges against a
specific former employer who had no way to respond, and took my cue from the
clearly provocative title.  Let me make the attempt to extract some lessons
from the original article and all of the subsequent discussion:

Mr. Loux asks if the intent was to discourage whistle-blowers.  Of course not;
I didn't take the article as whistle-blowing to begin with.  The intent was to
elicit more specific information on exactly what the charges were and whom
they were against.  Many of the statements were made with little or no support
other than admitted hearsay, and I'm sorry I didn't respond to the message
instead of the statements.  I will disregard those statements in the remaining
discussion.

It was not clear to me that there was a connection between the "model"
software engineering methodology and the file transfer problem -- I should not
have missed that.  Mr. Shebs's clarification of this point, and his expansion
to the comparison of meeting the letter of DoD acquisition requirements with
software quality, should be seen as evidence for the immaturity of the SE
discipline everywhere -- not just in the defense industry.  It is the real
lack of true engineering methods that leads to the acceptance of "models"
where the emphasis and attention are placed on the checking off of boxes on a
delivery schedule.  After all, the incentive structure is set up to reward
schedule adherence, so it is easy to predict that the efforts will be devoted
to getting all of the boxes checked at the right times.  How can you fault the
management for doing what they are paid to do?  Even a well-educated
management (in terms of technical realism) will have a hard time making the
right response under such reward conditions.  Many non-defense contractors
have found that offering the programmers extra rewards for meeting schedules
does not get better software, nor does it get it on time, when the impossible
is requested.

Mr. Shebs also states that the greatest risk might have been from things that
weren't anticipated during testing.  I suggest that waiting until testing is
too late.  The time to do the risk analysis is in the initial stages of
design, so that the testing procedures are only used to verify that the
implementation is faithful to the design.  It is because there is not a clear
separation of design from implementation in the software industry that testing
is misunderstood and misused.  The issue of problem reports and their unclear
relationship to software quality is also a result of the failure of
traditional methodologies to separate design from implementation, and of their
nearly exclusive focus on process.
The best process in the world can't be expected to do more than produce
economical, consistent copies of poor designs!  The problem will be with us
until we shift from process orientation to product orientation.

I would like to take issue with Mr. Loux's characterization of the problem as
being inherent in engineering.  True engineers do not respond to absurd or
impossible situations.  Engineering, as an applied science, has been the basis
for reasonable expectations, as engineers are averse to undertaking projects
of any kind without having a base of working models of prior solutions
(captured scientific knowledge) to adapt.  Without such models, there is
nothing on which to build a set of application techniques to apply the
knowledge, let alone a way to generate a rational set of tests and
expectations for the final product.  Which company, defense or otherwise, will
be the first to refuse to bid on a project requiring technology that it has
never delivered, and which software engineers will be the first to refuse to
agree to deliver something for which no models have been proven?

Rich

------------------------------

End of RISKS-FORUM Digest 9.4
************************