Mail-From: NEUMANN created at 8-Mar-87 16:21:49
Date: Sun 8 Mar 87 16:21:49-PST
From: RISKS FORUM (Peter G. Neumann -- Coordinator)
Subject: RISKS DIGEST 4.58
Sender: NEUMANN@CSL.SRI.COM
To: RISKS-LIST@CSL.SRI.COM

RISKS-LIST: RISKS-FORUM Digest   Sunday, 8 March 1987   Volume 4 : Issue 58

  FORUM ON RISKS TO THE PUBLIC IN COMPUTER SYSTEMS
  ACM Committee on Computers and Public Policy, Peter G. Neumann, moderator

Contents:
  The Sperry Plan, FAA Certification, and N-Version Programming
    (Nancy Leveson) (LONG MESSAGE)

The RISKS Forum is moderated.  Contributions should be relevant, sound, in
good taste, objective, coherent, concise, and nonrepetitious.  Diversity is
welcome.  (Contributions to RISKS@CSL.SRI.COM, requests to
RISKS-Request@CSL.SRI.COM.)
(Back issues Vol i Issue j available in CSL.SRI.COM:RISKS-i.j.  MAXj:
Summary Contents Vol 1: RISKS-1.46; Vol 2: RISKS-2.57; Vol 3: RISKS-3.92.)

----------------------------------------------------------------------

Date: 07 Mar 87 10:34:16 PST (Sat)
From: Nancy Leveson
To: JCK.UVACS@relay.cs.net, neumann@csl.sri.com
Subject: The Sperry Plan, FAA Certification, and N-Version Programming
ReSent-To: RISKS@CSL.SRI.COM

Since my original message to Risks about the Sperry plan, I have visited the
FAA Certification Office in Seattle and, for other reasons, the Boeing
Commercial Aircraft Co. in Seattle.  The Boeing employees I spoke to told me
that they have rejected the Sperry plan for their software.  However, it is
planned to be used for a Category III autopilot for the MD-11.  This
autopilot is safety-critical for 30-45 seconds during autolanding.  There is
also great dependence on n-version programming for the fly-by-wire Airbus
320, but I have no details about the A320 software development, such as the
testing procedures used (except for a blurb in Aviation Week stating that
n-version programming provides "optimum" protection against software errors
in that aircraft).

To remind Risks readers, Sperry wants to use N-version software in place of
"white box" (structural) testing.  Black box (system) testing would still be
performed, and it would be done "back-to-back."  Back-to-back testing means
that multiple versions of the software are executed on the same test data;
if they all agree on the answer, it is assumed that they are all correct.

Kevin Driscoll writes in Risks 4.55 that this plan is not really so scary.
In order to follow the discussion, some background on aircraft software
certification is needed.  There is a document used by the FAA (and written
primarily by the manufacturers) called RTCA/DO-178A, "Software
Considerations in Airborne Systems and Equipment Certification."  It lays
out the requirements for software development, testing, configuration
control, and documentation.  The requirements are pretty basic -- about what
I would recommend for a good inventory program.
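For concreteness, here is a minimal sketch of back-to-back testing as just
described (in Python, with invented version functions and test data -- the
point is only that agreement among versions is taken as evidence of
correctness, with no independent oracle in the loop):

    # Back-to-back testing: run N independently developed versions on the
    # same test data and compare their outputs.  No independent oracle is
    # consulted; agreement is taken as evidence of correctness.

    def version_a(x):                    # hypothetical implementation 1
        return x * x

    def version_b(x):                    # hypothetical implementation 2
        return x ** 2

    def version_c(x):                    # hypothetical implementation 3
        return abs(x) * abs(x)

    VERSIONS = [version_a, version_b, version_c]

    def back_to_back(test_inputs):
        discrepancies = []
        for x in test_inputs:
            outputs = [v(x) for v in VERSIONS]
            if len(set(outputs)) > 1:    # versions disagree: flagged for review
                discrepancies.append((x, outputs))
            # If all versions agree, the result is *assumed* correct;
            # an error shared by every version passes through silently.
        return discrepancies

    print(back_to_back(range(-3, 4)))    # prints [] -- no disagreements found

Note that nothing in this procedure distinguishes "all versions correct"
from "all versions wrong in the same way"; that distinction becomes
important below.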
In general, the DO-178A requirements are:

  (1) developing software requirements and verifying them against system
      requirements [no requirement for any formality in the process];
  (2) using a design discipline or method [not specified as to which one --
      it just says you need to use one] that makes the software traceable,
      testable, maintainable, and understandable;
  (3) doing a design review against requirements;
  (4) using an implementation technique that is understandable, testable,
      and traceable to the design;
  (5) doing requirements-based and structure-based tests, including module
      testing, module integration testing, and hardware/software integration
      testing, with a requirements coverage analysis and a software
      structural coverage analysis [this is what Sperry wants to eliminate,
      except for system test];
  (6) providing software configuration management;
  (7) providing a quality assurance plan.

The document specifies software function criticality categories: Level 1
(flight-critical: failure prevents continued safe flight), Level 2
(flight-essential: failures reduce the capability of the aircraft or the
ability of the crew to cope with adverse conditions), and Level 3
(non-essential: failures could not significantly degrade aircraft capability
or crew ability).  The criticality level seems to determine what information
is provided to the FAA for certification and, in some cases, which of the
above requirements are enforced.  For Level 1 software, for example, the
manufacturer must provide detailed information about the verification that
was done.  For Level 2, in general only a summary description of the process
along with a statement of compliance must be submitted.  For Level 3, no
assurance is required.

In terms of certification effectiveness, independent evaluation is possible
only with information.  So providing just a statement of compliance seems to
me to imply that no external, independent evaluation is possible.  There is
no way to check that they actually did comply and that the verification that
was done was adequate and correct.  I certainly do not want to imply that
the manufacturers and subcontractors will not try to do the best job
possible -- after all, they have the liability, and they are decent human
beings who care about human life.  The problem is that without external
review we are depending on the competence of the people at these companies,
and I am not as sanguine about the general state of software engineering
knowledge and practice in industry as I am about the good intentions of
humans.

So far, though, things are not really TOO awful, but wait ...  The problem
seems to arise from one sentence (which was added between version 178 and
178A and seems to be the major change); it states, "Using appropriate design
and/or implementation techniques and considerations, it may be possible to
use a software level lower than the functional categorization."  This is the
kicker.  Sperry is arguing that although the autopilot software is Level 1,
they are using n-version programming and therefore it can be treated as
Level 2.  There is also a phrase, "the software level implies effort that
... may vary within criticality level."  So it appears they can also modify
any of the requirements within a level (provided the FAA agrees).  BOTTOM
LINE: even the very basic requirements listed above can be eliminated fairly
easily.  Personally, I would require MORE than is stated in DO-178A for both
Level 1 *AND* Level 2 software development and verification.
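To make concrete what the structural coverage analysis in item (5) adds over
requirements-based (black-box) testing alone, here is a small, entirely
hypothetical illustration in Python.  A branch that none of the functional
test inputs happens to exercise is invisible to black-box or back-to-back
testing, but shows up immediately in a statement- or branch-coverage report.

    # Hypothetical sensor-scaling routine with a rarely-taken branch.
    def scale_reading(raw):
        if raw < 0:                 # out-of-range handling, never reached by
            return 0.0              # the nominal test data below
        return raw * 0.1

    # Requirements-based (black-box) test data drawn from nominal inputs:
    black_box_inputs = [0, 5, 100, 1023]
    assert all(scale_reading(r) == r * 0.1 for r in black_box_inputs)
    # Every test passes, and several independently written versions could
    # easily agree on these inputs as well, so back-to-back comparison
    # would also stay silent.

    # A structural coverage analysis would report that the "raw < 0" branch
    # was never executed by any test, forcing someone either to add a test
    # for it or to justify its absence.  That is the kind of check Sperry
    # proposes to drop.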
As examples of what is possible beyond DO-178A, the DoD, besides requiring
good software engineering practice, requires a safety and hazard analysis of
the software.  The Air Force and Navy also require independent verification
and validation (IV&V) by a qualified company (Logicon does a lot of this)
for all nuclear systems [called Nuclear Safety Cross Check Analysis by the
Air Force and Software Nuclear Safety Analysis by the Navy].  These IV&V
efforts are MUCH more rigorous than anything the FAA appears to be doing.
Note that the DoD requires proof of the safety of the software itself and
not just proof that the developers have satisfied minimal development
practices.

The most amazing part of the RTCA document is the fact that using a
particular method, such as n-version programming, can somehow magically
change the criticality level of the software (from flight-critical to
flight-essential or non-essential).  Since the function of the software does
not change with the development method, this appears ridiculous.  I can only
assume that they are arguing that the reliability will be so high that
failures will never occur and therefore the criticality of the function is
irrelevant; this is the only interpretation that makes sense to me.  But
there is no current software engineering technique that can guarantee such
ultra-high reliability (including N-version programming)!  And since the
document dismisses the use of any measurement techniques (it states that
currently available methods do not yield results in which confidence can be
placed to the degree required) and does not even mention any formal
verification methods, NO demonstration is required that they have reached
perfection (or any particular level of reliability) using the particular
design or implementation technique.

In the Sperry case, their argument for N-version programming appears to rest
on a simplistic model presented by Larry Yount at an AIAA conference in Long
Beach.  This model assumes statistical independence between failures of the
n versions.  This assumption has been shown not to hold in controlled
experiments and, in fact, is not believed by most researchers in the field.
At a workshop this summer, Larry put up a chart showing that his model
predicted a 20,000-fold improvement in reliability from the use of n-version
programming.  Since actual experiments have found at best only a 7-10 times
improvement, his figures appear to be patently ridiculous.

Kevin Driscoll (Risks 4.55) states:

> In its letter to Sperry, the FAA says that this method "appears to be
> satisfactory" with the following constraints:
> a. Level 1 must be used for paragraphs 6.2.2 (Requirements Development
>    and Verification) and 6.2.3 (Design).
> b. Formal configuration control must be used and, if common errors are
>    found, structural testing may be required for some or all of the
>    modules.

Common errors have been found in EVERY experiment done so far in n-version
programming (at least, in every one that has checked for them, which is
about four or five).  The problem is that with only three versions of the
software and the use of back-to-back testing, the only common errors that
can be detected are those shared by just two of the versions.  Errors common
to all three versions cannot be detected (unless some outside method of
determining correctness is used).  In my experiment with John Knight at the
University of Virginia, we found common failures in up to eight
independently developed programs.
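The arithmetic behind such independence claims is worth spelling out.  The
figures and the little simulation below are purely illustrative -- they are
not Larry Yount's model or his numbers.  If each version failed
independently on a given input with probability p, a 2-out-of-3 voted system
would fail with probability roughly 3*p^2, which for small p looks like a
spectacular improvement.  But when even a small fraction of the inputs cause
all the versions to fail together, the improvement collapses toward the
single-digit factors seen in experiments, and back-to-back comparison never
flags those common failures because all the versions agree.

    import random

    def simulate(p_indep, p_common, trials=1_000_000, seed=1):
        # p_indep:  probability that a single version fails on its own
        # p_common: probability of a "difficult input" on which all three
        #           versions fail together (the correlated-failure case)
        # Both parameters are invented for illustration.
        rng = random.Random(seed)
        single = voted = undetected = 0
        for _ in range(trials):
            if rng.random() < p_common:
                fails = [True, True, True]   # all agree on a wrong answer;
                undetected += 1              # back-to-back comparison is silent
            else:
                fails = [rng.random() < p_indep for _ in range(3)]
            single += fails[0]               # failure rate of one version alone
            voted += sum(fails) >= 2         # 2-out-of-3 majority fails
        return single / trials, voted / trials, undetected / trials

    # Purely independent failures: a large apparent improvement
    # (roughly 30-fold with these parameters).
    s, v, u = simulate(p_indep=0.01, p_common=0.0)
    print("independent:", s, v, "improvement ~", round(s / v, 1))

    # A small rate of common failures dominates the voted system (roughly a
    # 5-fold improvement here), and none of those common failures can be
    # detected by comparison alone.
    s, v, u = simulate(p_indep=0.01, p_common=0.002)
    print("correlated: ", s, v, "improvement ~", round(s / v, 1),
          "undetected:", u)

Under the independence assumption one can, by picking p small enough,
predict improvement factors as large as one likes; the number says more
about the assumption than about the software.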
Also, any errors that can be traced back to the specification will, of
course, tend to manifest themselves in common across the versions.

> c. Formal review and comparison of source code must be used to verify
>    dissimilarity.  Where this is not feasible, Level 1 structural test
>    and analysis must be used.

How does one verify dissimilarity?  In fact, how does one even define it?
Obviously the programs must be similar in that they are computing the same
function.  The only dissimilarity we really want is dissimilarity in failure
behavior.  Syntactic dissimilarity is irrelevant.  Again, John Knight and I
found programs that used completely different algorithms to compute a
function yet failed on the same input data.  The problem is that certain
input cases are inherently more difficult to handle.  For example, when
computing the angle determined by three points, programs tended to fail on
inputs where the points were collinear or coincident.  The errors were not
the same, nor were the algorithms, but they failed on the same input data.
So looking to see that different algorithms are used is not adequate.

This is the problem with talking about a concept like "dissimilarity" or
"diversity" without ever formally defining it: there is no way to know
whether you have it, nor any way to measure it.  It is similar to the
problem with using the term "artificial intelligence" when the term
"intelligence" remains undefined.  One can merely claim that a program is
intelligent, and it is difficult to dispute the claim (or to prove it
either).  How does one prove or disprove that dissimilarity or diversity
exists?

> It seems to me that c. is the same as doing structural analysis.
> Therefore, this method is not any less rigorous than "full" DO-178A
> Level 1.

I can see no relationship between verifying dissimilarity among two or three
programs and structural analysis of the correctness of a single program,
especially given that I know how to do the second but not the first.  I am
not quite sure what Kevin means by "less rigorous."  Certainly, we have much
more experience with structural testing than with n-version programming.
There is no evidence anywhere that structural testing is equivalent to
n-version programming (e.g., that they detect the same errors) or that one
can replace the other.  Although somewhat beside the point, I would argue
that even the *FULL* DO-178A is not nearly rigorous enough for
safety-critical software.

> d. Functional tests of the system must be performed.  It must be shown
>    that the system will not have false alarms.
> However, how one complies with c. and d. I do not know.

THAT IS THE WHOLE POINT!  Sperry is suggesting replacing something we know
how to do with something nobody knows how to do and that has never been
shown to work with the degree of effectiveness required.  I would certainly
feel happier if the Sperry plan were tried first on real software that was
not Level 1 or Level 2 (by real software, I do not mean just university or
industrial experiments where the software is never used in a real production
environment).  I have few qualms about N-version programming being used in
conjunction with normal software development techniques, even on
safety-critical software.  But I have grave reservations about eliminating
any testing or other standard procedures on the basis of using it.  The
problem, of course, is that developing multiple versions is expensive.  So I
assume Sperry is trying to cut down on testing in order to save money.
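To come back to the three-points example for a moment, here is an invented
illustration (not the actual programs from the experiment) of how two
implementations that share no code and no algorithm can still misbehave on
the same degenerate input -- the difficulty is in the input, not in the text
of the programs.

    import math

    def angle_v1(a, b, c):
        # Version 1: angle at b via the normalized dot product and acos.
        bax, bay = a[0] - b[0], a[1] - b[1]
        bcx, bcy = c[0] - b[0], c[1] - b[1]
        dot = bax * bcx + bay * bcy
        norm = math.hypot(bax, bay) * math.hypot(bcx, bcy)
        # For coincident points norm is 0 (division by zero); for nearly
        # collinear points dot/norm can drift just outside [-1, 1], and
        # acos would raise a domain error.
        return math.acos(dot / norm)

    def angle_v2(a, b, c):
        # Version 2: angle at b via atan2 of the cross and dot products.
        bax, bay = a[0] - b[0], a[1] - b[1]
        bcx, bcy = c[0] - b[0], c[1] - b[1]
        cross = bax * bcy - bay * bcx
        dot = bax * bcx + bay * bcy
        # For coincident points cross and dot are both 0, and atan2(0, 0)
        # quietly returns 0.0 -- a meaningless answer for an undefined angle.
        return math.atan2(abs(cross), dot)

    # Two coincident points: both versions fail on the same input, though
    # in different ways (an exception versus a silent wrong answer).
    a, b, c = (1.0, 1.0), (1.0, 1.0), (2.0, 3.0)
    print(angle_v2(a, b, c))             # prints 0.0 for an undefined angle
    try:
        print(angle_v1(a, b, c))
    except ZeroDivisionError as e:
        print("version 1:", e)

A source-code comparison of the two would show complete "dissimilarity,"
yet the inputs on which they fail coincide; one raises an exception while
the other quietly returns a meaningless value.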
To return to the question of cost: unfortunately, I do not know how to
develop safety-critical software cheaply.  For the most part, greater
reliability and safety require more money.  Just using some sleight of hand
to relabel the software as Level 2 or Level 3 instead of Level 1 does not
make it any less safety-critical.  And voting together relatively untested
and unverified single versions has not been shown (in the experiments that
have tried it) to guarantee high reliability or safety.  In fact, the little
experimental evidence available has shown that as the number of errors in
the individual versions increases, the reliability gain to be expected from
using n-version programming decreases.

I am still worried despite Kevin's attempt at reassurance.

Nancy Leveson, University of California, Irvine

------------------------------

End of RISKS-FORUM Digest
************************