Subject: RISKS DIGEST 10.37
REPLY-TO: risks@csl.sri.com

RISKS-LIST: RISKS-FORUM Digest  Thursday 13 September 1990  Volume 10 : Issue 37

        FORUM ON RISKS TO THE PUBLIC IN COMPUTERS AND RELATED SYSTEMS
   ACM Committee on Computers and Public Policy, Peter G. Neumann, moderator

Contents:
  Expert system in the loop (Martyn Thomas)
  Re: Railway Safe Working - large analogue systems (Dave Parnas)
  Re: Analog vs digital reliability (Rob Sartin, David Murphy)
  The need for software certification (John H. Whitehouse)
  ZIP code correcting software (Richard W. Meyer)
  Software Bugs "stay fixed"? (Jeff Jacobs)

The RISKS Forum is moderated.  Contributions should be relevant, sound, in
good taste, objective, coherent, concise, and nonrepetitious.  Diversity is
welcome.  CONTRIBUTIONS to RISKS@CSL.SRI.COM, with relevant, substantive
"Subject:" line (otherwise they may be ignored).  REQUESTS to
RISKS-Request@CSL.SRI.COM.  TO FTP VOL i ISSUE j: ftp CRVAX.sri.com, login
anonymous, any non-null password, cd sys$user2:[risks], GET RISKS-i.j
(j is TWO digits).  Vol summaries in risks-i.00 (j=0); "dir risks-*.*" gives
directory; bye logs out.  ALL CONTRIBUTIONS ARE CONSIDERED AS PERSONAL
COMMENTS; USUAL DISCLAIMERS APPLY.  The most relevant contributions may
appear in the RISKS section of regular issues of ACM SIGSOFT's SOFTWARE
ENGINEERING NOTES, unless you state otherwise.

----------------------------------------------------------------------

Date: Thu, 13 Sep 90 18:23:50 BST
From: Martyn Thomas
Subject: Expert system in the loop

According to Electronics Weekly (Sept 12th, p2):

"Ferranti will study for MoD the feasibility of integrating a knowledge-based
expert system into naval command systems, to advise commanders in battle.

"The system would reduce the risk of mistakes [sic] because battle situations
are too complex for a command team to appreciate properly in the short time
available. [...] If a trial system is built, it will be installed in a Type-23
frigate."

So: the battle situation is too complex to understand.  The expert system is
likely to be too complex to understand, too.  The commander is unlikely to
ignore the advice of the expert system, unless it is clearly perverse.  This
means that the decision (say, to launch a weapon) is being taken, in practice,
by the expert system.

Is there no way we can stop this trend towards automated launch systems,
before it becomes completely uncontrollable?  It would be a good subject for
an international treaty, even though it would be hard to find a way to verify
compliance.

The Aegis system on the USS Vincennes led to the death of several hundred
people when a civil Airbus, on a scheduled flight, was shot down in weather
conditions where the aircraft would have been clearly recognisable from the
bridge of the Vincennes through binoculars.  That tragedy was ascribed to the
poor user-interface of Aegis, combined with an atmosphere of eager tension on
board which made a decision to fire more likely.

How can we stop people building ever-more-complex decision-support systems,
and thereby losing their ability to take decisions themselves?

Martyn Thomas, Praxis plc, 20 Manvers Street, Bath BA1 1PX UK.
Tel: +44-225-444700.  Email: mct@praxis.co.uk

------------------------------

Date: Wed, 12 Sep 90 23:24:14 EDT
From: parnas@qucis.queensu.ca (Dave Parnas)
Subject: Re: Railway Safe Working - large analogue systems (Skillicorn, RISKS-10.36)

Although I found the discussion of railway safety mechanisms interesting, the
most interesting question for me was why the systems described would be
considered "analogue".
There seems to be an implicit assumption that anything that is not
computerised is "analogue".  Conventionally, analogue systems are those in
which there is an infinite set of stable states, e.g., systems of springs and
weights, or electrical networks comprising resistors, inductors, capacitors,
etc.  Digital systems are constructed of components that have been designed to
have a relatively small number of stable states, with the transitions between
those states being so rapid that the time spent in other states is negligible.
The most common instances are binary, with two stable states, but other
numbers have been tried.  For example, old telephone switches often used
relays with 10 stable states.  These were actually hybrid systems because,
when the relays were in one of their stable states, analogue signals were
conducted between the two subscribers.

Strictly speaking, digital systems do not exist.  Relays, flip-flops, etc.,
are actually constructed of analogue components; they can be treated as
digital only by ignoring the time in which the circuits are changing between
"stable" states.  However, the circuits are designed so that one can ignore
those times.  The transition time determines the clock rate for those
systems.  Subtle hardware bugs often occur when, because of poor design, the
transition times are ignored when they should not have been.

The description of the railway safety systems makes them sound like digital
systems made of old-fashioned technology.  The little arms that were described
are considered to be either "up" or "down"; one neglects the time in which
they are moving between those states.  The buttons described are either
depressed or released; one neglects the time in which they are between those
two positions.  The switches are designed so that they are either "on" or
"off", and good switches are designed so that we can neglect the times when
they are in an "in between" situation.  It is important not to assume that
"mechanical" is an antonym of "digital".  Looking back at the earliest digital
computers, one finds that they were made of mechanical components not unlike
those described as current railway technology.

If we are going to look for examples of large analogue systems, I think we
have to look elsewhere.  The servomechanisms used in speed control on most
vehicles, and for flight-surface control on aircraft, would seem better
candidates.  Of course, just as the digital components are actually
approximated by analogue elements, those analogue systems can now be
approximated by digital systems.  In those approximations the number of stable
states is so large, and the states are so close in some topology, that one
neglects the fact that the number of states is finite and neglects the "gaps"
between them.  Here too, we have a rich source of subtle bugs.
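
A minimal sketch of this abstraction, in Python (the threshold values are
illustrative, not taken from any real logic family's specification): a reading
counts as a digital "0" or "1" only outside a forbidden band, and a sample
taken mid-transition has no digital value at all.

    # Sketch of the "digital" view of an analogue voltage.  Thresholds
    # are illustrative, roughly TTL-like, not from any real spec.

    V_LOW_MAX  = 0.8   # at or below this, read as logic 0
    V_HIGH_MIN = 2.0   # at or above this, read as logic 1

    def digital_view(volts):
        """Return 0 or 1, or None when the sample falls in the forbidden
        band between the two designed-in stable states."""
        if volts <= V_LOW_MAX:
            return 0
        if volts >= V_HIGH_MIN:
            return 1
        return None   # mid-transition: the digital abstraction does not hold

    # A 0 -> 1 transition sampled too early is neither state:
    for v in (0.2, 1.4, 3.1):
        print(v, "->", digital_view(v))   # 0.2 -> 0, 1.4 -> None, 3.1 -> 1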
------------------------------

Date: Tue, 11 Sep 90 11:37:28 PDT
From: Rob Sartin
Subject: Re: Analog vs digital failure modes and conservation laws

>we'd today call an axiomatic/object-oriented approach: He designed a series
>of basic primitives to manipulate things like money, accounts, and so on. The
>primitives enforced, in a very transparent way, such basic "laws" as "the law
>of conservation of money": Money can be neither created nor destroyed, it

One can only hope that such a system would not (or would, depending on your
views on controlling the economy) be used at the Federal Reserve.

Rob Sartin, Software & Systems Lab, Hewlett-Packard    hplabs!sartin
415-857-7592

------------------------------

Date: Wed, 12 Sep 90 09:15:35 BST
From: David Murphy
Subject: Analog vs digital reliability (Wolitzky, RISKS-10.33)

> Might as well carry this nit-picking one level further.  As long as
> your computer's transistors, capacitors, or whatever rely on electrons,
> photons, or other quantum-mechanical wave/particles with discrete
> states, you are justified in considering them to be digital.  But this
> is all silly -- the implementation is irrelevant.  If you can treat
> the computer as a black box that behaves digitally, why not label it
> as such?

And therein lies the risk.  One cannot always treat `digital' systems as
digital because, as many posters have pointed out, mother nature will
sometimes find ways to let you know your abstraction is unsound.  One of my
favourite examples is Jefferies' ``Bifurcation to Chaos in clocked digital
systems containing autonomous timing elements'' (Phys. Lett. A, Vol 115,
No 3), where a deterministic communications protocol between two simple
`digital' systems is shown to display chaos; this underlines the point that
even if you start with two systems that display nice, finite,
easily-abstracted behaviours, their composition may not.

Another point that is often neglected is that many of the things we assume
exist in the digital world are, in fact, forbidden.  It is completely
impossible to build a fair arbiter or synchroniser, for instance; the axioms
that should hold for such a beast are mutually contradictory; no continuous
system can behave that way.  Thus there will come a time in every digital
design when certain questions will remain unanswered until we move outside
the digital paradigm -- questions like `how fast will it run?' (meaning `how
fast will it run with a failure rate small enough that my customers won't
notice').  There is no doubt that asynchronous design in a clocked digital
paradigm will only work with some constraining assumptions about how quickly
those `asynchronous' signals are actually likely to appear.  And since the
real world is asynchronous -- there is no such thing as an isolated computer
-- this means that any digital design technique is just a way of improving
engineering confidence in the product, not of guaranteeing correctness.
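
The flavour of `a failure rate small enough that my customers won't notice'
can be made concrete with the standard engineering estimate for synchroniser
failure, MTBF = exp(t_r/tau) / (T_w * f_clk * f_data): allowing more time to
resolve makes failure exponentially rarer, but never impossible.  A minimal
sketch (all parameter values below are illustrative, not from any real
device):

    import math

    # Standard estimate of mean time between synchroniser failures.
    # All parameter values here are illustrative, not from a real part.
    tau    = 0.5e-9   # metastability decay time constant (seconds)
    T_w    = 0.1e-9   # metastability aperture/window (seconds)
    f_clk  = 50e6     # clock frequency (Hz)
    f_data = 1e6      # rate of asynchronous input events (Hz)

    def mtbf(resolve_time):
        """Expected seconds between failures, given the time the
        flip-flop is allowed to resolve before its output is sampled."""
        return math.exp(resolve_time / tau) / (T_w * f_clk * f_data)

    # Failure can be made arbitrarily rare, but the rate never reaches zero:
    for t_r in (5e-9, 10e-9, 20e-9):
        print("resolve %2.0f ns -> MTBF %.3g s" % (t_r * 1e9, mtbf(t_r)))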
------------------------------

Date: Wed, 12 Sep 90 22:47:20 -0400
From: al357@cleveland.freenet.edu (John H. Whitehouse)
Subject: The need for software certification

In recent weeks, the RISKS Forum has seen much discussion of software
certification and of the relative risk of digital vs. analog computing.  I
would like to suggest another realm of discussion which relates to the notion
of software certification.  Specifically, I would like to see some discussion
here of the value of certifying software professionals.  In this regard, I
refer to those certification designators administered by the Institute for
Certification of Computer Professionals in Des Plaines, Illinois.

From its inception, our field has never been able to find enough qualified
people to meet the demand.  As a result, we have drafted people from a wide
variety of academic disciplines.  Further, many workers in our field have
never even seen the inside of a university.  In addition, very few
universities offered majors in computer-related disciplines until "recently",
i.e., within the last 25 years.

I am not trying to condemn those people now practicing in this field who lack
a "proper" academic background.  In fact, I will openly state that many people
who do have a computer science major are unable to perform adequately in the
field.  Regardless of background, some people seem to have developed an
adequate understanding of this field whereas others have not.

The fact that the vast majority of managers were either non-technical from
the start or have become non-technical over the years means that probably the
vast majority of people in this field are not receiving evaluations which
reflect the actual quality of their work and the depth of their
understanding.  This cuts two ways.  First, many poor practitioners continue
to survive in the field despite their poor performance.  Second, many
excellent technical people fail to receive proper reward for their
accomplishments.  Hiring is expensive and usually done pretty much in the
blind.  Firing is risk-laden in our litigious society.

It is my contention that the vast majority of software defects are the
product of people who lack understanding of what they are doing.  These
defects present a risk to the public, and the public is not prepared to
assess the relative skill level of software professionals.  For these
reasons, I favor the certification of software professionals.  We have tried
to bring this about for 28 years on a voluntary basis, but those who know
that they could never pass make high-sounding arguments to convince others
that voluntary certification is not a desirable goal.  Academics have not
joined in the debate, since they are generally immune from the problem.  I
would like to hear some discussion on this issue.

   [We have been around this topic before, with Nancy Leveson in RISKS-5.28
   and following discussion in RISKS-5.33, Appendix B of the ACARD report
   noted in RISKS-4.14, John Shore with some appropriate references in
   RISKS-4.78, and earlier discussions.  Perhaps it is time to try again.
   PGN]

------------------------------

Date: Thu, 13 Sep 90 09:43:15 GMT
From: DBQABAA@CFRVM
Subject: ZIP code correcting software

Several administrative departments on our campus are interested in purchasing
software which claims to validate and correct ZIP codes.  This could be used
interactively, to tell an operator that he/she is keying a ZIP code which is
invalid for the given street and city information; it could also be used in
batch, to "clean up" address data.  Apparently the US Post Office maintains
massive tables of address data which various software vendors use as the
basis for these kinds of software packages.

Does anyone have experience with this kind of software?  My concerns would be
as follows:

1. What is the error rate with this process?

2. What happens when additions and changes are made by the Post Office to
   their tables but the vendor has not yet gotten the updates out to the end
   user of the software?  Will the software keep "correcting" a ZIP code
   which is in fact already correct?

I've had personal experiences where some of my mail suddenly shows up with a
changed (and wrong) ZIP code.  After contacting the sender and getting the
ZIP code changed back to what it should be, I've seen it get changed again a
few months later.  It's obvious to me that the senders are running a batch
update on their mailing lists using software that says my ZIP code is wrong.
This seems to me to be part of a trend for people to put their faith in
"error-correcting" software which can't always tell what really needs
correcting.
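
One defensive design, sketched below in Python with a hypothetical stand-in
for the vendor's address table: let the software flag a disagreement and
propose a value, but never let a batch run overwrite a ZIP code silently,
precisely because the vendor's table may itself be stale.

    # Sketch of a conservative address-hygiene policy.  ZIP_TABLE is a
    # hypothetical stand-in for a vendor's (street, city, state) -> ZIP data.
    ZIP_TABLE = {("123 MAIN ST", "TAMPA", "FL"): "33602"}

    def check_zip(street, city, state, zip_code):
        """Return (status, suggestion).  Never mutates the record."""
        key = (street.upper(), city.upper(), state.upper())
        expected = ZIP_TABLE.get(key)
        if expected is None:
            return ("unknown-address", None)   # table gap: change nothing
        if expected == zip_code:
            return ("ok", None)
        return ("mismatch", expected)          # queue for human review

    status, suggestion = check_zip("123 Main St", "Tampa", "FL", "33612")
    print(status, suggestion)   # mismatch 33602 -- reviewed, not auto-applied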
--- Rich Meyer

Richard W. Meyer, University Computing Services, University of South Florida,
Tampa    DBQABAA@CFRVM.BITNET

------------------------------

Date: 12 Sep 90 16:19:14 EDT
From: Jeff Jacobs <76702.456@compuserve.com>
Subject: Software Bugs "stay fixed"? (Steve Smith, RISKS-10.35)

>My experience with "bit rot", where previously solved problems reappear, is
>that they are usually caused by poor configuration control. While most
>systems have CC tools, like sccs on UNIX or CMS on VAX/VMS, getting your
>friendly average programmer to use them is like pulling teeth. When
>management insists on use of the tools, you will find lots of log entries
>of the form:

>Solutions?

The following is a description of a "solution".  (Way back in the early days
of the TDRSS ground station network development...)

During my troubleshooting activities, it quickly became obvious that many of
the "show-stoppers" were not "bugs", but were a combination of procedural,
operational and communication problems.  The large number of different test
configurations, the number of different groups performing testing, and the
technical inexperience of the test and configuration management personnel
resulted in a chaotic situation.  Although a conceptually sound set of
procedures had been drafted, they were manual and paper-based, and were
unable to keep up with the level of activity.  An average of 4 days per month
were lost simply due to errors in editing command files for compiling and
linking.  These were *project* months, where several dozen people would be
waiting for me to resolve a "show-stopper".  I would spend enormous amounts
of time trying to resolve what software was in use, how it was used, etc.,
etc.  Furthermore, it was almost impossible for management to determine the
status of fixes, enhancements and updates to the software.

The situation was quite amenable to automation.  Although my initial proposal
to TRW was turned down, I proceeded to create an initial version for my own
group's use; we were making "quick-fixes" at a rapid rate, and needed the
support that the tool would provide.

The tool, called the "Fix Processor", was completed in six weeks.  It is
effectively an expert system, built using Object Oriented and AI techniques
in a Lisp-like language.  (It was described in a prize-winning paper,
"Utilizing Expert Systems to Improve the Configuration Management Process",
by Sherri Sweetman, George Washington University, for the Project Management
Institute.)

This is **NOT** a "version control" system; it is much more complex.
Software "fixes" (including updates and enhancements) migrated physically.
Initially, all software resided in controlled baselines on development
machines.  As developers made changes to a given task, code was taken from
the baseline and moved to work areas.  As it went through various stages of
testing and acceptance, it transitioned to the physical control of other
groups.  Prior to a delivery to WSGT, all accepted changes would be "rolled"
into a new baseline and the entire process would start over.

The Fix Processor was initially used by the various integration and testing
groups.  Once its effectiveness was proven, it was also propagated to the
development groups, who, although initially opposing it, soon became its
strongest advocates!!!
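
The heart of the tool was enforcing that migration path.  A minimal sketch of
the idea in Python (state names are hypothetical; the real Fix Processor was
an expert system in a Lisp-like language, and did far more than this):

    # Sketch of the fix-migration lifecycle: every move between areas is
    # either refused or logged.  State names here are hypothetical.
    ALLOWED = {
        "baseline":    {"work-area"},               # developer takes code out
        "work-area":   {"integration", "baseline"}, # submit, or abandon fix
        "integration": {"accepted", "work-area"},   # pass tests, or bounce
        "accepted":    {"baseline"},                # rolled into new baseline
    }

    class Fix:
        def __init__(self, task):
            self.task, self.state, self.log = task, "baseline", []

        def transition(self, new_state):
            """Refuse illegal moves; record legal ones for status queries."""
            if new_state not in ALLOWED[self.state]:
                raise ValueError("%s: %s -> %s not allowed"
                                 % (self.task, self.state, new_state))
            self.log.append((self.state, new_state))
            self.state = new_state

    fix = Fix("task-123")   # hypothetical task name
    for step in ("work-area", "integration", "accepted", "baseline"):
        fix.transition(step)
    print(fix.log)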
Features of the Fix Processor included:

- Automated the tedious, time-consuming and error-prone operations required
  to transition a piece of software.  This included creation of new
  directories, copying files, generation of command files, etc., all of which
  formerly had to be performed manually.  This was crucial to the widespread
  acceptance and use of the system.

- Separated compiling and linking of new fixes from actual incorporation into
  testing configurations.  (Tasks had to be compiled and linked prior to
  actual testing.  It was also sometimes necessary to remove a new version of
  a task from a test configuration.)

- An on-line reporting and query facility for determining the status of
  software, usage in test configurations, etc.  This was not only a key
  management tool, it also provided a means of communication between the
  various integration and testing groups.  Problems which formerly took days
  to resolve were literally reduced to minutes.

- Extensive validation of user inputs and requests.

- Support and tracking for special requirements, such as debugging, "quick
  and dirty" fixes, etc.

- Automated collection of all approved changes into a new baseline, a process
  which formerly required more than a week (and was incredibly error-prone).

- Co-ordinated task transitions with the necessary paper trail.  (I
  subsequently revised the paperwork system.)

- Logged all activities and results into a database.

The net result of these changes was a smoothly functioning, manageable
project.  Mammoth turnovers were eliminated.  Software was turned over by the
developers in small, manageable increments on a continuous basis (as opposed
to the previous "here it all is" method, which would take weeks to straighten
out).  Problems were easily identified and quickly solved; productivity and
morale were immeasurably improved, and the project became truly manageable.

Note that one of the key elements in the success of the Fix Processor is that
it made life *easier* for everybody involved (including me).  Note also that
the "automation" helped ensure that problems didn't recur, and it was quite
easy to verify the complete path of software.

Jeffrey M. Jacobs, ConsArt Systems Inc, Technology & Management Consulting
P.O. Box 3016, Manhattan Beach, CA 90266    (213)376-3802

------------------------------

End of RISKS-FORUM Digest 10.37
************************