Subject: RISKS DIGEST 14.48
REPLY-TO: risks@csl.sri.com

RISKS-LIST: RISKS-FORUM Digest  Wednesday 7 April 1993  Volume 14 : Issue 48

FORUM ON RISKS TO THE PUBLIC IN COMPUTERS AND RELATED SYSTEMS
ACM Committee on Computers and Public Policy, Peter G. Neumann, moderator

Contents:
  Re: Shuttle Failure Blamed On Computer Glitch (Kriss A. Hougland)
  Safety-Critical Software, special issue of IEEE Software (John Knight)
  London Ambulance Service Inquiry Report (Brian Randell) [long/definitive]

The RISKS Forum is a moderated digest discussing risks; comp.risks is its Usenet counterpart. Undigestifiers are available throughout the Internet, but not from RISKS. Contributions should be relevant, sound, in good taste, objective, cogent, coherent, concise, and nonrepetitious. Diversity is welcome. CONTRIBUTIONS to RISKS@CSL.SRI.COM, with appropriate, substantive "Subject:" line. Others may be ignored! Contributions will not be ACKed. The load is too great. **PLEASE** INCLUDE YOUR NAME & INTERNET FROM: ADDRESS, especially .UUCP folks. REQUESTS please to RISKS-Request@CSL.SRI.COM.

Vol i issue j, type "FTP CRVAX.SRI.COM<CR>login anonymous<CR>AnyNonNullPW<CR> CD RISKS:<CR>GET RISKS-i.j<CR>" (where i=1 to 14, j always TWO digits). Vol i summaries in j=00; "dir risks-*.*" gives directory; "bye" logs out. The COLON in "CD RISKS:" is essential. "CRVAX.SRI.COM" = "128.18.10.1". <CR>=CarriageReturn; FTPs may differ; UNIX prompts for username, password.

For information regarding delivery of RISKS by FAX, phone 310-455-9300 (or send FAX to RISKS at 310-455-2364, or EMail to risks-fax@cv.vortex.com).

ALL CONTRIBUTIONS CONSIDERED AS PERSONAL COMMENTS; USUAL DISCLAIMERS APPLY. Relevant contributions may appear in the RISKS section of regular issues of ACM SIGSOFT's SOFTWARE ENGINEERING NOTES, unless you state otherwise.

----------------------------------------------------------------------

Date: Wed, 7 Apr 1993 11:35:21 -0700
From: "Kriss A. Hougland"
Subject: Re: Shuttle Failure Blamed On Computer Glitch (RISKS-14.47)

From all the information on the shuttle delay, the situation seems to be: A faulty sensor or broken wire that monitors the status of a valve. So far, I have heard that the problem is still a computer glitch. This is not correct. The software performed as required.

The solution to the problem is:
  1) find and fix the problem -- I would speculate a very $$$ option
  2) update the software to override the situation -- quick and easy, but very risky if the problem is the valve.

It looks like people are fixing hardware problems in software again. There is a classic risk of overriding hardware problems with software while introducing the ability to do the override correctly, or by a nasty side effect by the program (oops -- I was using that variable to turn on the engines!)

I hope at NASA, they are willing to assume the risk of correcting hardware problems in software. (NASA does have some good brains so I think they are taking a very educated guess from the telemetry.) I would hate to see another shuttle go up in flames (sorry about the pun).

------------------------------

Date: Wed, 7 Apr 93 16:20:35 EDT
From: jck@neptune.cs.virginia.edu
Subject: Safety-Critical Software, special issue of IEEE Software

CALL FOR ARTICLES
IEEE SOFTWARE
SAFETY-CRITICAL SOFTWARE

A forthcoming special issue of IEEE Software will focus on safety-critical software development. The theme of the special issue is to document recent achievements and current challenges in both research and application of safety-critical software technology.

Papers are solicited that report recent research results, both theoretical and experimental. Similarly, papers are solicited that document the best current practices, experience with these practices, and the major outstanding problems that the applications community sees.
Original articles are sought on relevant topics including (but not limited to):

  o Experience in safety-critical applications development in areas such as avionics, nuclear power systems, and medical devices.
  o Results of experiments in any area related to safety-critical software development.
  o Significant challenge areas whose definition and motivation arise from practical experience.
  o Development methods, processes, and standards designed for safety-critical software.
  o Specification and verification techniques.
  o Dependability assessment and modelling.
  o Tools and environments supporting safety-critical software development.

Submitted papers must not have been previously published nor be under consideration for publication elsewhere. To be considered for the special issue, please send eight copies of the complete manuscript to either of the guest editors:

  John C. Knight
  Department of Computer Science
  University of Virginia
  Thornton Hall
  Charlottesville
  VA 22903, USA
  (knight@virginia.edu)

  Bev Littlewood
  Center for Software Reliability
  The City University
  Northampton Square
  London, EC1V 0HB
  UK
  (b.littlewood@city.ac.uk)

Submission deadline is June 15 for IEEE SOFTWARE

------------------------------

Date: Wed, 24 Mar 1993 12:58:12 GMT
From: Brian.Randell@newcastle.ac.uk
Subject: London Ambulance Service Inquiry Report (long)

[Brian noted that his reason for sending this to RISKS was that, unlike the previous postings, this one is AUTHORITATIVE. He also wanted to give a clear impression of the scope and level of detail of the computer-related parts of the report, and of how they fitted into the report as a whole. PGN]

I have today managed to obtain a copy of the actual 80-page "Report of the Inquiry into the London Ambulance Service, February 1993".
The terms of reference of the Inquiry were

"To examine the operation of the CAD [Computer-Aided Dispatch] system, including:
  a) the circumstances surrounding its failures on Monday and Tuesday 26 and 27 October 1992
  b) the process of its procurement
and to identify the lessons to be learned for the operation and management of the London Ambulance Service against the imperatives of delivering service at the required standard, demonstrating good working relationships and restoring public confidence."

The Inquiry Team membership is listed as
  - Don Page, Chief Executive of South Yorkshire Metropolitan Ambulance and Paramedic Service NHS Trust
  - Paul Williams, senior computer audit partner of BDO Binder Hamlyn
  - Dennis Boyd, former Chief Conciliation Officer of the Advisory, Conciliation and Arbitration Service (ACAS)

The principal background facts given about the LAS in the report are that the service "covers a geographical area of about 600 square miles. It is the largest ambulance service in the world. It covers a resident population of some 6.8 million, but its daytime population is larger particularly in Central London. LAS carries over 5,000 patients every day. It receives between 2,000 and 2,500 calls daily; this includes between 1,300 and 1,600 999 calls."

The Inquiry's Report carries no copyright notice, and is freely available (see end of this message). Here are the scanned-in Table of Contents, and the complete text of the Sections entitled "COMPUTER AIDED DESPATCH SUMMARY", "COMPUTER AIDED DESPATCH RECOMMENDATIONS", "KEY SYSTEM PROBLEMS", "CAUSES AND EFFECTS OF BREAKDOWN ON 26 AND 27 OCTOBER 1992", and "FAILURE OF THE COMPUTER SYSTEM, 4 NOVEMBER 1992". (The section "CAUSES AND EFFECTS OF BREAKDOWN ON 26 AND 27 OCTOBER 1992" also contains a very detailed and interesting "Cause-Effects" diagram, with about 35 boxes and many directed links, which is not reproduced here.)

Brian Randell, Dept. of Computing Science, University of Newcastle, Newcastle upon Tyne, NE1 7RU, UK
Brian.Randell@newcastle.ac.uk  +44 91 222 7923

= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =

REPORT OF THE INQUIRY INTO THE LONDON AMBULANCE SERVICE
FEBRUARY 1993

CONTENTS

SECTION and Sub-Section

SUMMARY, CONCLUSIONS AND RECOMMENDATIONS
  Computer Aided Despatch Summary
  Management and Operations Summary
  Computer Aided Despatch Conclusions
  Management and Operations Conclusions
  Computer Aided Despatch Recommendations
  Management and Operations Recommendations
  Resource Implications of Inquiry Team Report

BACKGROUND
  Terms of Reference and Inquiry Team Membership
  Facts About the LAS
  Computer Aided Despatch
  LAS and CAD
  Report Description

THE SYSTEM AND ITS DEVELOPMENT
  Rationale For a CAD System
  Background to CAD
  Concept/Design
  Supplier Selection - The Procurement Process
  Project Management
  Systems Testing/Implementation
  Technical Communications
  Human Resources and CAD Training
  The System Structure

26 AND 27 OCTOBER AND 4 NOVEMBER 1992
  CAD Conclusions
  Demand on LAS Services 26 and 27 October
  Key System Problems
  System Configuration Changes
  Causes and Effects of Breakdown on 26 and 27 October 1992
  Failure of the Computer System, 4 November 1992

THE WAY FORWARD FOR CAD

MANAGEMENT AND OPERATION OF THE LAS
  The Scope of LAS Operations
  Managing the LAS
  Management / Union Relationships
  Resource Management
  Personnel Management
  LAS Accountability
  Public Confidence

ANNEX A: List of organisations and individuals who gave evidence
ANNEX B: Glossary of abbreviations

-----------------

COMPUTER AIDED DESPATCH SUMMARY

1001 What is clear from the Inquiry Team's investigations is that neither the Computer Aided Despatch (CAD) system itself, nor its users, were ready for full implementation on 26 October 1992. The CAD software was not complete, not properly tuned, and not fully tested. The resilience of the hardware under a full load had not been tested.
The fall back option to the second file server had certainly not been tested. There were outstanding problems with data transmission to and from the mobile data terminals. There was some scepticism over the accuracy record of the Automatic Vehicle Location System (AVLS). Staff, both within Central Ambulance Control (CAC) and ambulance crews, had no confidence in the system and were not all fully trained. The physical changes to the layout of the control room on 26 October 1992 meant that CAC staff were working in unfamiliar positions, without paper backup, and were less able to work with colleagues with whom they had jointly solved problems before. There had been no attempt to foresee fully the effect of inaccurate or incomplete data available to the system (late status reporting/vehicle locations etc.). These imperfections led to an increase in the number of exception messages that would have to be dealt with and which in turn would lead to more call backs and enquiries. In particular the decision on that day to use only the computer generated resource allocations (which were proven to be less than 100% reliable) was a high risk move.

1002 Whilst understanding fully the pressures that the project team were under to achieve a quick and successful implementation it is difficult to understand why the final decision was made, knowing that there were so many potential imperfections in the system.

1003 The development of a strategy for the future of computer aided despatch within the London Ambulance Service (LAS) must involve a full process of consultation between management, staff, trade union representatives and the Service's information technology advisers. It may also be appropriate to establish a wider consultative panel involving experts in CAD from other ambulance services, the police and fire brigade.
Consequently the recommendations from the Inquiry Team should be regarded as suggestions and options for the future rather than as definitive recommendations on the way forward. What is certain is that the next CAD system must be made to fit the Service's current or future organisational structure and agreed operational procedures. This was not the case with the current CAD.

-----------------

COMPUTER AIDED DESPATCH RECOMMENDATIONS

1009 These are the main recommendations drawn by the Inquiry Team from its investigations into the CAD system, each of which is covered fully in the main text. We recommend:

  a) that LAS continues to plan the implementation of a CAD system [3009];
  b) that the standing financial instructions should be extended to provide more qualitative guidance for future major IT procurements [3032];
  c) that any future CAD system must conform to the following imperatives:
       i. it must be fully reliable and resilient with fully tested levels of back-up;
       ii. it must have total ownership by management and staff, both within CAC and the ambulance crews;
       iii. it must be developed and introduced in a timescale which, whilst recognising the need for earliest introduction, must allow fully for consultation, quality assurance, testing, and training;
       iv. management and staff must have total, demonstrable, confidence in the reliability of the system;
       v. the new system must contribute to improving the level and quality of the provision of ambulance services in the capital;
       vi. any new system should be introduced in a stepwise approach, with, where possible, the steps giving maximum benefit being introduced first;
       vii. any investment in the current system should be protected and carried forward to the new system only if it results in no compromises to the above objectives [5004];
  d) re-training of CAC staff be carried out on the system to ensure that they are familiar with its features and that they are operating the system in a totally consistent way [5025];
  e) a suitably qualified and experienced project manager be appointed immediately to coordinate and control the implementation of the proposed first stage of CAD [5027];
  f) that a specialist review be undertaken of communications in the light of the final objectives of CAD and that any recommendations arising are actioned as part of the proposed second phase of CAD [5033];
  g) the establishment of a Project Subcommittee of the LAS Board [5040];
  h) that LAS recruit an IT Director, who will have direct access to the LAS Board [5041].

-----------------

KEY SYSTEM PROBLEMS

4007 As detailed earlier there were a number of basic flaws in the CAD system and its supporting infrastructure. In summary, the system and its concept has several major problems:

  a) a need for near perfect input information in an imperfect world;
  b) poor interface between crews, MDTs [Mobile Data Terminals] and the system;
  c) unreliability, slowness and operator interface.

**Need for near perfect information**

4008 The system relied on near perfect information of vehicle location and crew/vehicle status. Without accurate knowledge of vehicle locations and status the system could not allocate the optimum resource to an incident. Although some poor allocations may be attributable to errors in the allocation routine, it is believed that the majority of allocation errors were due to the system not knowing the correct vehicle location or status of vehicles that may have proved more appropriate.
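[Illustrative sketch, not part of the Inquiry Report. The point made in 4008 can be shown with a few lines of Python: an allocation routine that is itself correct still proposes the wrong vehicle when the position it holds for a closer vehicle is out of date. Everything here is an assumption made for illustration: the callsigns, the coordinates, the straight-line distance measure and the function names are hypothetical and say nothing about how the LAS allocation routine actually worked.]

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Vehicle:
    callsign: str                      # hypothetical callsign
    true_pos: tuple[float, float]      # where the vehicle really is
    reported_pos: tuple[float, float]  # last position the system received
    available: bool = True             # status as the system believes it


def distance(a, b):
    """Straight-line distance; a simplifying assumption for this sketch."""
    return hypot(a[0] - b[0], a[1] - b[1])


def propose_nearest(vehicles, incident):
    """Propose the available vehicle whose *reported* position is closest."""
    candidates = [v for v in vehicles if v.available]
    return min(candidates, key=lambda v: distance(v.reported_pos, incident))


def truly_nearest(vehicles, incident):
    """The vehicle that is really closest, which the system cannot see."""
    candidates = [v for v in vehicles if v.available]
    return min(candidates, key=lambda v: distance(v.true_pos, incident))


incident = (0.0, 0.0)
fleet = [
    # A1 is really 1 unit away, but the system still holds an old position.
    Vehicle("A1", true_pos=(1.0, 0.0), reported_pos=(9.0, 0.0)),
    # B2 is 4 units away and its position is reported accurately.
    Vehicle("B2", true_pos=(4.0, 0.0), reported_pos=(4.0, 0.0)),
]

proposed = propose_nearest(fleet, incident)
best = truly_nearest(fleet, incident)
print("proposed:", proposed.callsign, "actually nearest:", best.callsign)
```

[In this toy fleet the routine proposes B2 although A1 is closer. The selection logic has no bug; the error comes entirely from the stale input, which is the distinction 4008 draws between errors in the allocation routine and errors in the information fed to it.]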
**Poor interface between crews, MDTs and the system**

4009 Given that the system required almost perfect information on vehicle location and status, each of the component parts of the chain from crews to despatch system must operate well. This was not the case. From our investigations, possible reasons for the despatch system not knowing the correct vehicle location or status of vehicles that may have proved more appropriate include:

  a) a failure of the system to catch all of the data;
  b) a genuine failure of crews to press the correct status button owing to the nature and pressure of certain incidents;
  c) poor coverage of the radio system, i.e. black spots;
  d) crews failing to press status buttons as they became frustrated with re-transmission problems;
  e) a radio communications bottle neck, e.g. when crews commence duty and try to log on via their vehicle's MDT or during other busy periods;
  f) missing or swapped callsigns;
  g) faults in the "hand shaking" routines between MDTs and the despatch system, e.g. MDTs showing Green and OK, but system screens showing them in a different status;
  h) crews intentionally not pressing the correct status buttons or pressing them in an incorrect order;
  i) crews taking a different vehicle to that which they have logged on to, or a different vehicle/crew responding to that allocated by the system;
  j) incorrect or missing vehicle locations;
  k) too few call takers.

4010 The above reasons are often interconnected.

**Unreliability, Slowness and Operator Interface**

4011 It is reported that the system "fell over" a few times before 26 October 1992. More common was the frequent "locking up" of screens. Staff had been instructed to re-boot their screens if they locked up. The system also slowed up when under load and whilst it was doing its "house keeping" at 02:00 hours each morning.
4012 General imperfections include:

  a) failure to identify all duplicated calls;
  b) lack of prioritisation of exception messages;
  c) exception messages and awaiting attention queues scrolling off the top of the allocators'/exception rectifiers' screens;
  d) software resource allocation errors;
  e) general robustness of the system (workstation and MDT "lockups");
  f) slow response times for certain screen based activities.

-----------------

CAUSES AND EFFECTS OF BREAKDOWN ON 26 AND 27 OCTOBER 1992

4016 On 26 and 27 October 1992 the computer system itself did not fail in a technical sense. Response times did on occasions become unacceptable, but overall the system did what it had been designed to do. However, much of the design had fatal flaws that would, and did, cumulatively lead to all of the symptoms of systems failure.

4017 In order to work effectively the system needed near perfect information all of the time. Without this the system could not be expected to propose the optimum resource to be allocated to an incident. There were many imperfections in this information which individually may not be serious, but which cumulatively were to lead to system "failure".

4018 The changes to CAC operation on 26 and 27 October 1992 made it extremely difficult for staff to intervene and correct the system. Consequently, the system rapidly knew the correct location and status of fewer and fewer vehicles. The knock on effects were:

  a) poor, duplicated and delayed allocations;
  b) a build up of exception messages and the awaiting attention list;
  c) a slow up of the system as the messages and lists built up;
  d) an increased number of call backs and hence delays in telephone answering.

4019 Each effect quickly reinforced the others leading to severe lengthening of response times. A more detailed explanation follows.

4020 A cause and effect diagram is shown opposite, Diagram 4.5, for the operation of the system on 26 and 27 October 1992.
As the number of incidents increases there are several naturally reinforcing loops which escalate the problems. A description of the course of events and interactions follows.

4021 When the system was fully implemented at 07:00 hours 26 October 1992 the system was lightly loaded. Staff and system could cope with the various problems (left hand side of the diagram) which caused the despatch system to have imperfect information on the fleet and its status. As the number of incidents increased, incorrect vehicle location or status information received by the system increased. With the new room configuration and method of operation, allocators were less able to spot and correct errors.

4022 The amount of incorrect location and status information in the system increased with four direct effects:

  a) the system made incorrect allocations: multiple vehicles sent to same incident, or not the closest vehicle sent;
  b) the system had fewer resources to allocate, increasing the problems of effect a);
  c) as previously allocated incidents fed through the system and suffered from the problems on the left hand side of the diagram which resulted in the system not having the resource's correct status, the system placed covered calls that had not gone through the amber, red, green status cycle, back on the attention waiting list;
  d) failures because of the problems on the left hand side of the diagram caused the system to generate exception messages.

4023 Starting with effect 4022 d), the number of exception messages increased rapidly to such an extent that staff were unable to clear the queue. As the exception message queue grew the system slowed. The situation was made worse as unrectified exception messages generated more exception messages. With the increasing number of "awaiting attention" and exception messages it became increasingly easy to fail to attend to messages that had scrolled off the top of the screen.
Failing to attend to these messages arguably would have been less likely in a "paper-based" environment.

4024 Effects 4022 b) and c). With fewer resources to allocate the system would recommend what it saw as the closest vehicle. This was often an incorrect allocation as a closer vehicle was actually available. It took longer to allocate resources for three reasons:

  a) the allocator had to spend more time finding and confirming suitable resources;
  b) incidents were held until a suitable resource became available;
  c) resource proposal software took longer to process as resources became more distant.

4025 There was a reinforcing effect in that as allocators tried to contact a resource, that resource was unavailable for allocation to another incident. Once an allocator "clicked onto" a resource its status turned to dark green thus preventing it from being allocated elsewhere. It is reported that one allocator was allocating resources, but not mobilising them. Any delay in allocation or mobilisation was a delay to a patient.

4026 It also took longer to allocate resources as more two line summaries fed through the system. Standard two line summaries of incidents awaiting resource allocation included those that had previously been covered, but were not seen by the system as complete. As this queue built up it caused the system to slow.

4027 At one stage two line summaries were scrolling onto the screen so fast that in trying to stop summaries moving off the screen, allocators were further slowed in their tasks.

4028 In summary, effects 4022 b) and c) contributed to incorrect allocations, a slowing of the system and uncovered incidents all leading to delays to patients. The number of uncovered incidents was probably increased when at one stage the exception report queue was cleared in an effort to increase the speed of the system.

4029 Effect 4022 a), incorrect allocations, led directly to patient delays and crew frustration.
Crew frustration was further increased by delays in arriving at the scene and the reaction from the public.

4030 Crew frustration may have been responsible for:

  a) increasing the instances when crews didn't press the status buttons in the correct sequence;
  b) the allocated crew taking a different vehicle, or a different crew and vehicle responding to the incident.

4031 In the month preceding 26 and 27 October 1992 crew frustration also led to an increase in radio traffic which, owing to the potential for radio bottlenecks, increased the number of failed data mobilisations and voice communication delays. In turn, and completing the loop, failed data mobilisations and voice communications delays lead to further increased voice communications and crew frustration. On 26 October the instruction was for minimum voice communication. Statistics show that the number of successful data mobilisations increased. However, with no voice communications, wrong or multiple allocations were not corrected thus negating the beneficial effect of increased data mobilisations.

4032 Turning to telephone communications between the public and CAC, delays to patients and uncovered incidents greatly increased the number of call backs, thus increasing the total number of calls handled. An increased call volume, together with a slow system and too few call takers caused significant delays in telephone answering, thereby further increasing delays to patients.

--------------

FAILURE OF THE COMPUTER SYSTEM, 4 NOVEMBER 1992

4033 Following the CAD problems of 26 and 27 October 1992, CAC had reverted to a semi manual method of operation, identical to that which had operated with a variable degree of success before 26 October.
4034 This method of working comprised:

  a) calls being taken on the CAD system (including use of gazetteer);
  b) incident details being printed out in CAC;
  c) optimum vehicle resource identified through contact with nearest station to incident;
  d) mobilisation of the resource via CAD, direct to the station printer or to the MDT.

4035 In general CAC staff were comfortable with operating this system as they found the computer based call taking and the gazetteer for the most part reliable. There were known inadequacies with the gazetteer and occasional "lock-up" problems with workstations, but overall the benefits outweighed the disadvantages. The vehicle crews were also more comfortable as the stations still had local flexibility in deciding which resource to allocate to an incident. The radio voice channels were available to help clear up any mobilisation misunderstandings. Largely as a result of the problems of the previous week, additional call taking staff had been allocated to each shift thus reducing significantly the average call waiting time.

4036 This system operated with reasonable success from the afternoon of 27 October 1992 up to the early hours of 4 November.

4037 However, shortly after 2am on 4 November the system slowed significantly and, shortly after this, locked up altogether. Attempts were made to re-boot (switch off and restart workstations) in the manner that CAC staff had previously been instructed by Systems Options to do in these circumstances. This re-booting failed to overcome the problem with the result that calls in the system could not be printed out and mobilisations via CAD from incident summaries could not take place. CAC management and staff, having assured themselves that all calls had been accounted for by listening to the voice tapes, and having taken advice from senior management, reverted fully to a manual, paper-based system with voice or telephone mobilisation.
As these problems occurred in the early hours when the system was not stretched the operational disruption was minimised.

4038 SO [Systems Options Ltd.] were called in immediately to investigate the reasons for the failure. In particular LAS required an explanation as to why the specified fallback to the standby system had not worked.

4039 The Inquiry Team has concluded that the system crash was caused by a minor programming error. In carrying out some work on the system some three weeks previously the SO programmer had inadvertently left in the system a piece of program code that caused a small amount of memory within the file server to be used up and not released every time a vehicle mobilisation was generated by the system. Over a three week period these activities had gradually used up all available memory thus causing the system to crash. This programming error should not have occurred and was caused by carelessness and lack of quality assurance of program code changes. Given the nature of the fault it is unlikely that it would have been detected through conventional programmer or user testing.

4040 The failure of the fallback procedures arises as a consequence of what was believed at the time to be only a temporary addition of printers. The concept of the system was that it would operate on a totally paperless basis. Printers were only added, as a short term expedient, in order to implement at least a partial system at the originally planned implementation date of 8 January 1992.

4041 The fallback to the second server was never implemented by SO as an integral part of this level of CAD implementation. It was always specified, and indeed implemented, as part of the complete paperless system and thus arguably would have activated had the system actually crashed on 26 and 27 October 1992. However, there is no record of this having been tested and there can be no doubt that the effects of server failure on the printer-based system had not been tested.
This was a serious oversight on the part of both LAS IT staff and SO and reflects, at least in part, the dangers of LAS not having their own network manager.

ISBN 0 905133 70 6

Further copies available from:
  Communications Directorate,
  South West Thames Regional Health Authority,
  40 Eastbourne Terrace,
  London W2 3QR
  071-725 2551

Dept. of Computing Science, University of Newcastle, Newcastle upon Tyne, NE1 7RU, UK
Brian.Randell@newcastle.ac.uk  PHONE = +44 91 222 7923

------------------------------

End of RISKS-FORUM Digest 14.48
************************