Subject: RISKS DIGEST 11.30
REPLY-TO: risks@csl.sri.com

RISKS-LIST: RISKS-FORUM Digest  Monday 18 March 1991  Volume 11 : Issue 30

        FORUM ON RISKS TO THE PUBLIC IN COMPUTERS AND RELATED SYSTEMS
   ACM Committee on Computers and Public Policy, Peter G. Neumann, moderator

Contents:
  "The Trigger Effect" and Coke robot... (Dwight D. McKay)
  Strange numbers on your beeper (Esther Filderman)
  Re: "What the laws enforce" [RTM] (TK0JUT1)
  Voice Recognition Experiment (Dave Turner)
  `Sendsys' forgery - denial of service? (Doug Sewell)
  Re: Medical privacy and urine testing (Alan Wexelblat)
  Re: Long-lived bugs (Pete Mellor)
  A cautionary tale [long] (John DeTreville) [On the Midway in a 3-ring SRCus?]

 The RISKS Forum is moderated.  Contributions should be relevant, sound, in
 good taste, objective, coherent, concise, and nonrepetitious.  Diversity is
 welcome.  CONTRIBUTIONS to RISKS@CSL.SRI.COM, with relevant, substantive
 "Subject:" line.  Others ignored!  REQUESTS to RISKS-Request@CSL.SRI.COM.
 For vol i issue j, type "FTP CRVAX.SRI.COM<CR>login anonymous<CR>AnyNonNullPW
 <CR>CD RISKS:<CR>GET RISKS-i.j" (where i=1 to 11, j always TWO digits).
 Vol i summaries in j=00; "dir risks-*.*" gives directory; "bye" logs out.
 ALL CONTRIBUTIONS CONSIDERED AS PERSONAL COMMENTS; USUAL DISCLAIMERS APPLY.
 Relevant contributions may appear in the RISKS section of regular issues of
 ACM SIGSOFT's SOFTWARE ENGINEERING NOTES, unless you state otherwise.

----------------------------------------------------------------------

Date: Fri, 15 Mar 1991 13:33:26 -0500 (EST)
From: "Dwight D. McKay"
Subject: "The Trigger Effect" and Coke robot...

Having just gotten into work after being stranded at home with no power for
two days due to an ice storm here in the Midwest, I am reminded of the
reliance we all place on basic services.  I didn't lose phone service (thank
you, AllTel!) or gas, but I had no electricity.  This meant I had no heat, as
my furnace needs electricity to sense temperature, run the air circulation
fan, and even start the gas burning (it's pilot-light-less).  Our kitchen is
all electric, so that was out, and so on.

Even when power was restored, the ordeal was not over.  How many clocks and
embedded computers do you have around your house?  I had to replace half a
dozen "backup" batteries, reset various devices which have no memory without
power, etc.

A very worthwhile description of this "technology trap" we are placed in by
depending on basic services like electricity is episode 1, "The Trigger
Effect", of James Burke's PBS series "Connections".  It covers in fairly good
detail the sequence of events and problems caused by the 1965 east coast
blackout.  I'd recommend it as a good video for RISKS readers to watch or to
show to others.  The video has started some very interesting conversations
concerning the risks of high technology with everyone I've shown it to.

BTW - Have any of the rest of you seen the drink-dispensing robot Hardee's
has in some stores now?  It appears to be directly tied into the same network
as their cash registers and fills drink orders while the cashier takes your
money.  I can see it now, "Sorry, we cannot give you a drink right now, our
computer is down."  Sigh...

Dwight D. McKay, Purdue University, Engineering Computer Network
(317) 494-3561   mckay@ecn.purdue.edu  --or--  ...rutgers!pur-ee!mckay

------------------------------

Date: Fri, 15 Mar 91 16:29:12 -0500 (EST)
From: Esther Filderman
Subject: Strange numbers on your beeper

The article about the beeper scam reminded me of something that occurred to
me two weeks ago.
When my beeper went off in the middle of a Saturday afternoon I was not fazed
by the strange number that appeared, figuring that it was a coworker calling
from home.  When I called the number I got a real surprise: I reached US
Air's pilot scheduling number!

The person I spoke with told me that the database of beeper numbers was very
out of date.  When I mentioned that I had had my beeper for over six months,
she responded that she had once called a number a year out of date.

Meanwhile, some poor pilot was wondering when her/his next flight was....

Esther C. Filderman, System Manager, Mercury Project,
Computing Services, Carnegie Mellon University   ef1c+@andrew.cmu.edu

------------------------------

Date: Fri, 15 Mar 91 15:47 CST
From: TK0JUT1@NIU.BITNET
Subject: Re: "What the laws enforce" [RTM] (RISKS-11.29)

I rather liked PGN's comment that "there is still a significant gap between
what it is thought the laws enforce and what computer systems actually
enforce."  It's parsimonious and incisive.  I interpreted it to mean simply
that the law has not caught up to changing technology, and old, comfortable
legal metaphors are inappropriately applied to new, qualitatively different
conditions.

Calling simple computer trespass (even if files are perused) a heavy-duty
felony subjecting the offender to many years in prison does not seem
productive.  I may walk on your grass and pick your flowers, even if there is
a prohibitive sign.  But it is unlikely there would be a prosecution (informal
sanctions, yes, but not a prosecution), and if there were, it would be
unlikely to be a highly publicized case subjecting me to federal felony
charges, even though a federal ecological interest might be claimed.

The point seems to be that emerging computer laws are archaic.  Neither those
who write the laws nor those who implement them have a clear understanding of
what is involved or at stake.  When mere possession (not use, but possession)
of "forbidden knowledge" can be a felony (as it is in California), we must
begin to question what the law thinks it's enforcing.  One can oppose
trespass while simultaneously opposing Draconian attempts to stomp on those
who tread tackily in our once-pastoral fields.  And, at the moment, I suggest
that it's law enforcement agents who are the greatest danger to the computer
world, not hackers.  Why?  Because "there is still a significant gap between
what it is thought the laws enforce and what computer systems actually
enforce."   [Thanks!  PGN]

------------------------------

Date: Fri, 15 Mar 91 15:49:44 PST
From: dmturne@ptsfa.pacbell.com (Dave Turner)
Subject: Voice Recognition Experiment

The following was excerpted from comp.dcom.telecom.  Although it appears to
be a legitimate study, the unscrupulous could reap vast rewards.

>The Oregon Graduate Institute of Science and Technology is building a
>huge database of voices as part of a project to develop voice
>recognition for US West directory assistance.
>
>They want to be able to classify sounds according to regional
>differences, and they need thousands of samples of speech to do this.
>
>Call 800-441-1037 (I assume this is nationwide ... it may not be) and
>follow the voice prompts.  They will ask your last name, where you are
>calling from, and where you grew up, and then ask you to pronounce
>several words and recite the alphabet.

This could be used for vocal forgery.  By combining the requested words, the
alphabet and, possibly, numbers, a digital vocabulary could be produced for
everyone who participated in the study.
Once this is available, a "bad guy" could use it to place phone calls using
anyone's digital voice.  If the hardware were fast enough, the called party
could be fooled into believing that he/she is talking to the individual whose
voice is being used.

The addition of credit card numbers and expiration dates for each "voice"
would allow fraud that is hard to dispute; after all, it's your word (voice)
against his.

Including your name, location and other personal information in this study
could be a big mistake.  This sort of risk is made easier by duping people
into providing samples of their voices, but a determined "bad guy" could
obtain the same information by recording an ordinary phone call and
processing the data later.

------------------------------

Date: Friday, 15 Mar 1991 20:25:20 EST
From: Doug Sewell
Subject: `Sendsys' forgery - denial of service?

I don't get many SENDSYS requests in control, so they tend to stick out.
I've also learned by experience that even a limited-hierarchy or
limited-distribution sendsys will result in a disruptive amount of e-mail
(how was I to know that over 400 sites got some of bit.*?  I suspected 100,
tops).  Many of the replies are big (UUNET's was several thousand lines
long), and they trickle in for days.

Having said this, one I got today stuck out as being rather unusual.  Someone
forged a sendsys to rec.aquaria, misc.test, and alt.flame, in the name of one
of the 'celebrities' in those circles.  Distribution was unlimited.  This
type of prank amounts to a significant denial-of-service attack, IMHO.  In
this case, it may also mean bodily injury for the perpetrator, if he's
caught.  (If you want to know who, go look in alt.flame.)

Doug Sewell, Tech Support, Computer Center, Youngstown State University,
Youngstown, OH 44555   doug@ysub.bitnet   doug@ysub.ysu.edu

------------------------------

Date: Mon, 18 Mar 91 13:42:25 est
From: wex@PWS.BULL.COM
Subject: Re: Medical privacy and urine testing (Larry Nathanson, RISKS-11.29)

The issues surrounding urine testing are something I have been researching
heavily for over a year, especially as they began to affect my life more
extensively.  While I generally agree with Nathanson's assertions, he does
make one important error: "drug testing is generally not part of your
record".  This is true some of the time, but misleading.

The discussion revolves around privacy, and one of the concerns about urine
testing is that testing agencies (gov't, companies and the military)
generally require you to sign a form detailing any and all prescription
medications you are taking.  In many cases, the testing agencies require the
testee to produce the actual prescriptions, and may call the prescribing
doctor to confirm their validity.

This information is clearly part of your medical record, and it seems an
invasion of privacy to require the employee to reveal that s/he is taking
{birth control pills, AZT, insulin, anti-depressants, etc.}.  In each case,
access to prescription information reveals an enormous amount of medical
information which is customarily assumed to be private.

--Alan Wexelblat   phone: (508) 294-7485
Bull Worldwide Information Systems   internet: wex@pws.bull.com

------------------------------

Date: Mon, 18 Mar 91 10:30:54 PST
From: Pete Mellor
Subject: Long-lived bugs

Jerry Bakin's item in RISKS-11.29 about the 25-year-old known bug reminded me
of some stories about fairly ancient unknown bugs.

I was told by a colleague, who was a computer engineer, about a UK site which
required its operating system to be enormously reliable.
(They were so highly secret that I was not supposed to know that they
existed, so he couldn't provide much in the way of supporting detail.)  They
had learned the hard way that each new version brought with it its own crop
of new bugs, and so had stayed resolutely out of date for many years.
Running a stable job mix and not updating, they eventually achieved 4 years
of failure-free running.  At the end of that time, a new, serious bug was
discovered.  It had lain dormant all that time.

The Air-Traffic Control system at West Drayton has recently been replaced.
The previous system had been in use for many years.  A software engineer who
had studied this system told us that a new bug was recently discovered in a
piece of COBOL code which had not been changed for 20 years.

Such anecdotes could be dismissed, except that they are supported by careful
research.  E. N. Adams, in "Optimizing preventive service of software
products", IBM Journal of Research and Development, 28(1), pp. 2-14, 1984,
describes investigations into the times to detection of bugs in a widely used
operating system.  He found that over 30% of all bugs reported caused a
failure on average only once every 5000 running years.

Peter Mellor, Centre for Software Reliability, City University,
Northampton Sq., London EC1V 0HB   +44(0)71-253-4399 Ext. 4162/3/1
p.mellor@uk.ac.city (JANET)

------------------------------

Date: Mon, 18 Mar 91 15:59:57 PST
From: jdd@src.dec.com (John DeTreville)
Subject: A cautionary tale [long]   [On the Midway in a 3-ring SRCus?]

This is a cautionary tale about a software failure that RISKS readers might
find interesting.  I wrote down this description soon after the failure, in
as much detail as I could, because it made such an interesting story.  I've
listed some possible lessons at the end, and readers are welcome to add
their own.

Around 5:00 p.m. on Friday, February 16, 1990, much of the distributed
environment at Digital's Systems Research Center (SRC) became unavailable.
Although no machines crashed, most user operations failed with authentication
errors and users could get almost no work done.  Some quick diagnostic work
determined the problem: the contents of the distributed name service had
become corrupted.  Lengthier detective work determined the long sequence of
accidents that caused the corruption.

I should point out to start that at no point during this episode did the name
service itself fail.  The design and implementation of the name service were
both quite solid.  All the failures were elsewhere, although they manifested
themselves in the name service.  SRC is purposely dependent on our
distributed name service because it has numerous practical advantages over
the alternatives, and because it has given us very reliable service over an
extended period.  (Failures of unreliable systems aren't very instructive!)

First, some necessary background.  SRC's research software environment is
called Topaz.  Topaz can run stand-alone or layered on top of Ultrix,
Digital's product version of Unix.  We built Topaz at SRC, and while the
research ideas that we test in Topaz may influence Digital's product
directions, Topaz is not a production system.

Once every year or two, SRC exports snapshots of the Topaz environment to a
few universities that we maintain close ties with.  We collect the components
of an export release, then bring the snapshot up on an isolated testbed and
verify that its elements work together and do not accidentally depend on
anything not in the release.
Part of the information in SRC's name service is the user data traditionally
stored in /etc/passwd.  For administrative convenience, we still maintain
/etc/passwd files, and although Topaz accesses the name service instead of
/etc/passwd, administrative daemons track any changes in /etc/passwd (via a
"dailyUpdate" script).  For example, if users leave Digital, administrative
procedures delete them from /etc/passwd; once dailyUpdate runs, all mention
of them is removed from the name service.

On with the story.  A few months before, Lucille Glassman had built our most
recent export snapshot.  To test it, she put together a testbed environment
in the machine room, using a small VAX named "midway" as an Ultrix-based
Topaz server machine.  The export testbed ran on a small Ethernet
disconnected from SRC's main network.  The testbed environment had its own
name service, and midway had its own /etc/passwd.  Midway's /etc/passwd
wasn't very large--about a dozen users--and so its name service didn't hold
many names.  But that was intentional; it was just a testbed.

Since the testbed environment was disconnected from SRC's main network,
software snapshots were brought over to midway via a disk that was
dual-ported between midway and bigtop, a large Ultrix server machine on SRC's
main network.  The disk appeared in each system's /etc/fstab (the file system
table); it was moved from system to system using the /etc/mount and
/etc/umount commands.  Lucille would mount the disk on bigtop, copy /proj to
the disk (/proj holds the Topaz environment), then unmount it from bigtop and
mount it on midway as /proj.

Later, after the export was completed and the tapes had been sent out,
Richard Schedler and Lucille did some cleanup on midway.  They turned off its
Topaz servers, including the name server.  They also edited midway's crontab,
which runs various commands at various times, not to run dailyUpdate; there
was no need for it.  But they didn't reconnect midway to the network as an
ordinary Ultrix machine; they left it isolated on the testbed network.

Here comes the amusing part.  It turns out that on the version of Ultrix
running on midway, /usr/lib/crontab is a symbolic link to /etc/crontab.  The
cron daemon reads from /usr/lib/crontab, but the file physically resides in
/etc/crontab.  Knowing this, Richard and Lucille edited /etc/crontab to
remove the call to dailyUpdate.

The first thing that went wrong was that, at some point earlier, this
symbolic link had been broken, and /usr/lib/crontab had been replaced with a
copy of /etc/crontab.  Most people at SRC use the Ivy text editor, which runs
as a single server per user, creating a window per file.  Since Ivy runs with
the user's permissions, you can't use it to edit files like /usr/lib/crontab,
which you can't ordinarily write.  Users get around this limitation by
editing copies, then moving the copies back as super-user.  This is an
error-prone operation, and we believe that at some time someone
fumble-fingered the last step.

So when Richard and Lucille edited /etc/crontab, it had no real effect; cron
kept on using the old /usr/lib/crontab.  Every day, midway ran dailyUpdate.
But dailyUpdate tried to run a program from the Topaz environment on /proj,
and the dual-ported disk holding /proj had been claimed by bigtop, so midway
couldn't access it, and dailyUpdate silently failed every day.  Also, midway
was still disconnected from the network.
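The trap is easy to check for after the fact.  The following is only an
illustrative sketch (nothing like it was part of SRC's procedure, and it
assumes an ordinary Bourne shell on the machine in question), showing how
someone about to edit /etc/crontab could first confirm that /usr/lib/crontab
is still a symbolic link rather than a stale private copy:

    # Illustrative only: confirm that cron's copy is still a link to
    # /etc/crontab before assuming that edits to /etc/crontab will take effect.
    if [ -h /usr/lib/crontab ]
    then
        echo "/usr/lib/crontab is a symbolic link; editing /etc/crontab is enough"
    else
        echo "WARNING: /usr/lib/crontab is a separate file; cron will not see" \
             "changes made to /etc/crontab" 1>&2
    fi
    # Comparing the two files would have exposed the stale copy just as well:
    cmp -s /etc/crontab /usr/lib/crontab || echo "WARNING: the two crontabs differ" 1>&2

Had either check been run before the edit, the stale copy would have shown
itself immediately.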
The second thing that went wrong was that midway got reconnected to the
network.  Someone threw a local/remote switch on a DELNI Ethernet connector.
This happened some time in the previous few months.  (A few months after this
writeup circulated internally, we found out what had happened; someone had
been fixing a broken workstation on the testbed network, and tested it by
rejoining the networks rather than moving the workstation's network
connection.)

On Friday, 2/16/90, at 11:00 a.m., SRC had a power failure during heavy
rains.  This was the third thing to go wrong.  When power came back, bigtop
and midway both came up.  Midway came up faster, being a smaller system.
This was the first time in months that midway had booted while bigtop was
down, and midway got to claim the dual-ported disk.

Friday at 5:00 p.m., midway successfully ran dailyUpdate.  It contacted SRC's
name service, and made the name service contents consistent with its
abbreviated /etc/passwd.  Soon afterwards, when the authentication caches
expired on their workstations, most people found themselves unable to do
anything that required authentication: log in, run programs, even log out.
(Richard and Lucille and I didn't notice anything wrong at first, because we
were listed in midway's /etc/passwd.  But Lucille received mail from the name
service saying that root@midway had just made a bunch of changes, and I got
calls from people asking, "What does it mean when it says, `Not owner'?")

So Lucille and I went to the machine room (dodging the person installing some
new locks), and looked at midway's /etc/crontab.  Everything looked fine; no
mention of dailyUpdate there.  (It was much later that we discovered the call
to dailyUpdate was still in /usr/lib/crontab.)  Although we didn't know what
had made dailyUpdate run, Lucille rethrew the DELNI switch to isolate midway
from the network so it couldn't happen again.  (If the person installing new
locks had been a little ahead of schedule, we probably wouldn't have been
able to get into the machine room, since we didn't have keys yet.)

Lucille then ran dailyUpdate against a real copy of /etc/passwd, to get
things to the point where everyone could log in.  She discovered that there's
an upper bound to the number of additions that dailyUpdate can make to the
name service at once.  This had never been a problem before, but it was a
problem now.  (Midway's dailyUpdate didn't have any problem with the same
number of deletions.)  Lucille finally coaxed dailyUpdate to run.
Unfortunately, restoring information isn't as easy as deleting it, and even
with a lot of hand editing, things still weren't great at 7:00 p.m., when
Lucille and I both had to leave.  Richard had left long before, as had Andrew
Birrell, our main name server expert, but Lucille sent them mail explaining
what had happened, and asking whether they could fix it.  Ted Wobber and Andy
Hisgen, two other name server experts, were both out of town for the weekend.

When I got back at 10:30 p.m., I found the mail system was broken, probably
as a result of the name service problems, so Lucille's mail hadn't been
delivered and no one had done anything since 7:00 p.m.  (The file system
holding server logs had also overflowed, because of all the RARP failures
caused by the name server outage.)  By the time I brought the mail system
back up, it seemed too late to phone anyone at home, so, after confirming
that no one else was fixing things at the same time, I started to restore the
name service contents from the previous night's incremental dumps.
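In hindsight, the 5:00 p.m. wipe-out is the kind of thing a crude sanity
check in dailyUpdate might have caught.  The sketch below is purely
illustrative: dailyUpdate had no such guard, and the threshold is an invented
number.  The idea is simply to refuse to reconcile the name service against
an /etc/passwd that looks implausibly small.

    # Illustrative sketch only, not SRC's dailyUpdate: refuse to proceed when
    # /etc/passwd looks implausibly small.  The threshold of 100 entries is an
    # invented number for the example.
    entries=`wc -l < /etc/passwd`
    if [ $entries -lt 100 ]
    then
        echo "dailyUpdate: only $entries entries in /etc/passwd; refusing to run" 1>&2
        exit 1
    fi

A guard like this would not catch a subtly wrong /etc/passwd, but it would
have stopped a dozen-entry testbed file from emptying the site's directory.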
The name servers hold their state in virtual memory, but keep stable copies
in a set of files that they write into a directory from time to time.  I
found backup copies of these files on bigtop from incremental dumps made at
6:00 a.m. Friday.  Fortunately, bigtop's name server had written a full set
of files to disk between 6:00 a.m. Thursday and 6:00 a.m. Friday, or this
wouldn't have worked.  We didn't dump these directories on the other name
servers, named "jumbo" and "srcf4c"; we had figured we didn't need to, since
the contents are replicated.  Even so, extra dumps might have come in handy
if bigtop's dump had been unusable.  We've now started dumping these
directories on jumbo too, just in case.  (I had to do this restore before
6:00 a.m. Saturday, since the incremental dumps are kept on disk, and each
day's incremental dumps overwrite the previous day's.)

So I reconstructed an appropriate directory for bigtop's name server, but
couldn't do the same for jumbo or srcf4c.  I killed all three name servers,
installed the restored state on bigtop, and restarted bigtop's name server.
At this point, SRC had a name service, but it wasn't replicated.  I left the
other name servers down because, had they kept running, the overnight
skulkers would have made the name servers consistent, and bigtop's restored
(older) information would have lost to the more recent, corrupted information
in the other servers.  I sent out a message describing the state of the world
and went home, figuring that things weren't really fixed, but that nothing
bad could happen before I came in Saturday morning.

Saturday morning, SRC had two more power failures during more rain.  Jumbo,
bigtop, and srcf4c all went down.  Srcf4c has an uninterruptible power
supply, but the batteries had probably gone flat, so it went down during each
power failure.  When power was restored, bigtop's name server came back up,
but so did jumbo's and srcf4c's.  I had only killed the running instances of
the other servers, not uninstalled them, since I was tired and thought it
wouldn't matter overnight.  Ha!  Jumbo's server rebooted automatically after
the power came back.  Perhaps as a result of the flaky UPS, srcf4c did not
reboot, but a helpful passer-by rebooted srcf4c by hand.  He hadn't read the
electronic message I'd left, since he couldn't log in, and, in any case,
figured that some inconsistency was better than total unavailability while
waiting for jumbo and bigtop to check their disks and finish booting.
Compounding Saturday morning's confusion, I got to SRC later than I had
planned, not wanting to travel in the rain.

At this point, users had a 2/3 chance of getting their data from a bad name
server, and the bad servers were slowly propagating their contents to the
good one.  Fortunately, I had kept copies of the directory I had
reconstructed on bigtop the night before (plus the contents before I
overwrote it, plus copies of everything else I could find; I had known I was
tired).  Even more fortunately, Andrew and Richard agreed to come in.  We
killed all the servers, reset bigtop's contents and restarted its server,
then Andrew used magic name service commands to erase the name service
replicas on jumbo and srcf4c and create new ones, copying their contents from
bigtop.  And that fixed everything.

Many thanks to everyone at SRC who helped to understand the problem and to
fix it.  Thanks also to Jim Horning, Cynthia Hibbard, and Chris Hanna for
reviewing this writeup.

What were the lessons?  Some might be:

1) Things break Fridays at 5:00 p.m., especially if it's a long weekend.
(Although SRC as a whole didn't get that Monday off for President's Day, many
people weren't back by then.  Perhaps some were trapped in the Sierras after
heavy snows and an avalanche closed the roads.)

2) The name service had been so reliable that there were few experts
available to fix it.  I'm not an expert, but I knew how it worked because I
once released a faulty garbage collector that caused some name servers to
lose data and eventually crash; I had done penance by fixing it.

3) You're always ready to fight the previous war.  When I discovered the name
server problems, my first reaction was that it was another garbage collector
bug (even though the collector had been stable for about a year).
Discovering that garbage collection had nothing to do with the problem wasted
some time.

4) Ivy's inability to edit protected files may not be a big problem on the
average, since those few users for whom this is a problem can work around it,
but the workarounds can be dangerous.  Moreover, these users didn't complain
about this limitation to the Ivy developers; they devised the workarounds on
their own.

5) After midway's /usr/lib/crontab got overwritten with a real file, it's
unfortunate that Richard and Lucille followed the link in their heads and
edited /etc/crontab, instead of editing /usr/lib/crontab and letting midway
follow the link.  Although a very similar situation had occurred two years
earlier, neither one expected it to happen again.

6) SRC's name service allowed only one instance of the name service on the
same network, virtually inviting this sort of collision of namespaces.  Since
then, Digital has developed product-quality name servers without this
limitation, but we were running our own earlier experimental software.  This
limitation was probably a mistake waiting to strike, but it's a sort of
mistake that's commonly made.

7) Although there are plenty of locks on the machine room, someone toggled
the DELNI.  Perhaps some network connectors should also have been unscrewed
(and hidden).  Again, this wouldn't have been a problem if we'd been using
Digital's product software.

8) While the export snapshot was being built, Lucille was very careful to
keep midway isolated from SRC's main network.  Afterwards, she watched midway
for a couple of days, making extra sure that it wasn't exporting its
/etc/passwd contents.  But she didn't watch it for months.  Perhaps she
should have reinstalled Ultrix on midway, deleting all old state.

9) Using dailyUpdate to keep the name service consistent with /etc/passwd
seems cumbersome and error-prone.  We may move toward a scheme where the name
service drives /etc/passwd instead, since even catastrophes like this one
would then not lose information.

10) When I fixed things Friday night, I knew I was tired.  As a result, I was
very careful; I made copies of everything that might be overwritten.  They
might well have been overwritten even if I hadn't been tired, and then I
mightn't have had the copies.

As I said at the beginning, the name service itself did not fail.  However,
some other parts of the environment were not as well thought out, and the end
result was a loss of data held in the name service.  Moreover, the
experimental name server's limitation to one instance per network made it
especially susceptible to failure caused by accidental network
reconfiguration.

John

------------------------------

End of RISKS-FORUM Digest 11.30
************************