Been meaning to post 3 more presentations from the Darlington hearings. Here they are, finally - starting with this one, Louis Bertrand's on April 1st. Most of us don't understand the limitations of computer software - but Louis sure does!! Btw, you can find the April 1st transcript here (audio here)
Concerns with software based instrumentation & control systems
Mr. Chairman, members of the panel, good morning.
My name is Louis Bertrand. I am a professional engineer and I live in Bowmanville. My engineering experience is in electronic product design, including embedded software, as well as information technology and information security.
Mr. Chairman and members of the commission, good morning. My name is Louis Bertrand. I am a professional engineer and I live in Bowmanville. My engineering experience includes the design of electronic products, as well as information technology and data security.
My presentation this morning will deal with my concerns regarding the safety and reliability of instrumentation and control systems based on embedded microcontrollers and the software running them.
My presentation this morning deals with my concern about the safety and reliability of data acquisition and control systems based on software for embedded microprocessors. Because of the technical terms, I must continue my presentation in English, but if a question is put to me in French, I will try as far as possible to answer it in kind.
The New Nuclear Darlington Environmental Impact Statement section 7 submitted by the proponent considers the mitigation and effects of accidents, malfunctions and malevolent acts. It is my observation that the language used to describe these potential events shows that the designers consider them highly unlikely. However, the increased complexity and failure characteristics of software based instrumentation and control (I&C) systems lead me to ask whether some new scenarios for accident initiating events have been overlooked or underestimated.
The EIS and additional responses provided by the proponent make reference to several software quality assurance standards such as CSA N290.14 (Qualification of Pre-Developed Software) and CSA N286.7-99 (Quality Assurance of Analytical, Scientific, and Design Computer Programs) as well as AECB draft regulatory guide C-138(E) (Software in Protection and Control Systems). However, the guidance in those documents is prescriptive, and such guidance alone cannot provide the level of detail and completeness currently required to develop safety critical software and firmware systems.
Coffee Mug (c. 1982)
Weinberg’s Law: If builders built buildings the way programmers wrote programs, then the first woodpecker that came along would destroy civilization.
It also concerns me that an article on forensic engineering, the discipline of failure analysis, in the January/February 2011 edition of Engineering Dimensions, the magazine of Professional Engineers Ontario, does not mention software as a potential factor in failures (Mastromatteo).
Yet software failures occur on a regular basis and occasionally lead to serious injury or death, as the Therac-25 accidents of 1985-1987 demonstrated (Leveson, 2006).
An Investigation of the Therac-25 Accidents
Author(s): Nancy G. Leveson and Clark S. Turner (abstract by Philip D. Sarin)
The Therac-25, a computerized radiation therapy machine, massively overdosed patients at least six times between June 1985 and January 1987. Each overdose was several times the normal therapeutic dose and resulted in the patient's severe injury or even death. Overdoses, although they sometimes involved operator error, occurred primarily because of errors in the Therac-25's software and because the manufacturer did not follow proper software engineering practices.
Overconfidence in the ability of software to ensure the safety of the Therac-25 was an important factor which led to the accidents. The Therac-20, a predecessor of the Therac-25, employed independent protective circuits and mechanical interlocks to protect against overdose. The Therac-25 relied more heavily on software. Moreover, when the manufacturer started receiving accident reports, it, unable to reproduce the accidents, assumed hardware faults, implemented minor fixes, and then declared that the machine's safety had improved by several orders of magnitude.
The design of the software was itself unsafe.
Obviously, since that series of tragic accidents, the discipline of software verification and validation has made great strides. However, regulatory agencies are still required to maintain oversight of providers of safety critical software, as occurred in a recent case of radiation therapy equipment malfunction (Bogdanich).
April 8, 2010
F.D.A. Toughens Process for Radiation Equipment
By WALT BOGDANICH
The Food and Drug Administration said Thursday that it was taking steps to reduce overdoses, underdoses and other errors in radiation therapy by strengthening the agency’s approval process for new radiotherapy equipment.
In a letter to manufacturers, the F.D.A. said its action was based on a recent analysis of more than 1,000 reports of errors involving these devices that were filed over the last 10 years.
The F.D.A. will no longer allow new radiotherapy equipment to enter the market via a streamlined approval process that sometimes involved the use of outside, third-party reviewers, Dr. Alberto Gutierrez, the F.D.A.’s director of in vitro diagnostic device evaluation and safety, said in an interview. That process, he said, was instituted in the 1990s to reduce the agency’s workload and speed approval time.
Most of the reported problems — 74 percent — involved linear accelerators, computer-controlled machines that generate high-powered beams of radiation that target and destroy cancer cells.
Problems with computer software were most frequently cited as a cause for the errors, according to the letter sent Thursday by Dr. Jeffrey Shuren, director of the agency’s Center for Devices and Radiological Health.
Software quality assurance standards promoted by CSA, the US DOE and other public safety agencies are part of the requirements for safety critical software. Nonetheless, it is reasonable to ask if current methodologies have kept pace with increasing complexity.
The problem of identifying postulated initiating events (PIE) has been recognized as a key issue in the safety of new nuclear reactors (TSO, section 4.3). Since the PIEs drive the design and acceptance criteria, it is important to identify as many of them as possible. Chapter 7 of the EIS details several postulated accident scenarios, but they involve physical accidents or mechanical failures, not software or firmware malfunctions.
Since 1993, when the Darlington NGS was completed, software and computer technology has blossomed to provide us with a globe-spanning Internet, mobile devices and new integrated circuit technology. The complexity of software systems is ever increasing, as is the pace of change in the platforms for development and operation.
The nuclear industry's approach to safety has been to make cautious, incremental changes in design and operating procedures.
(Nancy G. Leveson, 2003)
“Although the terminology differs between countries, design basis accidents for nuclear power plants in the U.S. define the set of disturbances against which nuclear power plants are evaluated. Licensing is based on the identification and control of hazards under normal circumstances, and the use of shutdown systems to handle abnormal circumstances. Safety assurance is based on the use of multiple, independent barriers (defense in depth), a high degree of single element integrity, and the provision that no single failure of any active component will disable any barrier. With this defense-in-depth approach to safety, an accident requires a disturbance in the process, a protection system that fails, and inadequate or failing physical barriers. These events are assumed to be statistically independent because of differences in their underlying physical principles: A very low calculated probability of an accident can be obtained as a result of this independence assumption. The substitution of software for physical devices invalidates this assumption, which has slowed down the introduction of computers (although it has increased in the last few years).”
The entire support system for the software operating devices and systems in the generating station, including the physical hardware, networking environment, operating system and development tools, is in itself a complex system that must be examined as an extension of the generating facility itself. The development tools include editor, compiler, testing suite as well as the library of pre-existing modules necessary to support the actual programs. Those library modules, which may be developed by third parties, provide communication, user input, display and computation for the control software, as well as device drivers.
Taken together, this collection of hardware, software and network components is at least as complex as the operation of a nuclear reactor, the generating apparatus and their auxiliary systems. I believe there is cause for concern about the specification, design, validation and verification, and long term maintenance of this collection of systems.
2. Dealing with complexity and the potential for software errors
2.1. Hardware and soft errors
Integration densities are such that entire microprocessor systems can be built on a single system on chip (SOC). However, constantly shrinking integrated circuit geometries and lower operating voltages mean that these systems are more susceptible to soft errors caused by ionizing radiation and electromagnetic interference. This should be flagged as a common cause risk that could potentially affect any software-hardware system or device.
Contemporary SOC microcontrollers integrate CPU, EPROM to store the program binary code and control coefficients, sufficient RAM to run the program as well as necessary peripheral devices: analog-to-digital converters, timers, digital inputs and outputs and communication interfaces. The level of integration comes from reducing the geometry of transistors and interconnects on chip, as well as reducing the power dissipation of individual transistors by lowering the supply voltage to 3.3 volts or lower. These operating voltages are significantly lower than earlier standards.
With smaller IC geometries and lower voltages, the risk of soft errors caused by ionizing radiation is increased. A single event upset (SEU) occurs when an ionizing particle injects a current in a transistor sufficient to change the state of a memory element (Baumann, 2004). There are two modes for a soft error to occur. The first involves the direct change of a binary memory element (flip-flop, static or dynamic RAM cell) to its opposite state (“zero” to “one”, or vice versa). In the second, the ionizing radiation causes a combinational circuit to exhibit a transient incorrect output. If the transient persists across a clock edge, this transient state can be latched by a memory element and becomes an SEU. The higher the system clock frequency, the more likely the transient will be clocked in by a memory element.
Although the major concern about radiation exposure is for military or space based systems (satellites, probes), exposure at ground level is expected from background radiation as well as cosmic rays. Operation inside a nuclear facility increases the likelihood of soft errors (National Semiconductor).
The reduced size of transistors, lower operating voltages and increased CPU clock frequencies can increase the probability of soft errors in embedded microcontrollers powering mission critical devices. A system in which many devices share the same microcontroller type, or even the same semiconductor process technology, could be vulnerable to common cause failure rooted in the internal operation of the microcontroller.
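To illustrate the mechanism, here is a minimal sketch of how a single event upset silently corrupts a stored value, and how a simple parity check detects a single-bit flip. This is purely illustrative; the stored value and word width are invented, not drawn from any actual I&C system.

```python
import random

def flip_random_bit(word, width=32):
    """Simulate a single event upset: invert one randomly chosen bit."""
    bit = random.randrange(width)
    return word ^ (1 << bit)

def parity(word):
    """Even-parity bit: 1 if an odd number of bits are set in the word."""
    return bin(word).count("1") % 2

stored = 0x3A5F17C9            # hypothetical value written to memory
stored_parity = parity(stored) # parity recorded alongside it

corrupted = flip_random_bit(stored)

# A single-bit upset always changes the parity, so the check detects it...
assert parity(corrupted) != stored_parity
# ...but parity alone cannot correct the error, nor detect double-bit flips.
```

Parity is the simplest such check; practical mitigations use error-correcting codes (ECC) that can also correct single-bit upsets.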
As the number of microcontroller based instruments and control systems increases, so does the complexity of the software operating each one. The need to validate and verify the software becomes more important while at the same time becoming more difficult.
The first challenge is validation, which asks if the software correctly models the desired behaviour (Kelly, 2008). Subsequently, the challenge is to verify that the software is developed to the specifications required by the model.
The validation challenge involves subject matter experts in nuclear operations communicating their requirements to software developers, and the software developers in turn successfully translating those requirements into correctly operating programs.
Testing requires several concurrently applied techniques (Kelly, 2008; AECB, 1999):
·Regression testing: over time, tests and procedures are developed that test for the resolution of known problems and defects. The collection of tests is systematically applied to new versions to ensure that previous issues were not inadvertently re-introduced by the latest modifications;
·Code inspection: the source code is verified by others independent of the original programmers;
·Formal methods: methods to prove correctness such as those used by David Parnas in the control software for the existing Darlington NGS (Kelly, 2008);
·Randomized testing: a randomly selected sequence of inputs is presented to the software under test in an effort to flush out the most likely failures.
However, there is no guarantee that these methods will detect and prevent all potential initiating events due to software defects.
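As an illustration of the randomized testing idea above, a harness can present a function with a large number of random inputs and collect any that violate its specification. The function under test here is a hypothetical toy with a deliberately planted defect, not any actual control code:

```python
import random

def clamp_setpoint(value):
    """Toy function under test: clamp a sensor reading to the range 0..100.
    Deliberately buggy for this demonstration: negative inputs pass through."""
    if value > 100:
        return 100
    return value  # bug: values below 0 are not clamped

def randomized_test(runs=10_000, seed=42):
    """Present randomly selected inputs; return any that violate the spec."""
    rng = random.Random(seed)   # seeded so the failing cases are reproducible
    failures = []
    for _ in range(runs):
        x = rng.uniform(-1000, 1000)
        out = clamp_setpoint(x)
        if not (0 <= out <= 100):     # the specification the output must meet
            failures.append(x)
    return failures

failing_inputs = randomized_test()
assert failing_inputs                     # the random search flushes out the defect
assert all(x < 0 for x in failing_inputs) # every failure involves a negative input
```

Each failing input found this way would then be added to the regression suite, so that the defect, once fixed, stays fixed.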
An unforeseen consequence of networking safety critical systems with other systems was discovered as a result of a SCRAM incident at the Browns Ferry 3 reactor (NRC, 2007). Excessive network traffic caused a variable-frequency drive controller for a pump to malfunction. The abnormal network traffic was due to the failure of another device, a condensate demineralizer, on the same network that flooded the network with packets.
A word about how network devices operate. When a device receives a data packet, it must read the packet from the network and examine the destination address to decide whether it is the intended recipient. If not, the device simply discards the packet. Even though most of the network traffic was not intended for the VFD controller, it had to devote some processing time to examining each incoming packet. The extra processing load overwhelmed the controller and caused it to become unresponsive. The VFD controller was thus unable to process a command to increase the flow of cooling water, and the control room procedure called for a manual SCRAM.
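The per-packet cost described above can be sketched as a simple budget model. This is a simplified, hypothetical illustration; the addresses, costs and budget figures are invented and are not taken from the NRC notice or the actual controller firmware:

```python
MY_ADDRESS = "10.0.0.7"       # hypothetical controller address
CPU_BUDGET_PER_CYCLE = 1000   # processing units available per control cycle
COST_TO_EXAMINE = 1           # cost just to read a header and check the address
COST_TO_PROCESS = 50          # cost to act on a packet addressed to us

def control_cycle(packet_destinations):
    """Return True if the controller kept up, False if it was overwhelmed."""
    budget = CPU_BUDGET_PER_CYCLE
    for dest in packet_destinations:
        budget -= COST_TO_EXAMINE      # every packet costs something to inspect,
        if dest == MY_ADDRESS:         # even those meant for other devices
            budget -= COST_TO_PROCESS
        if budget < 0:
            return False               # controller can no longer respond in time
    return True

normal_traffic = ["10.0.0.9"] * 100 + [MY_ADDRESS] * 5
assert control_cycle(normal_traffic)   # light traffic: the controller keeps up

flood = ["10.0.0.13"] * 100_000 + [MY_ADDRESS]
assert not control_cycle(flood)        # a flood of irrelevant packets starves it
```

Partitioning the network with firewalls works by shrinking the list of packets the controller ever sees in the first place.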
The problem was resolved by partitioning the network with firewalls to isolate the safety critical systems from the rest of the network and limit the amount of traffic the device could see on its wire. However, it's only in hindsight that the solution at Browns Ferry 3 seems obvious. It is standard practice to compartmentalize networks using firewalls and routers to isolate subnets within an organization to limit the spread of computer worms and automated attacks.
This raises the question: what about the future? What network problems will arise in new networks as more data is transferred over IP networks instead of discrete wiring? What happens to realtime requirements with more diverse traffic? Networks nowadays can carry voice and video, in addition to the traditional instrumentation and control data streams. The number of networked devices is far greater, multiplying the number and nature of networked interactions between software based devices.
Programmable logic controllers (PLC), ubiquitous in process control applications, are not immune to the ramping up of software complexity. Most now use embedded microcontrollers to execute programs compiled from on-screen representations of ladder logic networks. The ladder logic compiler used by the designer must meet the criteria set out in standards for design programs (for example, CSA N286.7-99). In addition, there must be assurance that PLC firmware will execute the compiled program correctly. A common cause fault in the PLC firmware that executes the simulated ladder logic diagram could cause all controllers with similar firmware to fail under the same circumstances. PLCs are networked with dedicated embedded microcontrollers as well as control consoles and data recorders, bringing an additional level of risk to their operation.
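To make the layered-execution concern concrete, here is a toy interpreter for one compiled ladder rung. The instruction names are purely illustrative (loosely modeled on IEC 61131-3 instruction lists), not any vendor's actual firmware. The point is that a defect in this interpreter layer would affect every controller running it, no matter how carefully the ladder diagram itself was reviewed:

```python
def execute_rung(program, inputs):
    """Firmware-level interpreter for one compiled ladder rung.
    program: list of (opcode, operand) pairs; inputs: contact states."""
    acc = None
    for op, arg in program:
        if op == "LD":         # load a contact's state into the accumulator
            acc = inputs[arg]
        elif op == "AND":      # series contact
            acc = acc and inputs[arg]
        elif op == "OR":       # parallel contact
            acc = acc or inputs[arg]
        elif op == "OUT":      # energize (or not) the output coil
            return acc
    raise ValueError("rung has no output coil")

# Hypothetical rung: start the pump when either start button is pressed
# AND the safety interlock is closed.
rung = [("LD", "start_a"), ("OR", "start_b"),
        ("AND", "interlock"), ("OUT", "pump")]

assert execute_rung(rung, {"start_a": True, "start_b": False, "interlock": True})
assert not execute_rung(rung, {"start_a": True, "start_b": False, "interlock": False})
```

The ladder diagram the designer reviews is only the top layer; correctness also depends on the compiler that produced the instruction list and the firmware that interprets it.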
2.5. Maintenance over the life cycle of the station
The operating span of the NND is expected to be 60 years before decommissioning. Sixty years ago, stored program computers were experimental oddities mostly powered by vacuum tubes.
Programmers in the 1970s would have scoffed at the idea that their COBOL programs would still be in use a quarter century later, causing anxiety at the possibility that the programs would suddenly find themselves in the year 1900 the day after December 31, 1999. The point is that the pace of technological change is so fast that the current design would have to be "future proof", an impossible task.
Another serious issue is maintaining the development system for the devices in use at the generating station over the lifetime of the devices themselves if any maintenance, bug fixes or other modifications to the running program are required. The woes of maintaining obsolete hardware and operating systems are compounded by the need to maintain the programming environment virtually frozen in time. The development knowledge of the original programmers must also be captured as part of the development environment.
3. Threats and attacks
The common cyber-attacks reported in the news would not be expected to affect safety critical systems, as it is assumed that they are isolated from the Internet, an elementary precaution.
However, the possibility of a successful attack, though remote, cannot be dismissed as "not credible". Several factors could enable such an attack:
·Increased availability of small wireless personal devices (smart phones, wireless PDAs and tablets). As those devices become smaller yet more powerful, it is not unrealistic to postulate an attack from inside mediated by a wireless access point unwittingly installed against network management rules.
·Ubiquitous small portable memory devices able to introduce malicious programs (a.k.a. viruses) into the protected network environment
·A successful “publicity” attack on a non-safety related computer (e.g. air sampling beyond the fence line) could damage the proponent's reputation for safety. Any protestation that the system in question was of trivial importance would be lost in the noise resulting from a newspaper headline that screams “Nuke plant computer hacked”.
3.1. Future threats and attacks
Cryptographic protocols that depend on the computational expense of attacks for their security must offer protection not only against current attacks, but also against those expected in the future, when exponentially faster processors become available. A recent development is widely distributed computing over the Internet, as pioneered by the SETI@Home project (SETI@Home). Thousands of otherwise idle computers could be harnessed to recover encryption keys for secured communications, for example those that enable virtual private network (VPN) access to internal networked computers over the Internet.
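The scaling argument can be put in rough numbers. The key size and per-machine trial rate below are hypothetical, chosen only to show how linearly adding machines shortens an exhaustive key search:

```python
KEY_BITS = 56                  # illustrative: a legacy DES-strength key
TOTAL_KEYS = 2 ** KEY_BITS
RATE_PER_MACHINE = 1e9         # hypothetical: one billion key trials per second

def expected_years(machines):
    """Expected time to recover a key by brute force: on average the
    attacker must search half the keyspace."""
    seconds = (TOTAL_KEYS / 2) / (RATE_PER_MACHINE * machines)
    return seconds / (365.25 * 24 * 3600)

# 1,000 machines give a 1,000-fold speedup over a single machine.
assert abs(expected_years(1) / expected_years(1000) - 1000) < 1e-9
```

The point is only the linear scaling: a key length considered safe against one machine may not be safe against thousands of harnessed idle computers.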
Although the proponent has spelled out mitigation measures for various accident, malfunction and malevolent act scenarios, the use of expressions like "not credible" or "beyond design basis" would make an information security expert cringe. Such language gives the impression that events will unfold in an orderly and predictable manner, and generating station personnel only need to refer to their training scenarios to respond to any foreseeable emergency.
Software faults don't follow obvious rules. A soft error in a critical section of code can have an unpredictable effect. A common cause error triggered by a rare combination of inputs could affect a number of devices running similar hardware or firmware.
Attackers don't follow rules. Actually, they deliberately break the rules. Computers have given them the tools to make complicated attacks easy by automating the procedures into attack scripts. The Internet has made it easy to attack any other computer on the Internet since they are all virtually next door to each other (Schneier). Isolating safety critical networks from the Internet is a natural precaution but there can be no guarantee that the supporting systems are sheltered from attack.
It is not sufficient to test for expected conditions because security flaws are often in code that is rarely executed, or conditions that never naturally arise.
3.3. "What if" thinking
The only way to identify postulated initiating events (PIE) due to malicious software is to change one's frame of mind from "not credible" to asking open-ended, stimulating questions such as, "If it were to happen, how could it start?"
"What if" thinking requires designers to put themselves in the roles of attackers, similar to what penetration testing professionals do to audit network security for their clients.
This kind of thinking is creative, playful and willing to break rules. By engaging in this kind of exercise, the mind is freed of pre-conceived notions of what's possible and what's not. "One-in-a-million" events can suddenly become more probable, or links between apparently unrelated events and conditions can be seen as part of a larger chain of causality that could potentially lead to an accident.
To illustrate this, let me describe a commonplace programming error known as the buffer overflow, so called because it causes data to be copied beyond the bounds allocated for a string of text characters. The characters copied beyond the bounds are likely to overwrite data that belongs to another part of the program, unrelated to the text buffer itself. This behaviour is what makes software errors difficult to analyze, and their consequences even harder to predict.
Our hypothetical programmer expects that passwords are never more than a hundred characters long. For safety, he allocates 1,000 characters for his buffer. The attacker asks "what happens if the password contains more than 100 characters?" The program is safe up to 1,000. But what happens when the attacker supplies 10,000 characters? Attackers break rules.
This technique has been one of the most prevalent attacks on the Internet and it is devastatingly effective, often leading to a complete takeover of the system by the attacker (Schneier, p. 207). Conventional testing would not detect this error. In normal operation, a reasonable length password is presented and either accepted as valid or rejected. It's only when absurd input is provided that the program fails.
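The overflow mechanism can be simulated safely in a memory-managed language. This is a deliberately simplified model (the memory layout, buffer size and flag are all invented for illustration); a real overflow corrupts raw process memory in exactly this adjacent-overwrite pattern:

```python
# Model process memory as one flat byte array: a 1,000-byte password buffer,
# followed immediately by an "authenticated" flag belonging to another part
# of the program, then further program data (all hypothetical).
memory = bytearray(20_000)
BUF_SIZE = 1000
FLAG_ADDR = BUF_SIZE                 # the flag lives just past the buffer

def unsafe_copy(password):
    """Copy with no bounds check, in the manner of C's strcpy."""
    for i, byte in enumerate(password.encode()):
        memory[i] = byte             # any write past BUF_SIZE corrupts other data

memory[FLAG_ADDR] = 0                # flag cleared: not authenticated
unsafe_copy("hunter2")               # reasonable input: flag untouched
assert memory[FLAG_ADDR] == 0

unsafe_copy("A" * 10_000)            # attacker supplies absurd input
assert memory[FLAG_ADDR] == ord("A") # adjacent data silently overwritten
```

The one-line fix is a bounds check that refuses or truncates input longer than the buffer, which is precisely the code path that testing with expected, reasonable inputs never exercises.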
What if the compiler on a software developer's workstation was compromised to inject malicious code in all programs processed by the compiler? At the binary code level, the effects of the change would be hard to detect because the code is not human readable.
It is important to attempt to foresee all possible attacks because the defender must make every defense impenetrable. For the attacker, the job is simpler: only one attack needs to succeed.
4. Conclusion and recommendations
My submission presented concerns that I believe are credible and realistic considering the current state of the art of software development, the complexity of embedded operating systems and control programs, and ubiquitous networking.
Therefore I strongly recommend that this panel reject the proponent's application unless the proponent can supply a realistic and practicable plan for safety critical software and firmware that:
·Tests the finished software or firmware against unusual or absurd input conditions or states, in order to flush out hidden defects that could be exploited by a malicious attacker.
·Runs probabilistic tests to simulate soft errors due to single event upsets caused by ionizing radiation in low power, high integration digital integrated circuits.
·Details the threat and risk assessment methodology used to identify software based postulated initiating events.
·Outlines the management approaches that would be in place to ensure that the configuration of software and firmware based devices, and that of the network itself, is as documented, and that changes to individual components and network topology are managed through a suitable review and deployment process.
·Maintains the software development tools throughout the lifecycle of the software itself, and that future replacement software be developed respecting the original requirements and any additions or adjustments thereto. If the development tools are upgraded or migrated to a newer development platform, the plan should detail how the upgraded tools will be tested to produce correct binary code.
Some final thoughts
There are some people in this province who have convinced themselves of some pretty remarkable things. Some have convinced themselves that nuclear is unquestionably safe, while others have reviled wind power as harmful to health and the environment. Beliefs such as these stand reality on its head.
Without presuming what this commission will decide or how, I would ask that a critical look be applied to the unspoken assumptions that the nuclear industry has thought of all the threats and risks.
The discipline of risk assessment itself should come under scrutiny. To my understanding, in its simplest form, risk assessment attempts to model the likelihood of a harmful event and the consequences of such an event. It’s a simple multiplication. The result is then balanced against the potential benefits to society and provides the basis for a go / no-go decision, or the expense and effort of additional mitigation.
In information technology, if I have a web server that serves 100 clients, and I know that the probability of a successful attack is one per year, and I also know that each attack costs me $10,000 in staff time and compensation to my clients for downtime, I can quantify this risk into a dollar amount and use it to estimate the worth of prevention or mitigation measures: in this case, $10,000 per year is my expected cost. It would make sense to buy a backup tape drive for $5,000 if I knew it would mitigate the damage by restoring my server faster. Could I justify spending $20,000 on a firewall and intrusion detection system?
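The arithmetic of the server example works out as follows (the dollar figures are the hypothetical ones above, and the risk measure is the simple likelihood-times-consequence multiplication just described):

```python
def annual_loss_expectancy(attacks_per_year, cost_per_attack):
    """Risk as the simple multiplication described above: the likelihood
    of the harmful event times the cost of its consequences."""
    return attacks_per_year * cost_per_attack

ale = annual_loss_expectancy(attacks_per_year=1, cost_per_attack=10_000)
assert ale == 10_000          # expected cost: $10,000 per year

backup_drive = 5_000          # pays for itself within a year of expected losses
firewall = 20_000             # needs roughly two years of expected losses
assert backup_drive < ale
assert firewall > ale
```

The calculation is tractable here only because both factors, frequency and cost, can be estimated from experience; that is exactly what breaks down in the nuclear case below.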
With nuclear, the calculation goes off the rails. The probability of an accident is admittedly very low. The consequences would not only be tragic, but extremely costly to the station, the surrounding area and to the economy of the province and of Canada. The simple multiplication no longer applies. You are multiplying infinitesimal probabilities by enormous damages to get an intermediate number. However, because of the difficulty of estimating either factor, the result is meaningless.
At a presentation to Clarington Council in 2009, Dr. Chris Olsson (Stantec) told the council in response to a question that “Risk assessment is not the science to tell you that it is safe”.
A Word About Fukushima
In the news, there is talk about the 50 (or is it 300) nuclear workers who are desperately battling to restore the failing systems in the damaged reactors. Their families are justifiably concerned for their health and safety.
To me this personalizes the nebulous side effects of nuclear power. We know that someone, somewhere will get sick because of radioactive emissions, but we can’t tell whether or not a particular case affecting a specific person was caused by nuclear power.
In the case of Fukushima, the causes and effects are tragic and my heart goes out to those workers and their families.
The accident also demonstrates that we are playing with forces that, if they escape the normal control parameters, are clearly beyond our ability to control - especially with something as fragile as computer software.
Mr. Chairman, members of the panel, I thank you for your attention and welcome your questions.
Mr. Chairman, commissioners, I thank you for your attention and welcome your questions.
AECB - Atomic Energy Control Board. "Software In Protection And Control Systems", Draft Regulatory Guide C-138 (E), October 1999
Baumann, R.C. "Soft Errors in Commercial Integrated Circuits", International Journal of High Speed Electronics and Systems, Vol. 14 No. 2 (2004) 299-309 (in Schrimpf, R. D. and D. M. Fleetwood, "Radiation Effects and Soft Errors in Integrated Circuits and Electronic Devices", World Scientific, 2004, ISBN 981-238-940-7)
Bogdanich, Walt. "F.D.A. Toughens Process for Radiation Equipment", The New York Times, April 9, 2010, on page A12 of the New York edition. http://www.nytimes.com/2010/04/09/health/policy/09radiation.html?_r=1 (Viewed Feb. 19, 2011)
CSA N286.7-99, "Quality Assurance of Analytical, Scientific, and Design Computer Programs for Nuclear Power Plants", Canadian Standards Association, March 1999 (Cited by Kelly, 2008)
Leveson, Nancy G. "White Paper on Approaches to Safety Engineering", April 23, 2003, http://sunnyday.mit.edu/caib/concepts.pdf (Viewed Feb 20, 2011)
Leveson, Nancy G. and Clark S. Turner (abstract by Philip D. Sarin), "An Investigation of the Therac-25 Accidents" Online Ethics Center for Engineering 2/16/2006 National Academy of Engineering http://www.onlineethics.org/Resources/Cases/therac25.aspx (Viewed February 20, 2011)
Kelly, Diane and Rebecca Sanders. "Assessing the Quality of Scientific Software", First International Workshop on Software Engineering for Computational Science and Engineering, Leipzig, Germany, May 2008.
Mastromatteo, Michael. "Engineering detectives go to the heart of the matter," Engineering Dimensions, Professional Engineers Ontario, January/February 2011
OPG Response, EIS IR 54 (Resubmission) and IR213:Regulatory Documents, Codes and Standards, Appendix 1B to Attachment A, File name: "9 July 2010a.pdf"
·C-138(E) Software in Protection and Control Systems (October 1999)
Codes and Standards
·CSA N290.14 Qualification of Pre-Developed Software for use in Safety-Related Instrumentation and Control Applications in Nuclear Power Plants
·CSA N286.7-99, Quality Assurance of Analytical, Scientific, and Design Computer Programs for Nuclear Power Plants, Canadian Standards Association, March 1999 (Also cited by Kelly, 2008)
National Semiconductor, "Radiation Owners Manual" Undated http://www.national.com/analog/space/rad_ownersman (Viewed Feb 20, 2011)
NRC - Nuclear Regulatory Commission, "Effects of Ethernet-Based, Non-Safety Related Controls on the Safe and Continued Operation of Nuclear Power Stations", US NRC Information Notice 2007-15, April 17, 2007.
Schneier, Bruce. "Secrets and Lies: Digital Security in a Networked World", John Wiley & Sons, 2000, ISBN 0-471-25311-1
SETI@Home project "About SETI@Home", undated, http://setiathome.berkeley.edu/sah_about.php (viewed Feb 20, 2011)
TSO "TSO Study Project on Development of a Common Safety Approach in the EU for Large Evolutionary Pressurized Water Reactors", 2001, EC EUR 20163