The BP Oil Spill: Could Software be a Culprit?
Computing Now Exclusive: Interview with Don Shafer
Don Shafer, chief safety and technology officer with the Athens Group, an independent software-consulting firm that provides risk-mitigation services to the offshore oil-drilling industry, describes how software failures occur on oil rigs, including mishandled alarms, untested software, frozen computer screens, and the absence of data recorders. Yet the most important failure might be the lack of standards across the oil-drilling industry.
No one yet knows what caused the Deepwater Horizon oil rig explosion that killed 11 workers and poured millions of gallons of oil into the Gulf of Mexico. But given that offshore oil rigs comprise dozens of complex subsystems that use or are controlled by software, could a software failure have contributed to this disaster?
In April 2010, an explosion occurred on the Deepwater Horizon oil rig, leased by BP and operated by its contractor, Transocean. The explosion killed 11 workers, destroyed and sank the rig, and caused millions of gallons of oil to pour into the Gulf of Mexico, about 40 miles off the Louisiana coast.
As we write this, it appears that the gushing well, about a mile under the sea, has finally been brought under control after more than three months of frenetic attempts. The total damage to the environment, economy, and future of deepwater fossil-fuel drilling in the US has yet to be determined.
While the American government and press are scrambling to assign blame for this catastrophe, no one knows what caused the explosion that led to this disaster. The cause might never be known. However, we wonder… could a software failure have contributed to this disaster?
Speculation of a Software Connection
We don't have access to all the data from this incident. However, Transocean's interim report, submitted to Representative Henry Waxman's committee in the US House of Representatives on 8 June 2010, stated the following under an "Action items/work needed" section: "Full control-system software review. Software code requested from manufacturer for investigation."1 Apparently, in studying the disaster, there's speculation of a software connection.
Additionally, an article appearing in the 19 July 2010 issue of the Houston Chronicle stated that "display screens at the primary workstation used to operate drill controls on the Deepwater Horizon, called the A-chair, had locked up more than once before the deadly accident."2 According to Stephen Bertone, Transocean’s chief engineer on the Deepwater Horizon, "Basically, the screens would freeze [and] all the data… would lock up." Bertone noted, however, that hard drives had been replaced, and he wasn't aware of any problems with the equipment the day of the accident.2
We'll learn more about software's role in this disaster as additional evidence surfaces. However, since one of us (Don Shafer) has extensive experience in software testing for oil rig technology and in post-incident analysis of fatal software-related equipment failures, let's speculate further about how a software problem could have caused the Deepwater Horizon incident.
Software Control on Oil Rigs
Offshore oil rigs comprise dozens of complex subsystems that use embedded software or are operated under software control (see Figure 1). For numerous reasons, each system is a potential point of failure.
For example, three rigs with the same design built over four years can end up with different equipment and software versions that might not integrate as expected. This could also lead to serious configuration-management problems.
Another problem is that much of the software residing in or controlling components is routinely delivered well after the equipment is onboard the rig. Engineers test the interfaces at the last minute — if they even test the software at all. Equipment interfaces thus present the weakest link in offshore oil rig systems in terms of reliability and safety, because the industry lacks interface standards and sufficient testing methods.3
Mishandled Software Alarms
There are several possible scenarios in which a software failure could have manifested as a serious rig failure. For example, any one of the many software systems controlling the equipment shown in Figure 1 could have started failing weeks or months before the Deepwater Horizon incident, causing accumulative problems. However, here we focus on one possible scenario — a mishandled alarm.
In an oil rig system, the sheer number of devices, coupled with insufficient testing and alarm management, leads to numerous software alarms popping up on the driller's workstation (see Table 1). In some cases, up to 50 alarms can occur every 10 minutes,4 which is much higher than industry standards recommend.5–7
Table 1. Industry guidance on alarm rates.

| Organization | Standard | Normal alarm rate | Peak alarm rate |
| --- | --- | --- | --- |
| The Engineering Equipment & Materials Users' Association | EEMUA 191 | 1 alarm per 5 minutes | Less than 10 alarms per 10 minutes |
| International Society of Automation | ISA 18.2 | Approximately 2 alarms per 10 minutes | 10 alarms per 10 minutes for less than approximately 2.5 hours per month* |
| Norwegian Petroleum Directorate | NPD YA 711 | 1 alarm per 5 minutes | Less than 1 alarm per minute |
| Abnormal Situation Management Consortium | Study results | Average rate: 2.3 alarms every 10 minutes | 95% of unique consoles experience 31–50 alarms every 10 minutes |

*Adapted from ISA 18.2: 16.5.2. Calculation is based on an eight-hour workday.
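To make the guidance in Table 1 concrete, here is a minimal sketch of how a control system might check observed alarm rates against a rolling 10-minute window. The thresholds loosely follow the EEMUA 191 figures cited above; the class and status names are our own illustration, not part of any rig vendor's actual software.

```python
from collections import deque

# Thresholds loosely based on EEMUA 191 guidance (see Table 1):
# roughly 1 alarm per 5 minutes in normal operation, and fewer than
# 10 alarms per 10 minutes at peak. Names here are illustrative.
NORMAL_LIMIT = 2      # alarms per 10-minute window (~1 per 5 minutes)
PEAK_LIMIT = 10       # alarms per 10-minute window
WINDOW_SECONDS = 600  # 10 minutes

class AlarmRateMonitor:
    """Tracks alarm timestamps and flags windows that exceed guidance."""

    def __init__(self):
        self.timestamps = deque()

    def record(self, t):
        """Record an alarm at time t (seconds); return a status string."""
        self.timestamps.append(t)
        # Discard alarms that have aged out of the 10-minute window.
        while self.timestamps and t - self.timestamps[0] > WINDOW_SECONDS:
            self.timestamps.popleft()
        count = len(self.timestamps)
        if count >= PEAK_LIMIT:
            return "flood"      # exceeds even the peak guidance
        if count > NORMAL_LIMIT:
            return "elevated"   # above the normal-operation rate
        return "normal"
```

Fed the 50-alarms-per-10-minutes rate reported in practice,4 such a monitor would report a sustained flood condition almost immediately.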
Typical alarm issues include calibration errors, flooding, buried alarms, improper prioritization, nuisance alarms, and alarms that are missing altogether. Athens Group recently performed a Failure Modes, Effects, and Criticality Analysis (FMECA) and found that vital alarms might not be acted upon in time for two main reasons:
- the alarms aren’t categorized by priority, or
- thousands or even tens of thousands of alarms are being displayed every day.
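Both failure modes above stem from presenting alarms in raw arrival order. A minimal sketch of the alternative, priority-first presentation, is shown below; the priority levels and class names are hypothetical illustrations, not taken from any actual drilling-control system.

```python
import heapq

# Hypothetical priority levels: lower number = higher priority.
CRITICAL, WARNING, INFO = 0, 1, 2

class AlarmQueue:
    """Presents alarms by priority so critical ones aren't buried."""

    def __init__(self):
        self._heap = []
        self._seq = 0  # tie-breaker preserves arrival order within a priority

    def raise_alarm(self, priority, message):
        heapq.heappush(self._heap, (priority, self._seq, message))
        self._seq += 1

    def next_alarm(self):
        """Return the highest-priority pending alarm message, or None."""
        if not self._heap:
            return None
        _priority, _seq, message = heapq.heappop(self._heap)
        return message
```

With such a scheme, even after thousands of INFO-level notifications, a single CRITICAL alarm (say, a trip-tank overflow) surfaces first rather than scrolling off the bottom of the screen.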
The typical number of potential alarms on a drilling rig is astounding. In one project, for example, researchers compiled an inventory of all alarms generated by drill-floor equipment. The final list ran more than 90 pages and contained more than 2,700 alarms.8
There is, unfortunately, a great deal of precedent for software failures having a serious impact on rig equipment.9 Many of these failures led to loss of life, injury to rig hands, or expensive damage to equipment. Some of the failures led to environmental issues. To illustrate the relationship of missed alarms to certain failures in rigs, consider the following real cases.
Case 1: Buried Alarm
While tripping out — that is, pulling the drill pipe out of the well — a driller on an offshore rig arrived at the lower portion of the drill pipe. When you pull a drill pipe out of a well, a volume of mud equal to that which the pipe displaced must be pumped back in as a replacement. To ensure that this process was functioning correctly, the driller performed a flow check before continuing the process. No irregularities were found, so the driller continued.
When an alarm sounded to indicate that the trip tank (a receptacle used to ascertain the exact amount of mud displaced by the drill pipe) was full, the assistant driller acknowledged the alarm without realizing what it indicated. The trip tank then overflowed, discharging approximately 60 barrels of synthetic-based mud overboard. An assessment of alarm priority and annunciation could have helped prevent this alarm from being overlooked.
Case 2: Missed Alarm
In this incident, a mud pump failed on a particular rig. The driller assumed that a bad sensor caused the problem, so he replaced the sensor and the mud pump. However, a subsequent failure of the replacement mud pump then occurred in the same manner as the first. An alarm had indicated the problem’s real cause, but it was buried so deep on the alarm screen that the driller never saw it.
Because no one had ever tested the possible alarms, no one knew that the driller was mishandling this alarm. As a result, unnecessary hardware was purchased. Additionally, the mud pump replacement resulted in the loss of productive time. A review of the alarms supported by the system — of their prioritization and their mapping into the human-machine interface on the driller's chair — could have prevented this.
Case 3: Alarm Calibration Error
During production on a Floating Production, Storage, and Offloading unit, the compressor flow transmitter offset began to increase. The change wasn't automatically detected, so operations personnel didn't notice it, and the offset continued to increase.
Sometime later, compressor vibrations changed signature, indicating bearing changes. The vibration remained below alarm level and thus went undetected. Later, the seal gas cavity temperature increased significantly during an aborted restart, but the temperature increase didn't trigger an alarm, so no corrective action was taken.
As a result, a fire occurred in the gas compression process module, halting production on a US$720 million venture. In this scenario, an alarm audit could have helped prevent this problem by identifying possible failures in alarm communication paths and ensuring that alarms trigger under the appropriate circumstances.
There is a strong possibility that the Deepwater Horizon driller or tool pusher in charge of the drill floor never saw one or more alarm signals related to the problems that eventually led to the explosion. Unfortunately, we can't reconstruct the software events leading to the BP spill disaster because there's no "black box" on the rigs. Some rigs carry a "flight recorder," but it records only some of the messages on the drilling control network; the subsea, power, and vessel-management system networks are ignored entirely.
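The missing "black box" the paragraph above describes need not be exotic. Here is a minimal sketch of a bounded-memory recorder that captures the most recent messages from every rig network, not just the drilling control network. The network names and capacity are illustrative assumptions, not a description of any existing rig system.

```python
from collections import deque

# Illustrative network names; a real rig would have its own taxonomy.
NETWORKS = ("drilling-control", "subsea", "power", "vessel-management")

class FlightRecorder:
    """Ring-buffer event recorder: one bounded buffer per rig network."""

    def __init__(self, capacity=100_000):
        # deque(maxlen=...) silently drops the oldest entries, so memory
        # stays bounded no matter how long the recorder runs.
        self._buffers = {net: deque(maxlen=capacity) for net in NETWORKS}

    def log(self, network, timestamp, message):
        if network not in self._buffers:
            raise ValueError(f"unknown network: {network}")
        self._buffers[network].append((timestamp, message))

    def dump(self, network):
        """Return recorded messages for post-incident analysis."""
        return list(self._buffers[network])
```

After an incident, investigators could call `dump()` per network to reconstruct the final minutes of software activity, the very reconstruction the authors note is impossible today.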
The 2010 Toyota Prius braking problem was quickly blamed on a software failure, without proof. In fact, no software defect has been found to date, and Toyota now suggests that driver error was most likely to blame.10 Yet no one is claiming that a software problem caused the BP oil disaster. Unfortunately, the oil industry is in the same state as the US manufacturing industry was in the 1970s: standards have yet to be fully adopted, and there are no universal risk-mitigation strategies or tight safety controls.
We might never know whether a software problem caused the 2010 Gulf oil spill, but there's clearly a need for better interface standards and software testing for oil rig technology. Perhaps in the future, software could help prevent a catastrophic spill from occurring.
- "Deepwater Horizon Incident—Internal Investigation," draft report, Transocean, 8 June 2010, p. 15; http://energycommerce.house.gov/cms/dlmig/dl/computingnow/li>
- B. Clanton, "Drilling Rig Had Equipment Issues, Witnesses Say—Irregular Procedures also Noted at Hearing," Houston Chronicle, 19 July 2010; www.chron.com/disp/story.mpl/business/7115524.html.
- "Can You Afford the Risk? The Case for Collaboration on Risk Mitigation for High-Specification Offshore Assets," white paper, Athens Group, Feb. 2010.
- Achieving Effective Alarm System Performance: Results of ASM Consortium Benchmarking against the EEMUA Guide for Alarm Systems, Abnormal Situation Management Consortium, Feb. 2005; www.applyhcs.com/publications/interface_design/EffectiveAlarmSystemPerformance_CCPS05.pdf.
- EEMUA 191: Alarm Systems: A Guide to Design, Management and Procurement, Eng. Equipment and Materials Users’ Assoc., 1999.
- ISA 18.2: Management of Alarm Systems for the Process Industries, Int’l Soc. Automation, 2009.
- NPD YA-711: Principles for Alarm System Design, Norwegian Petroleum Directorate, 2001.
- Athens Group, "How to Stop the Flood of Superfluous Alarms and Achieve Alarm Management Compliance," to appear in Proc. Int’l Assoc. Drilling Contractors World Drilling Conf., 2010.
- D. Shafer, "Would You Like Software with That? Where Do We Stand with Oil and Gas (O&G) Exploration and Production (E&P) Software?" white paper, Athens Group, 2005; http://athensgroup.com/nptqhse-resources/articles-and-whitepapers.html.
- J. Garthwaite, "It Wasn't the Software: Toyota Finds Driver Error (Not Code) to Blame," Earth2tech.com, 14 July 2010; http://earth2tech.com/2010/07/14/toyota-finds-driver-error-not-software-to-blame-in-some-runaway-cars
About the Authors
Don Shafer is Chief Safety and Technology Officer of the Athens Group. His research interests include software safety and configuration management. Shafer has an MBA from the University of Denver and is a Certified Software Development Professional (CSDP). Contact him at email@example.com.
Phillip A. Laplante is professor of software engineering at Pennsylvania State University. His research interests include software project management, software testing, requirements engineering, and cyber security. Laplante has a PhD from the Stevens Institute of Technology and is a licensed professional engineer in Pennsylvania and a CSDP. He’s a fellow of IEEE and a member of the IEEE Computer Society’s Board of Governors. Contact him at firstname.lastname@example.org.