Columbia University
There's a saying (attributed to, among many others, Ralph Waldo Emerson, Eleanor Roosevelt, and Alfred E. Neuman of Mad Magazine fame), "Learn from the mistakes of others—you'll never live long enough to make them all yourself." One reason that airplanes are so safe is that crashes are investigated by government agencies, the results are published, and the lessons from one crash go into future airplane design, pilot training, and technology to prevent another. There's nothing like that in the computer world, which is one reason why we see the same failures over and over again. I suggest that we need a Major Cyberincident Investigations Board, which would be charged with performing—and publishing—detailed analyses of the causes of significant computer-related incidents. I'd define "incident" broadly, ranging from serious security problems to noticeable service outages of major websites to power outages that shouldn't have been able to bring down a datacenter but somehow did.
This scheme will face some obvious objections, starting with the assertion that no two computer systems are alike, thus little could be learned. I don't buy that. When you read reports analyzing major system failures, be they airplane crashes, major blackouts, or nuclear reactor problems, you see that no single factor caused the problems. Instead, you find a chain of independent problems, often including design flaws, hardware failures, software glitches, human error, organizational culture, and just plain bad luck; fixing any one of these would have let the system survive. I agree that it's unlikely for any two systems to share identical configurations, but it's highly likely that they have some common elements, including common errors implicated in failures. Internal investigations tend to focus on the immediate technical cause or causes of problems; they rarely look at the larger picture, especially if it tends to inculpate long-standing habits or policies.
Another objection is that companies won't cooperate with such investigations. It might indeed be the case that we'll need new laws or regulations, but there's ample justification for them. For regulated industries or critical infrastructure, the rationale is quite clear; for publicly traded companies, an incident of the scale I'm talking about is certainly large enough to be a "material adverse event" that must be brought to shareholder attention under today's rules. (If a company doesn't want its system designs published, it has a very simple strategy to follow: avoid catastrophic failures.)
Defining "major" is admittedly challenging but not a showstopper if approached by sector. Some industries already have requirements to report incidents of a certain magnitude; this is just an extension of that. A triage function is probably necessary, with smaller incidents dismissed and moderate-sized ones slated for internal (but published) investigation.
The profession stands to benefit in ways other than just the obvious. For one thing, detailed systems designs will be publicly available. We'll get real data that will teach us what really causes failures, which in turn tells us the value of assorted preventive or recovery measures.
More than 40 years ago, Peter G. Neumann noted, "We profit neither from our mistakes nor from our successes. After a success, we tend to go out and make a new collection of mistakes (such as trying to build a grand design after a small system)" (www.multicians.org/pgn-motherhood.html). In the same paper, he also pointed out, "Few papers analyze carefully and conclusively the results of a system development." A Major Cyberincident Investigations Board should help with both problems.