Issue No. 05 - Sept.-Oct. (2012 vol. 10)
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/MSP.2012.115
Robin Bloomfield, City University London
In the past few weeks, we've seen a wave of computer-related system failures, including incidents in the banking sector, mobile phone outages, metro systems' unavailability, and rogue algorithms. As systems become more complex, there might be limits to what we can readily learn from analyzing such incidents' immediate causes. Anne Wetherilt and I just completed a study addressing engineering and systemic risk ("Computer Based Trading and Systemic Risk: A Nuclear Perspective," to be published by Foresight; www.bis.gov.uk/foresight/our-work/projects/current-projects/computer-trading/working-paper). In this study, we try to gain insights from complex adaptive systems. If the complex systems advocates are right, some systems' underlying processes might be such that big incidents are just small ones that have gotten out of control, so short-term monitoring and reaction are doomed to failure. Many incidents will have banal causes but disturbing consequences. The best we can do, the theory goes, is detect consequences early and provide "circuit breakers" to stop the damage quickly and limit cascade failures.
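To make the "circuit breaker" idea concrete, here is a minimal sketch in Python. It is an illustration only, not a mechanism taken from the study: the class name `CircuitBreaker`, the sliding window, and the `max_drop` threshold are all assumptions chosen for the example. The breaker watches a stream of readings (a price, a throughput figure, a service-level metric) and halts activity once the latest reading has fallen too far below the recent peak, regardless of what caused the fall.

```python
from collections import deque


class CircuitBreaker:
    """Hypothetical breaker: halts when a reading drops too far below the recent peak."""

    def __init__(self, window: int, max_drop: float):
        self.window = deque(maxlen=window)  # most recent readings only
        self.max_drop = max_drop            # allowed fractional drop, e.g. 0.10 = 10%
        self.tripped = False

    def observe(self, value: float) -> bool:
        """Record a reading; return True if activity may continue."""
        if self.tripped:
            return False  # stay halted until someone resets the breaker
        self.window.append(value)
        peak = max(self.window)
        # Trip on the consequence (a large drop), not on any diagnosed cause.
        if peak > 0 and (peak - value) / peak > self.max_drop:
            self.tripped = True
        return not self.tripped

    def reset(self) -> None:
        """Manual restart after the halt has been reviewed."""
        self.window.clear()
        self.tripped = False


breaker = CircuitBreaker(window=5, max_drop=0.10)
results = [breaker.observe(v) for v in [100, 99, 98, 97, 85]]
```

In this sketch the breaker stays open until it is reset by hand, which reflects the point that the aim is to limit a cascade rather than to diagnose it in real time. A real deployment would need thresholds and restart rules grounded in evidence about the system in question.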
However, other areas of complex adaptive systems research stress the need to find long-term correlated behaviors that push the system into vulnerable states. In hindsight, we can spot these—the financial bubbles, the "accidents waiting to happen"—but how do we know when we're approaching such states? In some systems, there might be a calm before a storm, so a lack of bad news would be almost as disturbing as the bad news itself.
So how do we get perspective? How can we contrast failures with the day-to-day success of many services, or the success of the London Olympics' infrastructure, for instance? Some basic questions endure: How frequent are computer-based incidents? Are incidents getting worse? Are there any surprising side effects that reveal unanticipated connections, novel mitigations, or successful recoveries from which we can learn? Should we analyze the smaller, more frequent incidents for data on connectivities that could provide key understanding of complex behaviors, possible large-scale disruptions, and potential opportunities for innovation? Or should we focus instead on understanding the successes?
Perhaps the reliability and dependability community is a bit behind the technology curve. We frequently hear about the opportunities and problems of "big data," but it would be nice to turn this into "big evidence." Are there smarter ways to find out what's really going on?
Search data can be used to track the spread of disease (or at least the fear of disease); an absence of search activity can identify areas affected by network loss. We could harvest mobile phone data for population dynamics. The range of sentiment and intelligence analysis systems is intriguing, but in the area of dependability, they're largely untried. Making use of this data will raise privacy and institutional issues and create potential conflicts between dependability and security.
To get evidence, we need more theories and models to help us understand what's needed and which hypotheses to test. Can we do for a wide range of systems what the researchers at the Financial Crisis Observatory at ETH in Zurich are doing: rigorously testing and quantifying our ability to predict, and publishing digital signatures of predictions to allow for independent evaluation?
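One way to picture publishing a verifiable fingerprint of a prediction is a simple hash commitment, sketched below in Python. This is an assumption made for illustration, not a description of the Financial Crisis Observatory's actual procedure, and it uses a plain SHA-256 digest rather than a public-key digital signature. The function names `commit` and `verify` and the fields in the example prediction are hypothetical. The forecaster publishes only the digest today, then later reveals the prediction and the random nonce so that anyone can check that the forecast was not altered after the fact.

```python
import hashlib
import hmac
import json
import secrets
from typing import Tuple


def _digest(prediction: dict, nonce: str) -> str:
    # Canonical JSON so the same prediction always yields the same bytes.
    payload = json.dumps(prediction, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((nonce + payload).encode("utf-8")).hexdigest()


def commit(prediction: dict) -> Tuple[str, str]:
    """Return (digest to publish now, nonce to keep secret until the reveal)."""
    nonce = secrets.token_hex(16)  # stops others guessing the prediction from the digest
    return _digest(prediction, nonce), nonce


def verify(prediction: dict, nonce: str, published_digest: str) -> bool:
    """Check a revealed prediction and nonce against the earlier published digest."""
    return hmac.compare_digest(_digest(prediction, nonce), published_digest)


# Hypothetical example prediction; the field names and values are illustrative only.
forecast = {"system": "demo-service", "event": "major outage", "before": "2013-01-01"}
published, secret_nonce = commit(forecast)
```

The nonce matters because many predictions come from a small set of plausible statements; without it, an observer could hash likely candidates and learn the forecast early. Independent evaluation also needs a trustworthy record of when the digest was published, which this sketch does not provide.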
There's more to this issue than retrospective analysis and incremental learning. Today's systems might be purposefully designed, but they'll evolve and grow. The key to future dependable infrastructure and services is to align incentives (for example, for profitability, resilience, structure, and innovation) to ensure that initial systems are adaptable, then shape their growth so that services and systems develop with the right levels of resilience and redundancy. However, we can't effectively design incentives without evidence and smarter, scalable ways of getting and sharing it.