A Markovian Dependability Model with Cascading Failures
September 2009 (vol. 58 no. 9)
pp. 1238-1249
Srinivasan M. Iyer, Exalead, Inc., San Francisco
Marvin K. Nakayama, New Jersey Institute of Technology, Newark
Alexandros V. Gerbessiotis, New Jersey Institute of Technology, Newark
We develop a continuous-time Markov chain model of a dependability system operating in a randomly changing environment and subject to probabilistic cascading failures. A cascading failure can be thought of as a rooted tree. The root is the component whose failure triggers the cascade, its children are those components that the root's failure immediately caused, the next generation are those components whose failures were immediately caused by the failures of the root's children, and so on. The amount of cascading is unlimited. We consider probabilistic cascading in the sense that the failure of a component of type i causes a component of type j to fail simultaneously with a given probability, with all failures in a cascade being mutually independent. Computing the infinitesimal generator matrix of the Markov chain poses significant challenges because of the exponential growth in the number of trees one needs to consider as the number of components failing in the cascade increases. We provide a recursive algorithm generating all possible trees corresponding to a given transition, along with an experimental study of an implementation of the algorithm on two examples. The numerical results highlight the effects of cascading on the dependability of the models.
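The tree view of a cascade described above can be illustrated with a small sketch. It is not the authors' algorithm: it assumes a simplified setting with exactly one component per type, no changing environment, and failed components processed in breadth-first order, each one bringing down any still-operational component j independently with probability phi[i][j]. The names `phi`, `enumerate_cascades`, and `failed_set_probabilities` are hypothetical and chosen for illustration only.

```python
from itertools import combinations


def enumerate_cascades(phi, root):
    """Yield (edges, probability) for every cascade tree rooted at `root`.

    Assumption (not from the paper): one component per type, and failed
    components are expanded in breadth-first (FIFO) order.
    phi[i][j] is the probability that the failure of i immediately fails j.
    """
    n = len(phi)

    def expand(queue, failed, edges, prob):
        if not queue:
            # No failed component left to expand: the tree is complete.
            yield tuple(edges), prob
            return
        parent, rest = queue[0], queue[1:]
        up = [j for j in range(n) if j not in failed]
        # Try every subset of still-operational components as children.
        for k in range(len(up) + 1):
            for children in combinations(up, k):
                p = prob
                for j in up:
                    p *= phi[parent][j] if j in children else 1.0 - phi[parent][j]
                if p == 0.0:
                    continue  # impossible branch, prune it
                yield from expand(
                    rest + list(children),
                    failed | set(children),
                    edges + [(parent, c) for c in children],
                    p,
                )

    yield from expand([root], frozenset([root]), [], 1.0)


def failed_set_probabilities(phi, root):
    """Sum tree probabilities by the final set of failed components."""
    totals = {}
    for edges, p in enumerate_cascades(phi, root):
        failed = frozenset({root} | {child for _, child in edges})
        totals[failed] = totals.get(failed, 0.0) + p
    return totals


if __name__ == "__main__":
    # Hypothetical 3-component example; the values are made up.
    example_phi = [
        [0.0, 0.5, 0.2],
        [0.0, 0.0, 0.5],
        [0.0, 0.1, 0.0],
    ]
    for failed, p in failed_set_probabilities(example_phi, root=0).items():
        print(sorted(failed), p)
```

Several distinct trees can lead to the same set of failed components, which is why the sketch sums over trees per final set. The subset enumeration inside `expand` grows exponentially with the number of operational components, a small-scale view of the growth the abstract identifies as the main computational challenge.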

Index Terms:
Availability, reliability modeling, Markov processes, trees, cascading failures.
Srinivasan M. Iyer, Marvin K. Nakayama, Alexandros V. Gerbessiotis, "A Markovian Dependability Model with Cascading Failures," IEEE Transactions on Computers, vol. 58, no. 9, pp. 1238-1249, Sept. 2009, doi:10.1109/TC.2009.31