Analysis and Modeling of Correlated Failures in Multicomputer Systems
May 1992 (vol. 41 no. 5)
pp. 567-577

Based on measurements from two DEC VAX-cluster multicomputer systems, the issue of correlated failures is addressed. In particular, the characteristics of correlated failures, their impact on dependability, and their modeling are discussed. The data show that most correlated failures are related to errors in shared resources and propagate from one machine to another. Comparisons between measurement-based models and analytical models that assume failure independence show that the impact of correlated failures on dependability is significant. Two validated models, the c-dependent model and the p-dependent model, are developed to evaluate the dependability of systems with correlated failures.

Index Terms:
correlated failures; multicomputer systems; DEC VAX-cluster; dependability; shared resources; c-dependent model; p-dependent model; computation theory; fault tolerant computing; multiprocessing systems.
Citation:
D. Tang, R.K. Iyer, "Analysis and Modeling of Correlated Failures in Multicomputer Systems," IEEE Transactions on Computers, vol. 41, no. 5, pp. 567-577, May 1992, doi:10.1109/12.142683