Dependability Measurement and Modeling of a Multicomputer System
January 1993 (vol. 42 no. 1)
pp. 62-75

A measurement-based analysis of error data collected from a DEC VAXcluster multicomputer system is presented. Basic system dependability characteristics, such as error/failure distributions and hazard rates, are obtained for both the individual machines and the entire VAXcluster. Markov reward models are developed to analyze error/failure behavior and to evaluate performance loss due to errors/failures. Correlation analysis is then performed to quantify relationships of errors/failures across machines and across time. It is found that shared resources constitute a major reliability bottleneck. It is shown that, for the measured system, the homogeneous Markov model, which assumes constant failure rates, overestimates the transient reward rate for short-term operation and underestimates it for long-term operation. Correlation analysis shows that errors are highly correlated across machines and across time. The failure correlation coefficient is low; however, its effect on system unavailability is significant.
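
To make the term "transient reward rate" concrete, the following is a minimal sketch of a homogeneous (constant-rate) Markov reward model with only two states, up and down. It is not the model developed in the paper: the two-state structure, the function name, the reward values, and the failure and repair rates are all assumptions chosen for illustration, and none of the numbers are measured VAXcluster data.

```python
import math


def transient_reward_rate(t, fail_rate, repair_rate,
                          reward_up=1.0, reward_down=0.0):
    """Expected reward rate at time t for a two-state Markov reward model.

    Hypothetical model: the system starts in the "up" state, fails at a
    constant rate fail_rate, and is repaired at a constant rate repair_rate.
    """
    total = fail_rate + repair_rate
    steady_up = repair_rate / total  # long-run probability of being up
    # Closed-form transient solution of the two-state chain, starting up.
    p_up = steady_up + (1.0 - steady_up) * math.exp(-total * t)
    return reward_up * p_up + reward_down * (1.0 - p_up)


if __name__ == "__main__":
    # Illustrative rates per hour; not taken from the paper.
    for hours in (0.0, 1.0, 10.0, 100.0):
        rate = transient_reward_rate(hours, fail_rate=0.01, repair_rate=0.5)
        print(f"t = {hours:6.1f} h   expected reward rate = {rate:.4f}")
```

In this sketch the expected reward rate starts at the up-state reward and decays toward the steady-state value repair_rate / (fail_rate + repair_rate). A model with time-varying rates would trace a different curve over the same interval, which is the kind of discrepancy the abstract reports for the homogeneous model.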

Index Terms:
dependability measurement; correlation analysis; Markov reward models; modeling; multicomputer system; measurement-based analysis; error data; DEC VAXcluster; system dependability characteristics; hazard rate; performance loss; transient reward rate; system unavailability; fault tolerant computing; multiprocessing systems; performance evaluation.
Citation:
D. Tang, R.K. Iyer, "Dependability Measurement and Modeling of a Multicomputer System," IEEE Transactions on Computers, vol. 42, no. 1, pp. 62-75, Jan. 1993, doi:10.1109/12.192214