A Large-Scale Study of Failures in High-Performance Computing Systems
October-December 2010 (vol. 7 no. 4)
pp. 337-351
Bianca Schroeder, University of Toronto, Toronto
Garth A. Gibson, Carnegie Mellon University, Pittsburgh
Designing highly dependable systems requires a good understanding of failure characteristics. Unfortunately, little raw data on failures in large IT installations is publicly available. This paper analyzes failure data collected at two large high-performance computing sites. The first data set was collected over the past nine years at Los Alamos National Laboratory (LANL) and has recently been made publicly available. It covers 23,000 failures recorded on more than 20 different systems at LANL, mostly large clusters of SMP and NUMA nodes. The second data set was collected over a period of one year on one large supercomputing system comprising 20 nodes and more than 10,000 processors. We study the statistics of the data, including the root cause of failures, the mean time between failures, and the mean time to repair. We find, for example, that average failure rates differ widely across systems, ranging from 20 to 1,000 failures per year, and that time between failures is modeled well by a Weibull distribution with decreasing hazard rate. From one system to another, mean repair time varies from less than an hour to more than a day, and repair times are well modeled by a lognormal distribution.
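
As an illustration of the two distributions named in the abstract, the following minimal Python sketch evaluates a Weibull hazard rate and compares the mean and median of a lognormal distribution. The function names and every parameter value are assumptions chosen for illustration only; they are not fitted values or code from the paper. A Weibull shape parameter below 1 gives a hazard rate that decreases with the time since the last failure, and a lognormal distribution has a mean above its median, reflecting a long tail of slow repairs.

```python
import math


def weibull_hazard(t, shape, scale):
    """Weibull hazard rate h(t) = (shape/scale) * (t/scale)**(shape - 1), for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def lognormal_mean(mu, sigma):
    """Mean of a lognormal distribution with log-scale parameters mu and sigma."""
    return math.exp(mu + sigma ** 2 / 2)


def lognormal_median(mu, sigma):
    """Median of a lognormal distribution; it does not depend on sigma."""
    return math.exp(mu)


if __name__ == "__main__":
    # Hypothetical parameters (NOT from the paper): shape < 1, scale in hours.
    shape, scale = 0.7, 100.0
    for t in (1.0, 10.0, 100.0):
        print(f"hazard at t={t:>5} h: {weibull_hazard(t, shape, scale):.5f} per hour")

    # Hypothetical lognormal repair-time parameters (log of minutes).
    mu, sigma = 4.0, 1.5
    print(f"median repair time: {lognormal_median(mu, sigma):.1f} min")
    print(f"mean repair time:   {lognormal_mean(mu, sigma):.1f} min")
```

With a shape parameter of exactly 1 the Weibull distribution reduces to the exponential distribution and the hazard rate is constant at 1/scale, which is the memoryless case that a decreasing hazard rate departs from.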

Index Terms:
Large-scale systems, high-performance computing, supercomputing, reliability, failures, node outages, field study, empirical study, repair time, time between failures, root cause.
Bianca Schroeder, Garth A. Gibson, "A Large-Scale Study of Failures in High-Performance Computing Systems," IEEE Transactions on Dependable and Secure Computing, vol. 7, no. 4, pp. 337-351, Oct.-Dec. 2010, doi:10.1109/TDSC.2009.4