ASCII Text
Bianca Schroeder, Garth A. Gibson, "A Large-Scale Study of Failures in High-Performance Computing Systems," IEEE Transactions on Dependable and Secure Computing, vol. 7, no. 4, pp. 337-351, October-December, 2010.

BibTeX
@article{10.1109/TDSC.2009.4,
  author    = {Bianca Schroeder and Garth A. Gibson},
  title     = {A Large-Scale Study of Failures in High-Performance Computing Systems},
  journal   = {IEEE Transactions on Dependable and Secure Computing},
  volume    = {7},
  number    = {4},
  issn      = {1545-5971},
  year      = {2010},
  pages     = {337-351},
  doi       = {http://doi.ieeecomputersociety.org/10.1109/TDSC.2009.4},
  publisher = {IEEE Computer Society},
  address   = {Los Alamitos, CA, USA}
}

RefWorks Procite/RefMan/Endnote
TY  - JOUR
JO  - IEEE Transactions on Dependable and Secure Computing
TI  - A Large-Scale Study of Failures in High-Performance Computing Systems
IS  - 4
SN  - 1545-5971
SP  - 337
EP  - 351
EPD - 337-351
A1  - Bianca Schroeder
A1  - Garth A. Gibson
PY  - 2010
KW  - Large-scale systems
KW  - high-performance computing
KW  - supercomputing
KW  - reliability
KW  - failures
KW  - node outages
KW  - field study
KW  - empirical study
KW  - repair time
KW  - time between failures
KW  - root cause.
VL  - 7
JA  - IEEE Transactions on Dependable and Secure Computing
ER  -

[1] The raw data and more information are available at the following two URLs: http://www.pdl.cmu.edu/FailureData/ and http://www.lanl.gov/projects/computerscience/data/, 2006.
[2] X. Castillo and D. Siewiorek, "Workload, Performance, and Reliability of Digital Computing Systems," Proc. Int'l Symp. Fault-Tolerant Computing (FTCS-11), 1981.
[3] J. Gray, "Why do Computers Stop and What Can be Done About It," Proc. Fifth Symp. Reliability in Distributed Software and Database Systems, 1986.
[4] J. Gray, "A Census of Tandem System Availability Between 1985 and 1990," IEEE Trans. Reliability, vol. 39, no. 4, pp. 409-418, Oct. 1990.
[5] T. Heath, R.P. Martin, and T.D. Nguyen, "Improving Cluster Availability Using Workstation Validation," Proc. ACM SIGMETRICS, 2002.
[6] R.K. Iyer, D.J. Rossetti, and M.C. Hsueh, "Measurement and Modeling of Computer Reliability as Affected by System Activity," ACM Trans. Computer Systems, vol. 4, no. 3, 1986.
[7] M. Kalyanakrishnam, Z. Kalbarczyk, and R. Iyer, "Failure Data Analysis of a LAN of Windows NT Based Computers," Proc. 18th Symp. Reliable Distributed Systems (SRDS-18), 1999.
[8] G.P. Kavanaugh and W.H. Sanders, "Performance Analysis of Two Time-Based Coordinated Checkpointing Protocols," Proc. Pacific Rim Int'l Symp. Fault-Tolerant Systems, 1997.
[9] T.-T.Y. Lin and D.P. Siewiorek, "Error Log Analysis: Statistical Modeling and Heuristic Trend Analysis," IEEE Trans. Reliability, vol. 39, no. 4, pp. 419-432, Oct. 1990.
[10] D. Long, A. Muir, and R. Golding, "A Longitudinal Survey of Internet Host Reliability," Proc. 14th Symp. Reliable Distributed Systems (SRDS-14), 1995.
[11] J. Meyer and L. Wei, "Analysis of Workload Influence on Dependability," Proc. Int'l Symp. Fault-Tolerant Computing (FTCS), 1988.
[12] B. Mullen and D.R., "Lifecycle Analysis Using Software Defects Per Million (SWDPM)," Proc. 16th Int'l Symp. Software Reliability (ISSRE '05), 2005.
[13] B. Murphy and T. Gent, "Measuring System and Software Reliability Using an Automated Data Collection Process," Quality and Reliability Eng. Int'l, vol. 11, no. 5, 1995.
[14] S. Nath, H. Yu, P.B. Gibbons, and S. Seshan, "Subtleties in Tolerating Correlated Failures," Proc. Symp. Networked Systems Design and Implementation (NSDI'06), 2006.
[15] D. Nurmi, J. Brevik, and R. Wolski, "Modeling Machine Availability in Enterprise and Wide Area Distributed Computing Environments," Proc. European Conf. Parallel Computing (Euro-Par '05), 2005.
[16] D.L. Oppenheimer, A. Ganapathi, and D.A. Patterson, "Why do Internet Services Fail, and What Can be Done About It?" Proc. USENIX Symp. Internet Technologies and Systems, 2003.
[17] J.S. Plank and W.R. Elwasif, "Experimental Assessment of Workstation Failures and Their Impact on Checkpointing Systems," Proc. Int'l Symp. Fault-Tolerant Computing (FTCS '98), 1998.
[18] S.M. Ross, Introduction to Probability Models, Academic Press, 1997.
[19] R.K. Sahoo, A. Sivasubramaniam, M.S. Squillante, and Y. Zhang, "Failure Data Analysis of A Large-Scale Heterogeneous Server Environment," Proc. Dependable Systems and Networks (DSN '04), 2004.
[20] B. Schroeder and G.A. Gibson, "A Large-Scale Study of Failures in High-Performance Computing Systems," Proc. Dependable Systems and Networks (DSN '06), 2006.
[21] B. Schroeder and G.A. Gibson, "Disk Failures in the Real World: What Does An MTTF of 1,000,000 Hours Mean to You?" Proc. Fifth USENIX Conf. File and Storage Technologies (FAST '07), 2007.
[22] D. Tang, R.K. Iyer, and S.S. Subramani, "Failure Analysis and Modelling of A VAX Cluster System," Proc. Int'l Symp. Fault-Tolerant Computing (FTCS), 1990.
[23] T. Tannenbaum and M. Litzkow, "The Condor Distributed Processing System," Dr. Dobb's J., 1995.
[24] N.H. Vaidya, "A Case For Two-Level Distributed Recovery Schemes," Proc. ACM SIGMETRICS, 1995.
[25] W. Willinger, M.S. Taqqu, R. Sherman, and D.V. Wilson, "Self-Similarity Through High-Variability: Statistical Analysis of Ethernet LAN Traffic at the Source Level," IEEE/ACM Trans. Networking, vol. 5, no. 1, pp. 71-86, 1997.
[26] K.F. Wong and M. Franklin, "Checkpointing in Distributed Computing Systems," J. Parallel and Distributed Computing, vol. 35, no. 1, pp. 67-75, May 1996.
[27] J. Xu, Z. Kalbarczyk, and R.K. Iyer, "Networked Windows NT System Field Failure Data Analysis," Proc. 1999 Pacific Rim Int'l Symp. Dependable Computing, 1999.
[28] Y. Zhang, M.S. Squillante, A. Sivasubramaniam, and R.K. Sahoo, "Performance Implications of Failures in Large-Scale Cluster Scheduling," Proc. 10th Workshop Job Scheduling Strategies for Parallel Processing, 2004.

