BlueGene/L Failure Analysis and Prediction Models
International Conference on Dependable Systems and Networks (DSN'06)
Philadelphia, Pennsylvania, June 25-June 28
ISBN: 0-7695-2607-1
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/DSN.2006.18
The growing computational and storage needs of several scientific applications mandate the deployment of extreme-scale parallel machines, such as IBM's BlueGene/L, which can accommodate as many as 128K processors. One of the challenges when designing and deploying these systems in a production setting is the need to take failure occurrences, whether in the hardware or in the software, into account. Earlier work has shown that conventional runtime fault-tolerant techniques such as periodic checkpointing are not effective for these emerging systems. Instead, the ability to predict failure occurrences can help develop more effective checkpointing strategies. Failure prediction has long been regarded as a challenging research problem, mainly due to the lack of realistic failure data from actual production systems. In this study, we have collected RAS event logs from BlueGene/L over a period of more than 100 days. We have investigated the characteristics of fatal failure events, as well as the correlation between fatal events and non-fatal events. Based on these observations, we have developed three simple yet effective failure prediction methods, which can predict around 80% of the memory and network failures, and 47% of the application I/O failures.
Citation:
Yinglung Liang, Yanyong Zhang, Anand Sivasubramaniam, Morris Jette, Ramendra Sahoo, "BlueGene/L Failure Analysis and Prediction Models," dsn, pp. 425-434, International Conference on Dependable Systems and Networks (DSN'06), 2006