From the June 2012 Issue
The Reliability Wall for Exascale Supercomputing
By Xuejun Yang, Zhiyuan Wang, Jingling Xue, & Yun Zhou
Reliability is a key challenge that must be understood to turn the vision of exascale supercomputing into reality. Inevitably, large-scale supercomputing systems, especially those at the peta/exascale levels, must tolerate failures by incorporating fault-tolerance mechanisms that improve their reliability and availability. Because the benefits of fault-tolerance mechanisms rarely come without associated time and/or capital costs, reliability will limit the scalability of parallel applications. This paper introduces, for the first time, the concept of the "Reliability Wall" to highlight the significance of achieving scalable performance in peta/exascale supercomputing with fault tolerance.