2011 IEEE International Conference on Cluster Computing (2011)
Austin, Texas USA
Sept. 26, 2011 to Sept. 30, 2011
With rapid improvements in technology, we are now moving toward exascale computing. Many experts predict that exascale computers will have millions of nodes, billions of threads of execution, hundreds of petabytes of main memory, and exabytes of persistent storage. For systems of such scale, frequent failures are becoming a serious concern. One of the most important reasons is that failures are hard to detect in a large-scale system, so failure repair may take substantial time. In this paper, we investigate the effect of delayed failure repairing on two popular types of high-performance computing systems: IBM Blue Gene/P and general clusters. We analyze how delayed failure repairing affects the performance of jobs when some computing units are at fault but not fixed in time. Our study is based on real workload traces and RAS logs collected from production supercomputing systems. Our trace-based simulations indicate that fast failure detection and recovery are essential for moving toward petascale computing and beyond.
performance impact, resource management, delayed failure repairing
Z. Zheng, W. Tang, Z. Lan, N. Desai and Z. Zhou, "Evaluating Performance Impacts of Delayed Failure Repairing on Large-Scale Systems," 2011 IEEE International Conference on Cluster Computing (CLUSTER), Austin, Texas USA, 2011, pp. 532-536.