This Article 
 Bibliographic References 
 Add to: 
Highly Scalable Self-Healing Algorithms for High Performance Scientific Computing
November 2009 (vol. 58 no. 11)
pp. 1512-1524
Zizhong Chen, Colorado School of Mines, Golden
Jack Dongarra, University of Tennessee, Knoxville
As the number of processors in today's high-performance computers continues to grow, the mean-time-to-failure of these computers is becoming significantly shorter than the execution time of many current high-performance computing applications. Although today's architectures are usually robust enough to survive node failures without suffering complete system failure, most of today's high-performance computing applications cannot survive node failures. Therefore, whenever a node fails, all surviving processes on surviving nodes usually have to be aborted and the whole application has to be restarted. In this paper, we present a framework for building self-healing high-performance numerical computing applications so that they can adapt to node or link failures without aborting themselves. The framework is based on FT-MPI and diskless checkpointing. Our diskless checkpointing uses weighted checksum schemes, a variation of Reed-Solomon erasure codes over floating-point numbers. We introduce several scalable encoding strategies into the existing diskless checkpointing and reduce the overhead to survive k failures in p processes from 2 \lceil \log p \rceil . k ((\beta + 2\gamma ) m + \alpha ) to (1 + O ({\sqrt{p}\over \sqrt{m}} ) )^2 . k (\beta + 2\gamma ) m, where \alpha is the communication latency, {1\over \beta } is the network bandwidth between processes, {1\over \gamma } is the rate to perform calculations, and m is the size of local checkpoint per process. When additional checkpoint processors are used, the overhead can be reduced to (1 + O ({1\over \sqrt{m}} ) ) . k (\beta + 2\gamma ) m, which is independent of the total number of computational processors. The introduced self-healing algorithms are scalable in the sense that the overhead to survive k failures in p processes does not increase as the number of processes p increases. We evaluate the performance overhead of our self-healing approach by using a preconditioned conjugate gradient equation solver as an example. Experimental results demonstrate that our self-healing scheme can survive multiple simultaneous process failures with low-performance overhead and little numerical impact.

[1] N.R. Adiga et al. “An Overview of the BlueGene/L Supercomputer,” Proc. Supercomputing Conf. (SC '02), pp. 1-22, 2002.
[2] R. Barrett, M. Berry, T.F. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo, C. Romine, and H.V. der Vorst, Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, second ed. SIAM, 1994.
[3] F. Berman, G. Fox, and A. Hey, Grid Computing: Making the Global Infrastructure a Reality. Wiley, 2003.
[4] Z. Chen and J. Dongarra, “Numerically Stable Real Number Codes Based on Random Matrices,” Proc. Fifth Int'l Conf. Computational Science (ICCS '05), May 2005.
[5] Z. Chen and J. Dongarra, “Condition Numbers of Gaussian Random Matrices,” SIAM J. Matrix Analysis and Applications, vol. 27, no. 3, pp. 603-620, 2005.
[6] Z. Chen, J. Dongarra, P. Luszczek, and K. Roche, “Self-Adapting Software for Numerical Linear Algebra and LAPACK for Clusters,” Parallel Computing, vol. 29, nos. 11/12, pp. 1723-1743, Nov./Dec. 2003.
[7] Z. Chen, G.E. Fagg, E. Gabriel, J. Langou, T. Angskun, G. Bosilca, and J. Dongarra, “Fault Tolerant High Performance Computing by a Coding Approach,” Proc. ACM SIGPLAN Symp. Principles and Practice of Parallel Programming (PPoPP '05), June 2005.
[8] T.C. Chiueh and P. Deng, “Evaluation of Checkpoint Mechanisms for Massively Parallel Machines,” Proc. 26th Ann. Int'l Symp. Fault-Tolerant Computing (FTCS '96), pp. 370-379, 1996.
[9] J. Dongarra, H. Meuer, and E. Strohmaier, “TOP500 Supercomputer Sites, 24th Edition,” Proc. Supercomputing Conf. (SC'2004), 2004.
[10] A. Edelman, “Eigenvalues and Condition Numbers of Random Matrices,” SIAM J. Matrix Analysis and Applications, vol. 9, no. 4, pp. 543-560, 1988.
[11] G.E. Fagg and J. Dongarra, “FT-MPI: Fault Tolerant MPI, Supporting Dynamic Applications in a Dynamic World,” Proc. Parallel Virtual Machine/Message Passing Interface Conf. (PVM/MPI '00), pp. 346-353, 2000.
[12] G.E. Fagg, E. Gabriel, G. Bosilca, T. Angskun, Z. Chen, J. Pjesivac-Grbovic, K. London, and J.J. Dongarra, “Extending the MPI Specification for Process Fault Tolerance on High Performance Computing Systems,” Proc. Int'l Supercomputer Conf., 2004.
[13] G.E. Fagg, E. Gabriel, Z. Chen, T. Angskun, G. Bosilca, J. Pjesivac-Grbovic, and J.J. Dongarra, “Process Fault-Tolerance: Semantics, Design and Applications for High Performance Computing,” Int'l J. High Performance Computing Applications, vol. 19, no. 4, pp. 465-477, 2005.
[14] I. Foster and C. Kesselman, The Grid: Blueprint for a New Computing Infrastructure. Morgan Kauffman, 1999.
[15] E. Gelenbe, “On the Optimum Checkpoint Interval,” J. ACM, vol. 26, no. 2, pp. 259-270, 1979.
[16] W. Gropp, E. Lusk, N. Doss, and A. Skjellum, “A High-Performance, Portable Implementation of the MPI Message Passing Interface Standard,” Parallel Computing, vol. 22, no. 6, pp. 789-828, Sept. 1996.
[17] G.H. Golub and C.F. Van Loan, Matrix Computations. The Johns Hopkins Univ. Press, 1989.
[18] Y. Kim, “Fault Tolerant Matrix Operations for Parallel and Distributed Systems,” PhD dissertation, Univ. of Tennessee, June 1996.
[19] Message Passing Interface Forum “MPI: A Message Passing Interface Standard,” Technical Report ut-cs-94-230, Univ. of Tennessee, 1994.
[20] J.S. Plank, “A Tutorial on Reed-Solomon Coding for Fault-Tolerance in RAID-Like Systems,” Software—Practice & Experience, vol. 27, no. 9, pp. 995-1012, Sept. 1997.
[21] J.S. Plank, Y. Kim, and J. Dongarra, “Fault-Tolerant Matrix Operations for Networks of Workstations Using Diskless Checkpointing,” J. Parallel and Distributed Computing, vol. 43, no. 2, pp.125-138, 1997.
[22] J.S. Plank and K. Li, “Faster Checkpointing with $n+1$ Parity,” Proc. Int'l Symp. Fault-Tolerant Computing (FTCS), pp. 288-297, 1994.
[23] J.S. Plank, K. Li, and M.A. Puening, “Diskless Checkpointing,” IEEE Trans. Parallel and Distributed Systems, vol. 9, no. 10, pp. 972-986, Oct. 1998.
[24] J.S. Plank and M.G. Thomason, “Processor Allocation and Checkpoint Interval Selection in Cluster Computing Systems,” J.Parallel and Distributed Computing, vol. 61, no. 11, pp. 1570-1590, Nov. 2001.
[25] L.M. Silva and J.G. Silva, “An Experimental Study about Diskless Checkpointing,” Proc. EUROMICRO '98 Conf., pp. 395-402, 1998.
[26] N.H. Vaidya, “A Case for Two-Level Recovery Schemes,” IEEE Trans. Computers, vol. 47, no. 6, pp. 656-666, June 1998.
[27] J.W. Young, “A First Order Approximation to the Optimal Checkpoint Interval,” Comm. ACM, vol. 17, no. 9, pp. 530-531, 1974.

Index Terms:
Self-healing, diskless checkpointing, fault tolerance, pipeline, parallel and distributed systems, high-performance computing, Message Passing Interface.
Zizhong Chen, Jack Dongarra, "Highly Scalable Self-Healing Algorithms for High Performance Scientific Computing," IEEE Transactions on Computers, vol. 58, no. 11, pp. 1512-1524, Nov. 2009, doi:10.1109/TC.2009.42
Usage of this product signifies your acceptance of the Terms of Use.