Issue No.10 - October (2009 vol.20)
pp: 1471-1486
Xuejun Yang, National University of Defense Technology, Changsha
Yunfei Du, National University of Defense Technology, Changsha
Panfeng Wang, National University of Defense Technology, Changsha
Hongyi Fu, National University of Defense Technology, Changsha
Jia Jia, National University of Defense Technology, Changsha
As large-scale computer systems grow, their mean time between failures (MTBF) is becoming significantly shorter than the execution time of many current scientific applications. To run to completion, these applications must tolerate hardware failures. Conventional rollback-recovery protocols redo the computation of the crashed process since the last checkpoint on a single processor. As a result, the recovery time of all such protocols is no less than the time between the last checkpoint and the crash. In this paper, we propose a new application-level fault-tolerant approach for parallel applications called the Fault-Tolerant Parallel Algorithm (FTPA), which provides fast self-recovery. When fail-stop failures occur and are detected, all surviving processes recompute the workload of the failed processes in parallel. FTPA, however, requires the user to be involved in fault tolerance. To ease FTPA implementation, we developed Get it Fault-Tolerant (GiFT), a source-to-source precompiler that automates it. We evaluate the performance of FTPA with parallel matrix multiplication and five kernels of the NAS Parallel Benchmarks on a cluster system with 1,024 CPUs. The experimental results show that FTPA performs better than the traditional checkpointing approach.
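The parallel-recomputing idea can be illustrated with a minimal sketch, assuming a row-partitioned matrix multiplication and a single fail-stop failure. This is not the FTPA or GiFT implementation: the function names (`partition`, `matmul_rows`, `run_with_failure`) are hypothetical, and the "processes" are simulated sequentially in one Python program rather than run as real parallel ranks. The point it shows is that the rows owned by the failed rank are split among all survivors instead of being redone by one processor.

```python
def partition(n_items, n_parts):
    """Split range(n_items) into n_parts contiguous, near-equal blocks."""
    base, extra = divmod(n_items, n_parts)
    blocks, start = [], 0
    for i in range(n_parts):
        size = base + (1 if i < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks


def matmul_rows(a, b, rows):
    """Compute the requested rows of a @ b; returns {row_index: row}."""
    inner = range(len(b))
    cols = range(len(b[0]))
    return {
        i: [sum(a[i][k] * b[k][j] for k in inner) for j in cols]
        for i in rows
    }


def run_with_failure(a, b, n_procs, failed_rank):
    """Simulate one failure and recovery by parallel recomputing (assumed scheme)."""
    blocks = partition(len(a), n_procs)
    survivors = [r for r in range(n_procs) if r != failed_rank]
    result = {}

    # Normal phase: each surviving rank computes its own block of rows.
    for rank in survivors:
        result.update(matmul_rows(a, b, blocks[rank]))

    # Recovery phase: the failed rank's rows are divided among the survivors,
    # so each survivor redoes only about 1/(n_procs - 1) of the lost work.
    lost = list(blocks[failed_rank])
    shares = partition(len(lost), len(survivors))
    for rank, share in zip(survivors, shares):
        result.update(matmul_rows(a, b, [lost[i] for i in share]))

    return [result[i] for i in range(len(a))]
```

In a real run the survivors would execute their recovery shares concurrently, which is where the shorter recovery time relative to single-processor rollback would come from; the sketch only checks that the redistributed work reproduces the full result.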
Fault tolerance, fault-tolerant parallel algorithm, fast self-recovery, parallel recomputing.
Xuejun Yang, Yunfei Du, Panfeng Wang, Hongyi Fu, Jia Jia, "FTPA: Supporting Fault-Tolerant Parallel Computing through Parallel Recomputing", IEEE Transactions on Parallel & Distributed Systems, vol.20, no. 10, pp. 1471-1486, October 2009, doi:10.1109/TPDS.2008.231