Athanasia: A User-Transparent and Fault-Tolerant System for Parallel Applications
Oct. 2011 (vol. 22 no. 10)
pp. 1653-1668
Hyungsoo Jung, The University of Sydney, Sydney
Hyuck Han, Seoul National University, Seoul
Heon Y. Yeom, Seoul National University, Seoul
Sooyong Kang, Hanyang University, Seoul
This article presents Athanasia, a user-transparent and fault-tolerant system for parallel applications running on large-scale cluster systems. Cluster systems are regarded as a de facto standard for achieving multi-teraflop computing power, but they have an inherent failure factor that can cause computations to fail. Reliability in parallel computing systems has therefore been studied for a relatively long time, and extensive research has produced many theoretical promises. Despite these rigorous studies, however, practical and easily deployable fault-tolerant systems have not been successfully adopted commercially. Athanasia is a user-transparent checkpointing system for a fault-tolerant Message Passing Interface (MPI) implementation that is primarily based on the sync-and-stop protocol. It supports three critical functionalities that are necessary for fault tolerance: a lightweight failure detection mechanism, dynamic process management that includes process migration, and a consistent checkpoint and recovery mechanism. Its main features are that it requires no modifications to the application code and that it preserves many of the high-performance characteristics of high-speed networks. Experimental results show that Athanasia can be a good candidate for practically deployable fault-tolerant systems in very large, high-performance clusters and that its protocol can be applied easily to a variety of parallel communication libraries.

Index Terms:
User transparency, fault tolerance, message passing interface, parallel systems, Myrinet, InfiniBand, ch_p4.
Hyungsoo Jung, Hyuck Han, Heon Y. Yeom, Sooyong Kang, "Athanasia: A User-Transparent and Fault-Tolerant System for Parallel Applications," IEEE Transactions on Parallel and Distributed Systems, vol. 22, no. 10, pp. 1653-1668, Oct. 2011, doi:10.1109/TPDS.2011.63