Replication-Based Fault Tolerance for MPI Applications
Issue No. 07 - July 2009 (vol. 20)
pp. 997-1010
John Paul Walters, University at Buffalo, Buffalo
Vipin Chaudhary, University at Buffalo, Buffalo
ABSTRACT
As computational clusters increase in size, their mean time to failure decreases drastically. Typically, checkpointing is used to minimize the loss of computation. Most checkpointing techniques, however, require central storage for storing checkpoints. This results in a bottleneck and severely limits the scalability of checkpointing, while dedicated checkpointing networks and storage systems prove too expensive. We propose a scalable, replication-based MPI checkpointing facility. Our reference implementation is based on LAM/MPI; however, it is directly applicable to any MPI implementation. We extend the existing state of fault-tolerant MPI with asynchronous replication, eliminating the need for central or network storage. We evaluate centralized storage, a Sun X4500-based solution, an EMC storage area network (SAN), and the Ibrix commercial parallel file system, and show that they do not scale, particularly beyond 64 CPUs. We demonstrate the low overhead of our checkpointing and replication scheme with the NAS Parallel Benchmarks and the High-Performance LINPACK benchmark in tests of up to 256 nodes, showing that checkpointing and replication can be achieved with much lower overhead than current techniques provide. Finally, we show that the monetary cost of our solution is as low as 25 percent of that of a typical SAN/parallel-file-system-equipped storage system.
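To make the replication scheme concrete, the sketch below illustrates the core idea in plain C with MPI: each rank checkpoints to node-local disk and then asynchronously ships a copy of its image to a ring partner, so no central or network storage is involved. This is a minimal illustration only, not the paper's LAM/MPI-based implementation; the checkpoint contents, file paths, ring-partner choice, and buffer size are all assumptions made for the example.

/* Minimal sketch (not the authors' implementation) of replication-based
 * checkpointing: save locally, then replicate asynchronously to a peer. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CKPT_BYTES (1 << 20)  /* 1 MiB of hypothetical checkpoint state */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char *ckpt    = malloc(CKPT_BYTES);  /* this rank's checkpoint image */
    char *replica = malloc(CKPT_BYTES);  /* a peer's checkpoint image */
    memset(ckpt, rank, CKPT_BYTES);      /* stand-in for real process state */

    /* 1. Write the checkpoint to node-local storage (no central server). */
    char path[64];
    snprintf(path, sizeof path, "/tmp/ckpt.%d", rank);
    FILE *f = fopen(path, "wb");
    if (!f) MPI_Abort(MPI_COMM_WORLD, 1);
    fwrite(ckpt, 1, CKPT_BYTES, f);
    fclose(f);

    /* 2. Asynchronously replicate to a ring partner so the image survives
     *    this node's failure; computation can overlap with the transfer. */
    int to   = (rank + 1) % size;
    int from = (rank - 1 + size) % size;
    MPI_Request reqs[2];
    MPI_Isend(ckpt, CKPT_BYTES, MPI_BYTE, to, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(replica, CKPT_BYTES, MPI_BYTE, from, 0, MPI_COMM_WORLD, &reqs[1]);

    /* ... application computation would continue here while the
     *     replication traffic proceeds in the background ... */

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    /* 3. Persist the peer's replica on this node's local disk. */
    snprintf(path, sizeof path, "/tmp/ckpt.%d.replica", from);
    f = fopen(path, "wb");
    if (!f) MPI_Abort(MPI_COMM_WORLD, 1);
    fwrite(replica, 1, CKPT_BYTES, f);
    fclose(f);

    free(ckpt);
    free(replica);
    MPI_Finalize();
    return 0;
}

Under these assumptions, the example builds with any MPI compiler wrapper (e.g., mpicc) and runs under mpirun; the key design point is that the MPI_Waitall can be deferred, letting replication overlap with application computation rather than serializing checkpoints through shared storage.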
INDEX TERMS
Fault tolerance, checkpointing, MPI, file systems.
CITATION
John Paul Walters and Vipin Chaudhary, "Replication-Based Fault Tolerance for MPI Applications," IEEE Transactions on Parallel & Distributed Systems, vol. 20, no. 7, pp. 997-1010, July 2009, doi:10.1109/TPDS.2008.172.