|
| This Article | ||
| ||
| Share | ||
| Bibliographic References | ||
| Add to: | ||
| | ||
| Search | ||
| ||
| ASCII Text | x | ||
| John Paul Walters, Vipin Chaudhary, "Replication-Based Fault Tolerance for MPI Applications," IEEE Transactions on Parallel and Distributed Systems, vol. 20, no. 7, pp. 997-1010, July, 2009. | |||
| BibTex | x | ||
| @article{ 10.1109/TPDS.2008.172, author = {John Paul Walters and Vipin Chaudhary}, title = {Replication-Based Fault Tolerance for MPI Applications}, journal ={IEEE Transactions on Parallel and Distributed Systems}, volume = {20}, number = {7}, issn = {1045-9219}, year = {2009}, pages = {997-1010}, doi = {http://doi.ieeecomputersociety.org/10.1109/TPDS.2008.172}, publisher = {IEEE Computer Society}, address = {Los Alamitos, CA, USA}, } | |||
| RefWorks Procite/RefMan/Endnote | x | ||
| TY - JOUR JO - IEEE Transactions on Parallel and Distributed Systems TI - Replication-Based Fault Tolerance for MPI Applications IS - 7 SN - 1045-9219 SP997 EP1010 EPD - 997-1010 A1 - John Paul Walters, A1 - Vipin Chaudhary, PY - 2009 KW - Fault tolerance KW - checkpointing KW - MPI KW - file systems. VL - 20 JA - IEEE Transactions on Parallel and Distributed Systems ER - | |||
[1] The MPI Forum, “MPI: A Message Passing Interface,” Proc. Ann. Supercomputing Conf. (SC '93), pp. 878-883, 1993.
[2] Q. Gao , W. Yu , W. Huang , and D.K. Panda , “Application-Transparent Checkpoint/Restart for MPI Programs over InfiniBand,” Proc. 35th Ann. Int'l Conf. Parallel Processing (ICPP '06), pp.471-478, 2006.
[3] G. Burns , R. Daoud , and J. Vaigl , “LAM: An Open Cluster Environment for MPI,” Proc. Supercomputing Symp., pp. 379-386, 1994.
[4] S. Sankaran , J.M. Squyres , B. Barrett , A. Lumsdaine , J. Duell , P. Hargrove , and E. Roman , “The LAM/MPI Checkpoint/Restart Framework: System-Initiated Checkpointing,” Int'l J. High Performance Computing Applications, vol. 19, no. 4, pp. 479-493, 2005.
[5] J.P. Walters and V. Chaudhary , “A Scalable Asynchronous Replication-Based Strategy for Fault Tolerant MPI Applications,” Proc. 14th Int'l Conf. High Performance Computing (HiPC '07), pp.257-268, 2007.
[6] D.H. Bailey , E. Barszcz , J.T. Barton , D.S. Browning , R.L. Carter , L. Dagum , R.A. Fatoohi , P.O. Frederickson , T.A. Lasinski , R.S. Schreiber , H.D. Simon , V. Venkatakrishnan , and S.K. Weeratunga , “The NAS Parallel Benchmarks,” Int'l J. High Performance Computing Applications, vol. 5, no. 3, pp. 63-73, 1991.
[7] J.M. Squyres and A. Lumsdaine , “A Component Architecture for LAM/MPI,” Proc. 10th European PVM/MPI Users' Group Meeting, pp. 379-387, 2003.
[8] InfiniBand Trade Assoc., InfiniBand, http://www.infinibandta. orghome, 2007.
[9] Myricom, Myrinet, http:/www.myricom.com, 2007.
[10] J. Hursey , J.M. Squyres , T.I. Mattox , and A. Lumsdaine , “The Design and Implementation of Checkpoint/Restart Process Fault Tolerance for Open MPI,” Proc. 21st Ann. Int'l Parallel and Distributed Processing Symp. (IPDPS), 2007.
[11] E.N. Elnozahy , L. Alvisi , Y.M. Wang , and D.B. Johnson , “A Survey of Rollback-Recovery Protocols in Message-Passing Systems,” ACM Computing Surveys, vol. 34, no. 3, pp. 375-408, 2002.
[12] J.S. Plank , M. Beck , G. Kingsley , and K. Li , “Libckpt: Transparent Checkpointing under Unix,” Proc. Usenix Winter Technical Conf., pp. 213-223, 1995.
[13] J. Duell , “The Design and Implementation of Berkeley Lab's Linux Checkpoint/Restart,” Technical Report LBNL-54941, Lawrence Berkeley Nat'l Laboratory, 2002.
[14] R. Gioiosa , J.C. Sancho , S. Jiang , and F. Petrini , “Transparent, Incremental Checkpointing at Kernel Level: A Foundation for Fault Tolerance for Parallel Computers,” Proc. Ann. Supercomputing Conf. (SC '05), pp. 9-23, 2005.
[15] Y. Zhang , D. Wong , and W. Zheng , “User-Level Checkpoint and Recovery for LAM/MPI,” ACM SIGOPS Operating Systems Rev., vol. 39, no. 3, pp. 72-81, 2005.
[16] C. Wang , F. Mueller , C. Engelmann , and S.L. Scott , “A Job Pause Service under LAM/MPI $+$ BLCR for Transparent Fault Tolerance,” Proc. 21st Int'l Parallel and Distributed Processing Symp. (IPDPS '07), pp. 116-125, 2007.
[17] H. Jung , D. Shin , H. Han , J.W. Kim , H.Y. Yeom , and J. Lee , “Design and Implementation of Multiple Fault-Tolerant MPI over Myrinet $(M^{3})$ ,” Proc. Ann. Supercomputing Conf. (SC '05), pp. 32-46, 2005.
[18] L. Kalé and S. Krishnan , “CHARM++: A Portable Concurrent Object Oriented System Based on C++,” Proc. Conf. Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA '93), pp. 91-108, 1993.
[19] C. Huang , G. Zheng , S. Kumar , and L.V. Kalé , “Performance Evaluation of Adaptive MPI,” Proc. 11th ACM SIGPLAN Symp. Principles and Practice of Parallel Programming (PPoPP '06), pp.306-322, 2006.
[20] S. Chakravorty , C. Mendes , and L.V. Kalé , “Proactive Fault Tolerance in MPI Applications via Task Migration,” Proc. 13th Int'l Conf. High Performance Computing (HiPC '06), pp. 485-496, 2006.
[21] S. Chakravorty and L.V. Kalé , “A Fault Tolerance Protocol with Fast Fault Recovery,” Proc. 21st Ann. Int'l Parallel and Distributed Processing Symp. (IPDPS '07), pp. 117-126, 2007.
[22] G. Zheng , L. Shi , and L.V. Kalé , “FTC-Charm++: An In-Memory Checkpoint-Based Fault Tolerant Runtime for Charm++ and MPI,” Proc. IEEE Int'l Conf. Cluster Computing (Cluster '04), pp.93-103, 2004.
[23] J.S. Plank , “Improving the Performance of Coordinated Checkpointers on Networks of Workstations Using RAID Techniques,” Proc. 15th Symp. Reliable Distributed Systems (SRDS '96), pp. 76-85, 1996.
[24] J.S. Plank and L. Kai , “Faster Checkpointing with ${\rm N} + 1$ Parity,” Proc. 24th Ann. Int'l Symp. Fault-Tolerant Computing (SFTC '94), pp.288-297, 1994.
[25] R.Y. de Camargo , R. Cerqueira , and F. Kon , “Strategies for Storage of Checkpointing Data Using Non-Dedicated Repositories on Grid Systems,” Proc. Third Int'l Workshop Middleware for Grid Computing, 2005.
[26] X. Ren , R. Eigenmann , and S. Bagchi , “Failure-Aware Checkpointing in Fine-Grained Cycle Sharing Systems,” Proc. 16th Int'l Symp. High Performance Distributed Computing (HPDC '07), pp. 33-42, 2007.
[27] Z. Chen , G.E. Fagg , E. Gabriel , J. Langou , T. Angskun , G. Bosilca , and J. Dongarra , “Fault Tolerant High Performance Computing by a Coding Approach,” Proc. 10th Ann. ACM SIGPLAN Symp. Principles and Practice of Parallel Programming (PPoPP '05), pp.213-223, 2005.
[28] P. Nath , B. Urgaonkar , and A. Sivasubramaniam , “Evaluating the Usefulness of Content Addressable Storage for High-Performance Data Intensive Applications,” Proc. 17th Int'l Symp. High Performance Distributed Computing (HPDC '08), pp.35-44, 2008.
[29] J. Cao , Y. Li , and M. Guo , “Process Migration for MPI Applications Based on Coordinated Checkpoint,” Proc. 11th Ann. Int'l Conf. Parallel and Distributed Systems (ICPADS '05), pp. 306-312, 2005.
[30] V. Zandy , “CKPT: User-Level Checkpointing,” http://www.cs. wisc.edu/~zandyckpt/, 2005.
[31] R.E. Bryant , “Data-Intensive Supercomputing: The Case for DISC,“ Technical Report CMU-CS-07-128, School of Computer Science, Carnegie Mellon Univ., 2007.
[32] S. Gurumurthi , “Should Disks Be Speed Demons or Brainiacs?” ACM SIGOPS Operating Systems Rev., vol. 41, no. 1, pp. 33-36, 2007.
[33] Top500 List, http:/www.top500.org, 2008.
[34] J.J. Dongarra , P. Luszczek , and A. Petitet , “The LINPACK Benchmark: Past, Present, and Future,” Concurrency and Computation: Practice and Experience, vol. 15, pp. 1-18, 2003.
[35] C. Coti , T. Herault , P. Lemarinier , L. Pilard , A. Rezmerita , E. Rodriguez , and F. Cappello , “MPI Tools and Performance Studies—Blocking versus Non-Blocking Coordinated Checkpointing for Large-Scale Fault Tolerant MPI,” Proc. 18th Ann. Supercomputing Conf. (SC '06), pp. 127-140, 2006.
[36] Q. Lv , P. Cao , E. Cohen , K. Li , and S. Shenker , “Search and Replication in Unstructured Peer-to-Peer Networks,” Proc. 16th Int'l Conf. Supercomputing (ICS '02), pp. 84-95, 2002.
[37] S. Hoory , N. Linial , and A. Wigderson , “Expander Graphs and Their Applications,” Bull. of the Am. Math. Soc., vol. 43, no. 4, pp.439-561, 2006.
[38] CiFTS: Coordinated Infrastructure for Fault Tolerant Systems, http://www.mcs.anl.gov/researchcifts/, 2008.
[39] E. Pinheiro , W.-D. Weber , and L.A. Barroso , “Failure Trends in a Large Disk Drive Population,” Proc. Fifth Usenix Conf. File and Storage Technologies (FAST '07), pp. 17-28, 2007.
[40] Q. Gao , W. Huang , M.J. Koop , and D.K. Panda , “Group-Based Coordinated Checkpointing for MPI: A Case Study on InfiniBand,” Proc. 36th Ann. Int'l Conf. Parallel Processing (ICPP '07), pp.47-54, 2007.

