SC Conference (2003)
Phoenix, Arizona
Nov. 15, 2003 to Nov. 21, 2003
ISBN: 1-58113-695-1
pp: 25
Pierre Lemarinier , LRI, Université de Paris Sud, Orsay, France
Géraud Krawezik , LRI, Université de Paris Sud, Orsay, France
Franck Cappello , LRI, Université de Paris Sud, Orsay, France; INRIA Futurs, Saclay, France
Thomas Hérault , LRI, Université de Paris Sud, Orsay, France
Frédéric Magniette , LRI, Université de Paris Sud, Orsay, France
Aurélien Bouteiller , LRI, Université de Paris Sud, Orsay, France
Execution of MPI applications on clusters and Grid deployments that suffer from node and network failures motivates the use of fault-tolerant MPI implementations.

We present MPICH-V2 (the second protocol of the MPICH-V project), an automatic fault-tolerant MPI implementation using an innovative protocol that removes the most limiting factor of the pessimistic message logging approach: reliable logging of in-transit messages. MPICH-V2 relies on uncoordinated checkpointing, sender-based message logging, and remote reliable logging of message logical clocks.

This paper presents the architecture of MPICH-V2, its theoretical foundation, and the performance of the implementation. We compare MPICH-V2 to MPICH-V1 and MPICH-P4, evaluating a) its point-to-point performance, b) its performance on the NAS benchmarks, and c) application performance when many faults occur during execution. Experimental results demonstrate that MPICH-V2 provides performance close to that of MPICH-P4 for applications using large messages, while dramatically reducing the number of reliable nodes compared to MPICH-V1.
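The division of labor named in the abstract (message payloads logged by the sender, logical clocks logged reliably on a remote node) can be illustrated with a toy, single-threaded simulation. This is only a sketch under stated assumptions, not the MPICH-V2 implementation: the names `EventLogger`, `Process`, and `replay` are hypothetical, checkpointing is omitted, and the "remote" logger is an in-memory object whose logging call is synchronous.

```python
from dataclasses import dataclass, field


@dataclass
class EventLogger:
    """Hypothetical stand-in for the reliable remote logger.

    It stores only small event records, never message payloads.
    """
    events: dict = field(default_factory=dict)  # receiver rank -> [(sender, seq, clock)]

    def log(self, receiver, sender, seq, clock):
        self.events.setdefault(receiver, []).append((sender, seq, clock))


@dataclass
class Process:
    rank: int
    logger: EventLogger
    clock: int = 0                                 # receiver-side logical clock
    next_seq: dict = field(default_factory=dict)   # dest rank -> next sequence number
    sent_log: dict = field(default_factory=dict)   # (dest rank, seq) -> payload, kept by the sender
    delivered: list = field(default_factory=list)  # payloads in delivery order

    def send(self, dest, payload):
        seq = self.next_seq.get(dest.rank, 0)
        self.next_seq[dest.rank] = seq + 1
        self.sent_log[(dest.rank, seq)] = payload  # sender-based payload logging
        dest.receive(self.rank, seq, payload)

    def receive(self, sender, seq, payload):
        self.clock += 1
        # Pessimistic step (assumed synchronous here): record the reception
        # event on the logger before the delivery takes effect.
        self.logger.log(self.rank, sender, seq, self.clock)
        self.delivered.append(payload)


def replay(rank, logger, peers):
    """Rebuild a failed process's deliveries in their original order.

    Event records come from the logger; payloads come from the senders' logs.
    """
    restarted = Process(rank, EventLogger())  # scratch logger: replay is not re-logged
    for sender, seq, _clock in sorted(logger.events.get(rank, []), key=lambda e: e[2]):
        restarted.receive(sender, seq, peers[sender].sent_log[(rank, seq)])
    return restarted


logger = EventLogger()
p0, p1, p2 = (Process(r, logger) for r in range(3))
p0.send(p2, "a")
p1.send(p2, "b")
p0.send(p2, "c")

# Pretend p2 failed: recover its delivery order from the logger and the senders.
recovered = replay(2, logger, {0: p0, 1: p1})
print(recovered.delivered)  # ['a', 'b', 'c']
```

In this sketch the only state that must survive a failure of the receiver is the list of small event tuples, which is one way to read the abstract's claim that in-transit messages no longer need reliable logging.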
Pierre Lemarinier, Géraud Krawezik, Franck Cappello, Thomas Hérault, Frédéric Magniette, Aurélien Bouteiller, "MPICH-V2: a Fault Tolerant MPI for Volatile Nodes based on Pessimistic Sender Based Message Logging", SC Conference, vol. 00, pp. 25, 2003, doi:10.1109/SC.2003.10027