Fifth IEEE International Conference on Cluster Computing (CLUSTER'03)
Coordinated Checkpoint versus Message Log for Fault Tolerant MPI
Hong Kong
December 01-December 04
ISBN: 0-7695-2066-9
Aurélien Bouteiller, Université de Paris Sud
Pierre Lemarinier, Université de Paris Sud
Géraud Krawezik, Université de Paris Sud
Franck Cappello, Université de Paris Sud
MPI is one of the most widely adopted programming models for large cluster and Grid deployments. However, these systems often suffer from network or node failures. This raises the issue of selecting a fault tolerance approach for MPI. Automatic and transparent approaches are based on either coordinated checkpointing or message logging combined with uncoordinated checkpointing. There are many protocols, implementations and optimizations for these approaches, but few results comparing them. Coordinated checkpointing has the advantage of a very low overhead on fault-free executions. In contrast, a message logging protocol systematically adds a significant message transfer penalty. The drawbacks of coordinated checkpointing come from its synchronization cost at checkpoint and restart times. In this paper we implement, evaluate and compare the two kinds of protocols, with a special emphasis on their respective performance as a function of fault frequency. The main conclusion (under our experimental conditions) is that message logging becomes relevant for a large-scale cluster at fault rates of one fault per hour or higher, for applications with large datasets.
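The trade-off described in the abstract can be illustrated with a rough first-order cost model. The sketch below is not the authors' method or measurement: the function name, its parameters, and every numeric value are hypothetical placeholders chosen only to show how a protocol with low fault-free overhead but costly checkpoints and restarts can lose to a protocol with a constant slowdown but cheap recovery once faults become frequent.

```python
def expected_runtime(work_h, mtbf_h, slowdown, ckpt_cost_h, restart_cost_h, interval_h):
    """First-order estimate of wall-clock time in hours (illustrative model only).

    All parameters are hypothetical: work_h is the fault-free compute time,
    mtbf_h the mean time between faults, slowdown the protocol's fault-free
    stretch factor, ckpt_cost_h and restart_cost_h the cost of one checkpoint
    and one restart, and interval_h the time between checkpoints.
    """
    # Fault-free time: work stretched by the protocol, plus periodic checkpoints.
    fault_free = work_h * slowdown * (1.0 + ckpt_cost_h / interval_h)
    # Each fault costs one restart plus, on average, half an interval of redone work.
    per_fault = restart_cost_h + interval_h / 2.0
    # Fraction of wall-clock time lost to faults; at 1 or more the run never finishes.
    lost_fraction = per_fault / mtbf_h
    if lost_fraction >= 1.0:
        return float("inf")
    # Solve W = fault_free + (W / mtbf_h) * per_fault for W.
    return fault_free / (1.0 - lost_fraction)


# Placeholder parameter sets, NOT taken from the paper.
COORDINATED = dict(slowdown=1.0, ckpt_cost_h=0.10, restart_cost_h=0.20, interval_h=0.5)
MESSAGE_LOG = dict(slowdown=1.3, ckpt_cost_h=0.02, restart_cost_h=0.02, interval_h=0.5)

if __name__ == "__main__":
    for mtbf_h in (1000.0, 10.0, 1.0):
        coord = expected_runtime(10.0, mtbf_h, **COORDINATED)
        mlog = expected_runtime(10.0, mtbf_h, **MESSAGE_LOG)
        print(f"MTBF {mtbf_h:7.1f} h: coordinated {coord:6.2f} h, message log {mlog:6.2f} h")
```

With these made-up values, the coordinated variant is faster when faults are rare and the message-logging variant is faster when faults are frequent. Where the crossover actually falls depends entirely on measured costs, which is the question the paper addresses experimentally.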
Index Terms:
Fault tolerant MPI, coordinated checkpoint, message log, performance
Citation:
Aurélien Bouteiller, Pierre Lemarinier, Géraud Krawezik, Franck Cappello, "Coordinated Checkpoint versus Message Log for Fault Tolerant MPI," cluster, pp.242, Fifth IEEE International Conference on Cluster Computing (CLUSTER'03), 2003