15th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP'07)
Fault-tolerant solutions for a MPI compute intensive application
Naples, Italy
February 7-9, 2007
ISBN: 0-7695-2784-1
J.C. Mourino, CESGA (Supercomputing Center of Galicia), Spain
M.J. Martin, Univ. A Coruna, Spain
P. Gonzalez, Univ. A Coruna, Spain
R. Doallo, Univ. A Coruna, Spain
The running times of large-scale computational science and engineering parallel applications, executed on clusters or Grid platforms, are usually longer than the mean time between failures (MTBF). Parallel applications must therefore tolerate hardware failures so that not all of the computation already done is lost when a machine fails. Checkpointing and rollback recovery is a very useful technique for implementing fault-tolerant applications. Although extensive research has been carried out in this field, few tools are available to help parallel programmers add fault-tolerance capability to their applications. This work presents two different approaches to endowing the MPI version of an air quality simulation with fault tolerance. A segment-level solution has been implemented by extending a checkpointing library for sequential codes. A variable-level solution has been implemented manually in the code. The main differences between the two approaches are portability, level of transparency and checkpointing overhead. Experimental results comparing both strategies on a cluster of PCs are shown in the paper.
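As an illustration of the variable-level idea described above, the following is a minimal single-process sketch in Python. It is not the authors' implementation and does not use MPI: the function names (`save_checkpoint`, `load_checkpoint`, `run`), the `fail_at` parameter, the pickle file format and the toy accumulation loop are all assumptions made for this example. It only shows the general pattern of saving the loop counter and the live variables at intervals, then resuming from the last saved state after a failure.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, step, state):
    """Write the loop counter and live variables to disk (hypothetical helper)."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as tmp:
        pickle.dump({"step": step, "state": state}, tmp)
        tmp.flush()
        os.fsync(tmp.fileno())
    # Rename over the old file so a crash mid-write cannot leave a
    # half-written checkpoint in place (os.replace is atomic on POSIX).
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return (step, state) from the last checkpoint, or None if none exists."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        data = pickle.load(f)
    return data["step"], data["state"]


def run(path, total_steps, interval, fail_at=None):
    """Toy time-stepping loop that resumes from the last checkpoint if present."""
    restored = load_checkpoint(path)
    step, state = restored if restored is not None else (0, {"acc": 0.0})
    while step < total_steps:
        if fail_at is not None and step == fail_at:
            # Assumption: stands in for a machine failure during the run.
            raise RuntimeError("simulated machine failure")
        state["acc"] += step * 0.5  # stand-in for one simulation time step
        step += 1
        if step % interval == 0:
            save_checkpoint(path, step, state)
    return state["acc"]
```

In this sketch a rerun after a failure repeats only the steps taken since the last checkpoint. In a real MPI code every process would additionally need to checkpoint at a consistent point in the computation, which this single-process example does not address.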
Citation:
J.C. Mourino, M.J. Martin, P. Gonzalez, R. Doallo, "Fault-tolerant solutions for a MPI compute intensive application," pdp, pp.246-253, 15th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP'07), 2007