2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID)
Fault Tolerance Management for a Hierarchical GridRPC Middleware
May 19-May 22
ISBN: 978-0-7695-3156-4
The GridRPC model is well suited to high-performance computing on grids because it efficiently addresses most of the issues raised by geographically and administratively distributed resources. Because of their large scale, long-range networks, and heterogeneity, grids are extremely prone to failures. GridRPC middleware usually manages failures with 1) a failure detector provided by TCP or another network layer, 2) automatic checkpoints of sequential jobs, and 3) a centralized stable agent that performs scheduling. Recent developments have introduced new mechanisms: the optimal failure detector of Chandra, Toueg, and Aguilera; optimized checkpoint routines that most numerical libraries now provide themselves; and GridRPC architectures with distributed scheduling. In this paper we adapt to these developments by providing the first implementation and evaluation in a grid system of the optimal failure detector, a novel and simple checkpoint API that handles both service-provided and automatic checkpoints (even for parallel services), and a recovery algorithm for the scheduling hierarchy that tolerates several simultaneous failures. All of these mechanisms are implemented in the DIET middleware and evaluated on a real grid.
Index Terms:
GridRPC, Fault tolerant, Failure detector, Checkpoint, Distributed algorithm
Citation:
Aurelien Bouteiller, Frederic Desprez, "Fault Tolerance Management for a Hierarchical GridRPC Middleware," ccgrid, pp. 484-491, 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID), 2008