Cluster Computing and the Grid, IEEE International Symposium on (2008)
May 19, 2008 to May 22, 2008
The GridRPC model is well suited for high performance computing on grids thanks to efficiently solving most of the issues raised by geographically and administratively split resources. Because of large scale, long range networks and heterogeneity, Grids are extremely prone to failures. GridRPC middleware are usually managing failures by using 1) TCP or other link network layer provided failure detector, 2) automatic checkpoints of sequential jobs and 3) a centralized stable agent to perform scheduling. Most recent developments have provided some new mechanisms like the optimal Chandra~\&~Toueg~\&~Aguillera failure detector, most numerical libraries now providing their own optimized checkpoint routine and distributed scheduling GridRPC architectures. In this paper we aim at adapting to these novelties by providing the first implementation and evaluation in a grid system of the optimal fault detector, a novel and simple checkpoint API allowing to manage both service provided checkpoint and automatic checkpoint (even for parallel services) and a scheduling hierarchy recovery algorithm tolerating several simultaneous failures. All those mechanisms are implemented and evaluated on a real grid in the \diet middleware.
GridRPC, Fault tolerant, Failure detector, Checkpoint, Distributed algorithm
A. Bouteiller and F. Desprez, "Fault Tolerance Management for a Hierarchical GridRPC Middleware," 2008 8th International Symposium on Cluster Computing and the Grid (CCGRID '08)(CCGRID), Lyon, 2008, pp. 484-491.