18th International Parallel and Distributed Processing Symposium (IPDPS'04) - Workshop 11
A Hierarchical Checkpointing Protocol for Parallel Applications in Cluster Federations
Santa Fe, New Mexico
April 26-30, 2004
ISBN: 0-7695-2132-0
Code coupling applications can be divided into communicating modules that may be executed on different clusters in a cluster federation. As a cluster federation comprises a large number of nodes, there is a high probability of a node failure. We propose a hierarchical checkpointing protocol that combines a synchronized checkpointing technique inside clusters with a communication-induced technique between clusters. This protocol fits the characteristics of a cluster federation (large number of nodes, high-latency and low-bandwidth networking technologies between clusters). A preliminary performance evaluation performed using a discrete event simulator shows that the protocol is suitable for code coupling applications.
Index Terms:
Cluster Federation, Checkpointing and Recovery, Fault-tolerance, Parallel Application, Code Coupling
Citation:
Sébastien Monnet, Christine Morin, Ramamurthy Badrinath, "A Hierarchical Checkpointing Protocol for Parallel Applications in Cluster Federations," ipdps, vol. 12, pp. 211a, 18th International Parallel and Distributed Processing Symposium (IPDPS'04) - Workshop 11, 2004