2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID)
Hierarchical Replication Techniques to Ensure Checkpoint Storage Reliability in Grid Environment
May 19-May 22
ISBN: 978-0-7695-3156-4
BibTeX:
@article{10.1109/CCGRID.2008.95,
  author    = {Fatiha Bouabache and Thomas Herault and Gilles Fedak and Franck Cappello},
  title     = {Hierarchical Replication Techniques to Ensure Checkpoint Storage Reliability in Grid Environment},
  journal   = {Cluster Computing and the Grid, IEEE International Symposium on},
  volume    = {0},
  year      = {2008},
  isbn      = {978-0-7695-3156-4},
  pages     = {475-483},
  doi       = {http://doi.ieeecomputersociety.org/10.1109/CCGRID.2008.95},
  publisher = {IEEE Computer Society},
  address   = {Los Alamitos, CA, USA},
}
As High Performance platforms (clusters, grids, etc.) continue to grow in size, the average time between failures decreases to a critical level. An efficient and reliable fault tolerance protocol therefore plays a key role in High Performance Computing. Rollback recovery is the most common fault tolerance technique used in High Performance Computing, especially in MPI applications. This technique relies on the reliability of the checkpoint storage. Most rollback recovery protocols assume that the checkpoint server machines are reliable. However, in a grid environment any unit can fail at any moment, including the components used to connect different administrative domains. Such failures lead to the loss of a whole set of machines, including the more reliable machines used to store the checkpoints in that administrative domain. Thus it is not safe to rely on the high MTBF (Mean Time Between Failures) of specific machines to store the checkpoint images. This paper introduces a new coordinated checkpoint protocol that tolerates checkpoint server failures and cluster failures, and ensures checkpoint storage reliability in a grid environment. To provide this reliability, the protocol is based on a replication process. We propose new hierarchical replication strategies, with two different degrees of hierarchy, adapted to the topology of a cluster of clusters. Our solution exploits the locality of checkpoint images in order to minimize inter-cluster communication. We evaluate the effectiveness of our two hierarchical replication strategies through simulations against several criteria such as topology and scalability.
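To make the idea of locality-aware, hierarchical replication concrete, the following is a minimal Python sketch. It is not the protocol from the paper: the function names, the `topology` layout, and the parameters `local_copies` and `remote_clusters` are all hypothetical, chosen only to illustrate placing most replicas inside the producing cluster and a small number in other clusters so that an image can survive the loss of a whole cluster.

```python
from typing import Dict, List


def place_replicas(topology: Dict[str, List[str]], home: str,
                   local_copies: int = 2, remote_clusters: int = 1) -> List[str]:
    """Pick checkpoint servers for an image produced in cluster `home`.

    Illustrative policy only (an assumption, not the paper's algorithm):
    keep `local_copies` replicas inside the home cluster and one replica
    in each of `remote_clusters` other clusters.
    """
    if home not in topology:
        raise ValueError(f"unknown cluster: {home}")
    # Intra-cluster level: copies on servers close to the producer,
    # which avoids inter-cluster traffic.
    placement = list(topology[home][:local_copies])
    # Inter-cluster level: a few copies outside the home cluster,
    # so that losing the whole cluster does not lose the image.
    others = sorted(c for c in topology if c != home and topology[c])
    for cluster in others[:remote_clusters]:
        placement.append(topology[cluster][0])
    return placement


def survives(topology: Dict[str, List[str]], placement: List[str],
             failed_cluster: str) -> bool:
    """Return True if at least one replica lies outside the failed cluster."""
    lost = set(topology.get(failed_cluster, []))
    return any(server not in lost for server in placement)


# Hypothetical cluster-of-clusters topology: cluster name -> checkpoint servers.
topology = {
    "A": ["a1", "a2", "a3"],
    "B": ["b1", "b2"],
    "C": ["c1"],
}
placement = place_replicas(topology, home="A")
print(placement, survives(topology, placement, failed_cluster="A"))
```

With `remote_clusters=0` the same image would be stored only in cluster A, which is the situation the abstract warns about: a failure that disconnects the administrative domain makes every copy unreachable at once.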
Index Terms:
Grid, fault-tolerance, Replication, Distributed storage
Citation:
Fatiha Bouabache, Thomas Herault, Gilles Fedak, Franck Cappello, "Hierarchical Replication Techniques to Ensure Checkpoint Storage Reliability in Grid Environment," ccgrid, pp.475-483, 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID), 2008
