2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid
Handling Persistent States in Process Checkpoint/Restart Mechanisms for HPC Systems
Shanghai, China
May 18-May 21
ISBN: 978-0-7695-3622-4
ASCII Text:
Pierre Riteau, Adrien Lèbre, Christine Morin, "Handling Persistent States in Process Checkpoint/Restart Mechanisms for HPC Systems," Cluster Computing and the Grid, IEEE International Symposium on, pp. 404-411, 2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid, 2009.
BibTeX:
@article{10.1109/CCGRID.2009.29,
  author = {Pierre Riteau and Adrien Lèbre and Christine Morin},
  title = {Handling Persistent States in Process Checkpoint/Restart Mechanisms for HPC Systems},
  journal = {Cluster Computing and the Grid, IEEE International Symposium on},
  volume = {0},
  year = {2009},
  isbn = {978-0-7695-3622-4},
  pages = {404-411},
  doi = {http://doi.ieeecomputersociety.org/10.1109/CCGRID.2009.29},
  publisher = {IEEE Computer Society},
  address = {Los Alamitos, CA, USA},
}
RefWorks Procite/RefMan/Endnote:
TY  - CONF
JO  - Cluster Computing and the Grid, IEEE International Symposium on
TI  - Handling Persistent States in Process Checkpoint/Restart Mechanisms for HPC Systems
SN  - 978-0-7695-3622-4
SP  - 404
EP  - 411
A1  - Pierre Riteau
A1  - Adrien Lèbre
A1  - Christine Morin
PY  - 2009
KW  - Persistent state checkpointing
KW  - Process checkpoint/restart
KW  - Distributed file systems
KW  - Distributed architectures
KW  - High performance
VL  - 0
JA  - Cluster Computing and the Grid, IEEE International Symposium on
ER  -
Computer clusters are today the reference architecture for high-performance computing. The large number of nodes in these systems induces a high failure rate, which makes fault tolerance mechanisms such as process checkpoint/restart a requirement for exploiting clusters effectively. Most process checkpoint/restart implementations handle only volatile state and ignore the persistent state of applications, which can lead to inconsistent application restarts. In this paper, we introduce an efficient persistent state checkpoint/restoration approach that can be integrated with a wide range of file systems. To avoid the performance issues of stable storage that relies on synchronous replication, we present a failure resilience scheme optimized for such persistent state checkpointing techniques in a distributed environment. Initial evaluations of our implementation in the kDFS distributed file system show that our proposal has a negligible performance impact.
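The inconsistency described in the abstract can be illustrated with a small sketch. It is not the paper's mechanism and does not use kDFS: the function names, the log file, and the whole-file snapshot are assumptions made for illustration, and copying the whole file is a deliberately naive stand-in for an efficient persistent state checkpoint. The process appends numbered records to a file, is checkpointed, continues, then "fails" and restarts from the checkpoint.

```python
import os
import shutil
import tempfile


def append_records(path, start, count):
    """Append `count` numbered records and return the next record number."""
    with open(path, "a") as f:
        for i in range(start, start + count):
            f.write(f"record {i}\n")
    return start + count  # volatile state kept in memory


def checkpoint(path, next_record, snapshot_path=None):
    """Save volatile state; snapshot the file too if a path is given."""
    if snapshot_path is not None:
        shutil.copyfile(path, snapshot_path)  # naive persistent-state copy
    return {"next_record": next_record, "snapshot": snapshot_path}


def restart(path, ckpt):
    """Restore volatile state; roll the file back if a snapshot exists."""
    if ckpt["snapshot"] is not None:
        shutil.copyfile(ckpt["snapshot"], path)
    return ckpt["next_record"]


def run(with_persistent_state):
    """Write 3 records, checkpoint, write 2 more, fail, restart, redo."""
    workdir = tempfile.mkdtemp()
    try:
        data = os.path.join(workdir, "output.log")
        snap = os.path.join(workdir, "output.snap") if with_persistent_state else None
        nxt = append_records(data, 0, 3)
        ckpt = checkpoint(data, nxt, snap)
        append_records(data, nxt, 2)   # work done after the checkpoint
        nxt = restart(data, ckpt)      # simulated failure and restart
        append_records(data, nxt, 2)   # re-execution from the checkpoint
        with open(data) as f:
            return f.read().splitlines()
    finally:
        shutil.rmtree(workdir)


if __name__ == "__main__":
    print(run(with_persistent_state=False))  # records 3 and 4 appear twice
    print(run(with_persistent_state=True))   # records 0 to 4, once each
```

When only the volatile state is restored, the file still holds the records written after the checkpoint, so re-execution duplicates them. Rolling the file back together with the process state removes the mismatch.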
Index Terms:
Persistent state checkpointing, Process checkpoint/restart, Distributed file systems, Distributed architectures, High performance
Citation:
Pierre Riteau, Adrien Lèbre, Christine Morin, "Handling Persistent States in Process Checkpoint/Restart Mechanisms for HPC Systems," ccgrid, pp.404-411, 2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid, 2009
