loading...
 This Article 
   
 Share 
   
 Bibliographic References 
   
 Add to: 
 
Digg
Furl
Spurl
Blink
Simpy
Google
Del.icio.us
Y!MyWeb
 
 Search 
   
2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID)
Reliability-Aware Approach: An Incremental Checkpoint/Restart Model in HPC Environments
May 19-May 22
ISBN: 978-0-7695-3156-4
For full checkpoint on a large-scale HPC system, huge memory contexts must potentially be transferred through the network and saved in a reliable storage. As such, the time taken to checkpoint becomes a critical issue which directly impacts the total execution time. Therefore, incremental checkpoint as a less intrusive method to reduce the waste time has been gaining significant attentions in the HPC community. In this paper, we built a model that aims to reduce full checkpoint overhead by performing a set of incremental checkpoints between two consecutive full checkpoints. Moreover, a method to find the number of those incremental checkpoints is given. Furthermore, most of the comparison results between the incremental checkpoint model and the full checkpoint model [19] on the same failure data set show that the total waste time in the incremental checkpoint model is significantly smaller than the waste time in the full checkpoint model.
Citation:
Nichamon Naksinehaboon, Yudan Liu, Chokchai (Box) Leangsuksun, Raja Nassar, Mihaela Paun, Stephen L. Scott, "Reliability-Aware Approach: An Incremental Checkpoint/Restart Model in HPC Environments," ccgrid, pp.783-788, 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID), 2008
Usage of this product signifies your acceptance of the Terms of Use.