20th IEEE International Conference on Distributed Computing Systems (ICDCS'00)
Coherence-Based Coordinated Checkpointing for Software Distributed Shared Memory Systems
Taipei, Taiwan
April 10-13, 2000
ISBN: 0-7695-0601-1
Angkul Kongmunvattana, University of Louisiana at Lafayette
Santipong Tanchatchawal, University of Louisiana at Lafayette
Nian-Feng Tzeng, University of Louisiana at Lafayette
Fault-tolerant techniques that can cope with system failures in software distributed shared memory (SDSM) are essential for creating productive and highly available parallel computing environments on clusters of workstations. In this paper, we propose a new, efficient coordinated checkpointing technique for SDSM, called coherence-based coordinated checkpointing (CCC). CCC minimizes both the checkpointing overhead during failure-free execution and the cost of recovery from failures by leveraging the coherence information that SDSM already maintains. When a system failure occurs, CCC allows SDSM to recover from the most recent checkpoint, which reduces re-computation time.
We have performed experiments on a cluster of eight Sun Ultra-5 workstations, comparing CCC against both simple coordinated checkpointing (SCC) and incremental coordinated checkpointing (ICC) by implementing these techniques in TreadMarks, a state-of-the-art SDSM system. The experimental results demonstrate that CCC consistently outperforms both SCC and ICC. In particular, with a 2-minute checkpointing interval during failure-free execution, CCC increases execution time by only 0.5% to 4%, while SCC and ICC incur execution time overheads of 4% to 100% and 3% to 64%, respectively.
Citation:
Angkul Kongmunvattana, Santipong Tanchatchawal, Nian-Feng Tzeng, "Coherence-Based Coordinated Checkpointing for Software Distributed Shared Memory Systems," ICDCS, pp. 556, 20th IEEE International Conference on Distributed Computing Systems (ICDCS'00), 2000