23rd IEEE International Symposium on Reliable Distributed Systems (SRDS'04)
Skewed Checkpointing for Tolerating Multi-Node Failures
Florianópolis, Brazil
October 18-20, 2004
ISBN: 0-7695-2239-4
Hiroshi Nakamura, The University of Tokyo, Japan
Takuro Hayashida, The University of Tokyo, Japan
Masaaki Kondo, The University of Tokyo, Japan
Yuya Tajima, The University of Tokyo, Japan
Masashi Imai, The University of Tokyo, Japan
Takashi Nanya, The University of Tokyo, Japan
Large cluster systems have become widely used because they achieve a good performance/cost ratio, especially in high-performance computing. Although these cluster systems are distributed-memory systems, coordinated checkpointing is a promising way to maintain high availability because the computing nodes are tightly connected to one another. However, as the number of computing nodes grows, the probability of multi-node failures increases. Tolerating multi-node failures requires a large degree of redundancy in checkpointing, which degrades performance.
Thus, we propose a new coordinated checkpointing scheme called skewed checkpointing. In this method, the checkpointing is skewed each time a checkpoint is taken. Although each individual checkpoint contains only one degree of redundancy, skewed checkpointing ensures [log₂ N] degrees of redundancy when the number of nodes is N.
In this paper, we present the proposed method and an analysis of the performance overhead. Then, this method is applied to a cluster system and compared with other conventional checkpointing schemes. The results reveal the superiority of our method, especially for large cluster systems.
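The skewing idea in the abstract can be illustrated with a small brute-force check. This is only a sketch under assumptions the abstract does not state: at each checkpoint, node i is assumed to keep its own checkpoint locally and store one copy on node (i + skew) mod N; the skews are assumed to cycle through powers of two; and the most recent checkpoint for every skew is assumed to be retained. The function names are hypothetical. A failure set counts as recoverable if, for at least one skew, no failed node's copy sits on another failed node.

```python
from itertools import combinations


def round_survives(failed, skew, n):
    """True if every failed node's copy, stored at (i + skew) % n, is on a live node."""
    return all((i + skew) % n not in failed for i in failed)


def recoverable(failed, skews, n):
    """A failure set is recoverable if at least one retained checkpoint round survives."""
    return any(round_survives(failed, s, n) for s in skews)


def max_tolerated_failures(n, skews):
    """Largest f such that every set of f simultaneous node failures is recoverable."""
    tolerated = 0
    for f in range(1, n + 1):
        if all(recoverable(set(c), skews, n) for c in combinations(range(n), f)):
            tolerated = f
        else:
            break
    return tolerated


if __name__ == "__main__":
    n = 8
    # Single fixed skew: one degree of redundancy, as in plain neighbor mirroring.
    print(max_tolerated_failures(n, [1]))        # prints 1
    # Assumed skewed schedule 1, 2, 4 for N = 8.
    print(max_tolerated_failures(n, [1, 2, 4]))  # prints 3
```

For N = 8, the assumed schedule tolerates any three simultaneous failures, which matches log₂ 8 = 3, while each individual checkpoint still stores only one remote copy. The exhaustive search is exponential in N and is meant for small illustrative cases only; it does not model the performance overhead that the paper analyzes.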
Citation:
Hiroshi Nakamura, Takuro Hayashida, Masaaki Kondo, Yuya Tajima, Masashi Imai, Takashi Nanya, "Skewed Checkpointing for Tolerating Multi-Node Failures," srds, pp.116-125, 23rd IEEE International Symposium on Reliable Distributed Systems (SRDS'04), 2004