23rd IEEE International Symposium on Reliable Distributed Systems (SRDS'04)
Skewed Checkpointing for Tolerating Multi-Node Failures
Florianópolis, Brazil, October 18-20, 2004
ISBN: 0-7695-2239-4
Large cluster systems are widely used because they achieve a good performance/cost ratio, especially in high-performance computing. Although these clusters are distributed-memory systems, coordinated checkpointing is a promising way to maintain high availability because the computing nodes are tightly connected to one another. However, as the number of computing nodes grows, the probability of multi-node failures increases. Tolerating multi-node failures requires a large degree of redundancy in checkpointing, but this degrades performance.

Thus, we propose a new coordinated checkpointing method called skewed checkpointing. In this method, checkpointing is skewed every time. Although each checkpoint itself contains only one degree of redundancy, skewed checkpointing ensures [log₂ N] degrees of redundancy when the number of nodes is N.

In this paper, we present the proposed method and an analysis of its performance overhead. The method is then applied to a cluster system and compared with conventional checkpointing schemes. The results reveal the superiority of our method, especially for large cluster systems.
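The claim that multi-node failures become more likely as the cluster grows can be illustrated with a small calculation. The sketch below is not taken from the paper: it assumes, purely for illustration, that each node fails independently with the same probability `p_fail` during one checkpoint interval, and the function name and the sample values are hypothetical.

```python
from math import comb


def prob_at_least_k_failures(n_nodes: int, p_fail: float, k: int = 2) -> float:
    """Probability that at least k of n_nodes fail in one interval.

    Assumption (not from the paper): failures are independent and each
    node fails with the same probability p_fail, so the number of failed
    nodes follows a binomial distribution.
    """
    # Probability that fewer than k nodes fail (0, 1, ..., k-1 failures).
    fewer_than_k = sum(
        comb(n_nodes, j) * p_fail**j * (1.0 - p_fail) ** (n_nodes - j)
        for j in range(k)
    )
    return 1.0 - fewer_than_k


# Hypothetical per-node failure probability; cluster sizes chosen for illustration.
for n in (16, 256, 4096):
    print(n, prob_at_least_k_failures(n, p_fail=0.001))
```

Under this simplified model the probability of two or more simultaneous failures rises steadily with the node count, which is the motivation for wanting more than one degree of redundancy on large clusters.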
Citation:
Hiroshi Nakamura, Takuro Hayashida, Masaaki Kondo, Yuya Tajima, Masashi Imai, Takashi Nanya, "Skewed Checkpointing for Tolerating Multi-Node Failures," pp. 116-125, 23rd IEEE International Symposium on Reliable Distributed Systems (SRDS'04), 2004.