Multilevel Diskless Checkpointing
April 2013 (vol. 62 no. 4)
pp. 772-783
D. Hakkarinen, Dept. of Electr. Eng. & Comput. Sci., Colorado Sch. of Mines, Golden, CO, USA
Zizhong Chen, Dept. of Comput. Sci. & Eng., Univ. of California, Riverside, Riverside, CA, USA
Extreme scale systems available before the end of this decade are expected to have 100 million to 1 billion CPU cores. The probability that a failure occurs during an application execution is expected to be much higher than on today's systems. Counteracting this higher failure rate may require a combination of disk-based checkpointing, diskless checkpointing, and algorithmic fault tolerance. Diskless checkpointing is an efficient technique for tolerating a small number of process failures in large parallel and distributed systems. In the literature, a simultaneous failure of no more than N processes is often tolerated by using a one-level Reed-Solomon checkpointing scheme for N simultaneous process failures, whose overhead often increases quickly as N increases. We introduce an N-level diskless checkpointing scheme that reduces the overhead for tolerating a simultaneous failure of up to N processes. Each level is a diskless checkpointing scheme for a simultaneous failure of i processes, where i = 1, 2, ..., N. Simulation results indicate that the proposed N-level diskless checkpointing scheme achieves lower fault tolerance overhead than the one-level Reed-Solomon checkpointing scheme for N simultaneous processor failures.
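To make the idea of diskless checkpointing concrete, the following is a minimal sketch of the simplest case only: tolerating one failed process (i = 1) with XOR parity held in memory. It is not the Reed-Solomon encoding or the N-level scheme described in the abstract, and the function names (`xor_bytes`, `encode_parity`, `recover`) and the use of in-memory byte strings to stand in for process state are assumptions made purely for illustration.

```python
from functools import reduce
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length buffers."""
    if len(a) != len(b):
        raise ValueError("checkpoint buffers must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))


def encode_parity(checkpoints: List[bytes]) -> bytes:
    """Compute the parity a (hypothetical) checkpoint process keeps in memory."""
    return reduce(xor_bytes, checkpoints)


def recover(checkpoints: List[Optional[bytes]], parity: bytes) -> List[bytes]:
    """Rebuild the single lost checkpoint (marked None) from the survivors."""
    missing = [i for i, c in enumerate(checkpoints) if c is None]
    if len(missing) > 1:
        raise ValueError("XOR parity tolerates only one simultaneous failure")
    if not missing:
        return list(checkpoints)
    survivors = [c for c in checkpoints if c is not None]
    # XOR of the parity with every surviving checkpoint yields the lost one.
    rebuilt = reduce(xor_bytes, survivors, parity)
    restored = list(checkpoints)
    restored[missing[0]] = rebuilt
    return restored


if __name__ == "__main__":
    # Hypothetical in-memory state of four application processes.
    states = [b"aaaa", b"bbbb", b"cccc", b"dddd"]
    parity = encode_parity(states)

    # Simulate the failure of process 2 and rebuild its state without disk I/O.
    damaged: List[Optional[bytes]] = list(states)
    damaged[2] = None
    print(recover(damaged, parity) == states)
```

A single parity buffer cannot recover two simultaneous losses, which is why schemes for N > 1 failures turn to stronger encodings such as the Reed-Solomon codes named in the abstract.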
Index Terms:
software fault tolerance, checkpointing, parallel processing, one-level Reed-Solomon checkpointing scheme, multilevel diskless checkpointing, extreme scale systems, disk-based checkpointing, algorithmic fault tolerance, parallel systems, distributed systems, encoding, fault tolerance, fault tolerant systems, schedules, Reed-Solomon codes, runtime, diskless checkpointing, high-performance computing, checkpoint
D. Hakkarinen, Zizhong Chen, "Multilevel Diskless Checkpointing," IEEE Transactions on Computers, vol. 62, no. 4, pp. 772-783, April 2013, doi:10.1109/TC.2012.17