Issue No. 04 - April (2013 vol. 62)
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TC.2012.17
D. Hakkarinen, Dept. of Electr. Eng. & Comput. Sci., Colorado Sch. of Mines, Golden, CO, USA
Zizhong Chen, Dept. of Comput. Sci. & Eng., Univ. of California, Riverside, Riverside, CA, USA
Extreme scale systems available before the end of this decade are expected to have 100 million to 1 billion CPU cores. The probability that a failure occurs during an application execution is expected to be much higher than in today's systems. Counteracting this higher failure rate may require a combination of disk-based checkpointing, diskless checkpointing, and algorithmic fault tolerance. Diskless checkpointing is an efficient technique to tolerate a small number of process failures in large parallel and distributed systems. In the literature, a simultaneous failure of no more than N processes is often tolerated by using a one-level Reed-Solomon checkpointing scheme for N simultaneous process failures, whose overhead often increases quickly as N increases. We introduce an N-level diskless checkpointing scheme that reduces the overhead for tolerating a simultaneous failure of up to N processes. Each level is a diskless checkpointing scheme for a simultaneous failure of i processes, where i = 1, 2, ..., N. Simulation results indicate that the proposed N-level diskless checkpointing scheme achieves lower fault tolerance overhead than the one-level Reed-Solomon checkpointing scheme for N simultaneous processor failures.
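The abstract does not give the encoding used at each level, so the following is only a minimal sketch of the simplest case (i = 1): a parity-based diskless checkpoint that tolerates one process failure. It assumes equal-length in-memory checkpoints and uses XOR parity, a standard single-failure technique, rather than the authors' scheme or the Reed-Solomon encoding they compare against. The function names `encode_parity` and `recover` are hypothetical.

```python
from functools import reduce
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length buffers."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode_parity(checkpoints: List[bytes]) -> bytes:
    """Build the parity checkpoint held in memory by a spare process.

    Assumption: every process checkpoint has the same length.
    """
    return reduce(xor_bytes, checkpoints)


def recover(checkpoints: List[Optional[bytes]], parity: bytes) -> bytes:
    """Rebuild the single missing checkpoint (marked None) from the survivors."""
    survivors = [c for c in checkpoints if c is not None]
    if len(survivors) != len(checkpoints) - 1:
        raise ValueError("XOR parity tolerates exactly one lost checkpoint")
    return reduce(xor_bytes, survivors, parity)


# Four processes, each holding an 8-byte in-memory checkpoint (illustrative data).
states = [b"proc0-st", b"proc1-st", b"proc2-st", b"proc3-st"]
parity = encode_parity(states)

# Simulate the failure of process 2 and restore its state without touching disk.
damaged: List[Optional[bytes]] = list(states)
damaged[2] = None
restored = recover(damaged, parity)
```

Tolerating more than one simultaneous failure requires a stronger code than XOR parity, which is where the higher levels (i = 2, ..., N) described in the abstract come in.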
software fault tolerance, checkpointing, parallel processing, one-level Reed-Solomon checkpointing scheme, multilevel diskless checkpointing, extreme scale systems, disk-based checkpointing, algorithmic fault tolerance, parallel systems, distributed systems, encoding, fault tolerance, fault tolerant systems, schedules, Reed-Solomon codes, runtime, diskless checkpointing, high-performance computing, checkpoint
D. Hakkarinen, Zizhong Chen, "Multilevel Diskless Checkpointing", IEEE Transactions on Computers, vol. 62, no. 4, pp. 772-783, April 2013, doi:10.1109/TC.2012.17