Issue No.04 - April (2013 vol.62)
pp: 772-783
D. Hakkarinen, Dept. of Electr. Eng. & Comput. Sci., Colorado Sch. of Mines, Golden, CO, USA
Zizhong Chen, Dept. of Comput. Sci. & Eng., Univ. of California, Riverside, Riverside, CA, USA
ABSTRACT
Extreme scale systems available before the end of this decade are expected to have 100 million to 1 billion CPU cores. The probability that a failure occurs during an application's execution is expected to be much higher than on today's systems. Counteracting this higher failure rate may require a combination of disk-based checkpointing, diskless checkpointing, and algorithmic fault tolerance. Diskless checkpointing is an efficient technique for tolerating a small number of process failures in large parallel and distributed systems. In the literature, a simultaneous failure of up to N processes is often tolerated with a one-level Reed-Solomon checkpointing scheme designed for N simultaneous process failures, whose overhead often increases quickly as N increases. We introduce an N-level diskless checkpointing scheme that reduces the overhead of tolerating a simultaneous failure of up to N processes. The i-th level is a diskless checkpointing scheme for a simultaneous failure of i processes, where i = 1, 2, ..., N. Simulation results indicate that the proposed N-level diskless checkpointing scheme achieves lower fault tolerance overhead than the one-level Reed-Solomon checkpointing scheme for N simultaneous process failures.
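As a rough illustration of what a single level for i = 1 could look like, the sketch below uses XOR parity, the classic encoding for diskless checkpointing that tolerates one process failure. The abstract does not say which encoding each level uses, so the XOR choice, the function names, and the toy byte-string "process states" are all assumptions made for this example, not the authors' implementation.

```python
from __future__ import annotations

from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode_parity(checkpoints: list[bytes]) -> bytes:
    """Build the parity checkpoint held by a dedicated checkpoint process.

    Assumes every in-memory checkpoint has the same length.
    """
    return reduce(xor_bytes, checkpoints)


def recover(survivors: list[bytes], parity: bytes) -> bytes:
    """Rebuild the single lost checkpoint from the survivors and the parity."""
    return reduce(xor_bytes, survivors, parity)


# Hypothetical in-memory checkpoints of four application processes.
states = [b"proc0---", b"proc1---", b"proc2---", b"proc3---"]
parity = encode_parity(states)

# Simulate the failure of process 2 and recover its state.
lost = 2
survivors = [s for i, s in enumerate(states) if i != lost]
recovered = recover(survivors, parity)
```

A single parity checkpoint can rebuild only one lost state; surviving two or more simultaneous failures needs a stronger code such as Reed-Solomon, which is where the overhead growth described in the abstract arises.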
INDEX TERMS
software fault tolerance, checkpointing, parallel processing, one-level Reed-Solomon checkpointing scheme, multilevel diskless checkpointing, extreme scale systems, disk-based checkpointing, algorithmic fault tolerance, parallel systems, distributed systems, Encoding, Fault tolerance, Fault tolerant systems, Schedules, Reed-Solomon codes, Runtime, diskless checkpointing, high-performance computing, checkpoint
CITATION
D. Hakkarinen, Zizhong Chen, "Multilevel Diskless Checkpointing," IEEE Transactions on Computers, vol. 62, no. 4, pp. 772-783, April 2013, doi:10.1109/TC.2012.17