A New Diskless Checkpointing Approach for Multiple Processor Failures
July/August 2011 (vol. 8 no. 4)
pp. 481-493
Ge-Ming Chiu, National Taiwan University of Science and Technology, Taipei
Jane-Ferng Chiu, Tungnan University, Taipei
Diskless checkpointing is an important technique for providing fault tolerance in distributed and parallel computing systems. This study proposes a new approach that enhances neighbor-based diskless checkpointing to tolerate multiple failures using simple checkpointing and failure recovery operations, without relying on dedicated checkpoint processors. In this scheme, each processor saves its checkpoints in a set of peer processors, called checkpoint storage nodes. In return, each processor uses simple XOR operations to store a collection of checkpoints from the processors for which it serves as a checkpoint storage node. This study defines the concept of a safe recovery criterion, which specifies the requirement for ensuring that any failed processor can be recovered in a single step using the checkpoint data stored at one of the surviving processors, as long as no more than a given number of failures occur. The study further identifies the necessary and sufficient conditions for satisfying the safe recovery criterion and presents a method for designing checkpoint storage node sets that meet these conditions. The proposed scheme allows failure recovery to be performed in a distributed manner using XOR operations.
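
The XOR idea behind the scheme can be illustrated with a minimal Python sketch. This is not the authors' algorithm: it only shows generic XOR parity, in which one storage node keeps the XOR of several peers' checkpoints and a single lost checkpoint is rebuilt from that parity plus the surviving checkpoints. The function names (`xor_bytes`, `encode`, `recover`), the processor numbering, and the equal-length checkpoint contents are all assumptions made for illustration.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length buffers (equal length is assumed)."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(checkpoints: dict, clients: list) -> bytes:
    """XOR together the checkpoints of the processors served by one storage node."""
    return reduce(xor_bytes, (checkpoints[p] for p in clients))


def recover(failed: int, parity: bytes, checkpoints: dict, clients: list) -> bytes:
    """Rebuild the failed processor's checkpoint from the parity and the survivors."""
    survivors = [checkpoints[p] for p in clients if p != failed]
    return reduce(xor_bytes, survivors, parity)


# Hypothetical setup: processor 3 acts as a storage node for processors 0, 1 and 2.
checkpoints = {0: b"ckpt-A..", 1: b"ckpt-B..", 2: b"ckpt-C.."}
clients_of_3 = [0, 1, 2]
parity_at_3 = encode(checkpoints, clients_of_3)

# Processor 1 fails and loses its checkpoint; processor 3's parity restores it.
lost = checkpoints.pop(1)
restored = recover(1, parity_at_3, checkpoints, clients_of_3)
assert restored == lost
```

A single XOR parity recovers only one missing member of its group. Tolerating multiple failures, as the abstract describes, depends on how the checkpoint storage node sets are chosen, which this sketch does not attempt to model.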

Index Terms:
Diskless checkpointing, multiple failures, rollback recovery, XOR.
Ge-Ming Chiu, Jane-Ferng Chiu, "A New Diskless Checkpointing Approach for Multiple Processor Failures," IEEE Transactions on Dependable and Secure Computing, vol. 8, no. 4, pp. 481-493, July-Aug. 2011, doi:10.1109/TDSC.2010.76