This Article 
   
 Share 
   
 Bibliographic References 
   
 Add to: 
 
Digg
Furl
Spurl
Blink
Simpy
Google
Del.icio.us
Y!MyWeb
 
 Search 
   
On Coordinated Checkpointing in Distributed Systems
December 1998 (vol. 9 no. 12)
pp. 1213-1225

Abstract—Coordinated checkpointing simplifies failure recovery and eliminates domino effects in case of failures by preserving a consistent global checkpoint on stable storage. However, the approach suffers from high overhead associated with the checkpointing process. Two approaches are used to reduce the overhead: First is to minimize the number of synchronization messages and the number of checkpoints, the other is to make the checkpointing process nonblocking. These two approaches were orthogonal in previous years until the Prakash-Singhal algorithm [18] combined them. In other words, the Prakash-Singhal algorithm forces only a minimum number of processes to take checkpoints and it does not block the underlying computation. However, we found two problems in this algorithm. In this paper, we identify these problems and prove a more general result: There does not exist a nonblocking algorithm that forces only a minimum number of processes to take their checkpoints. Based on this general result, we propose an efficient algorithm that neither forces all processes to take checkpoints nor blocks the underlying computation during checkpointing. Also, we point out future research directions in designing coordinated checkpointing algorithms for distributed computing systems.

[1] A. Acharya and B.R. Badrinath, "Checkpointing Distributed Applications on Mobil Computers," Proc. Third Int'l Conf. Parallel and Distributed Information Systems, Sept. 1994.
[2] G. Barigazzi and L. Strigini, "Application-Transparent Setting of Recovery Points," Digest of Papers, Proc. 13th Fault Tolerant Computing Symp. (FTCS-13), pp. 48-55, 1983.
[3] B. Bhargava, S.R. Lian, and P.J. Leu, "Experimental Evaluation of Concurrent Checkpointing and Rollback-Recovery Algorithms," Proc. Int'l Conf. Data Eng., pp. 182-189, 1990.
[4] K.M. Chandy and L. Lamport, "Distributed Snapshots: Determining Global States of Distributed Systems," ACM Trans. Computer Systems, Feb. 1985.
[5] F. Cristian and F. Jahanian, "A Timestamp-Based Checkpointing Protocol for Long-Lived Distributed Computations," Proc. IEEE Symp. Reliable Distributed Systems, pp. 12-20, 1991.
[6] Y. Deng and E.K. Park, "Checkpointing and Rollback-Recovery Algorithms in Distributed Systems," J. Systems and Software, pp. 59-71, Apr. 1994.
[7] E.N. Elnozahy, D.B. Johnson, and W. Zwaenepoel, "The Performance of Consistent Checkpointing," Proc. 11th Symp. Reliable Distributed Systems, pp. 86-95, Oct. 1992.
[8] S.T. Huang, "Detecting Termination of Distributed Computations by External Agents," Proc. Ninth Int'l Conf. Distributed Computing Systems, pp. 79-84, 1989.
[9] J.L. Kim and T. Park, "An Efficient Protocol For Checkpointing Recovery in Distributed Systems," IEEE Trans. Parallel and Distributed Systems, vol. 5, no. 8, pp. 955-960, Aug. 1993.
[10] R. Koo and S. Toueg, "Checkpointing and Rollback-Recovery for Distributed Systems," IEEE Trans. Software Eng., vol. 13, no. 1, pp. 23-31, Jan. 1987.
[11] T.H. Lai and T.H. Yang, "On Distributed Snapshots," Information Processing Letters, pp. 153-158, May 1987.
[12] L. Lamport, "Time, clocks and the ordering of events in a distributed system," Comm. ACM, vol. 21, no. 7, pp. 558-565, July 1978.
[13] P.Y. Leu and B. Bhargava, "Concurrent Robust Checkpointing and Recovery in Distributed Systems," Proc. Fourth IEEE Int'l Conf. Data Eng., pp. 154-163, 1988.
[14] D. Manivannan, R. Netzer, and Mukesh Singhal, "Finding Consistent Global Checkpoints in a Distributed Computation," IEEE Trans. Parallel and Distributed Systems, vol. 8, no. 6, pp. 623-627, June 1997.
[15] G. Muller, M. Hue, and N. Peyrouz, "Performance of Consistent Checkpointing in a Modular Operating System: Results of the FTM Experiment," Lecture Notes in Computer Science: Proc. First European Conf. Dependable Computing (EDCC-1), pp. 491-508, Oct. 1994.
[16] R.H.B. Netzer and J. Xu, "Necessary and Sufficient Conditions for Consistent Global Snapshots," IEEE Trans. Parallel and Distributed System, vol. 6, no. 2, pp. 165-169, Feb. 1995.
[17] R. Prakash and M. Singhal, "Maximal Global Snapshot with Concurrent Initiators," Proc. Sixth IEEE Symp. Parallel and Distributed Processing, pp. 344-351, Oct. 1994.
[18] R. Prakash and M. Singhal, "Low-Cost Checkpointing and Failure Recovery in Mobile Computing Systems," IEEE Trans. Parallel and Distributed System, vol. 7, no. 10, pp. 1,035-1,048, Oct. 1996.
[19] P. Ramanathan and K.G. Shin, "Use of Common Time Base for Checkpointing and Rollback Recovery in a Distributed System," IEEE Trans. Software Eng., vol. 19, no. 6, pp. 571-583, June 1993.
[20] L.M. Silva and J.G. Silva, "Global Checkpointing for Distributed Programs," Proc. 11th Symp. Reliable Distributed Systems, pp. 155-162, Oct. 1992.
[21] M. Spezialetti and P. Kearns, "Efficient Distributed Snapshots," Proc. Sixth Int'l Conf. Distributed Computing Systems, pp. 382-388, 1986.
[22] R.E. Strom and S.A. Yemini, "Optimistic Recovery in Distributed Systems," ACM Trans. Computer Systems, vol. 3, no. 3, pp. 204-226, Aug. 1985.
[23] Y. Wang, "Maximum and Minimum Consistent Global Checkpoints and Their Application," Proc. 14th IEEE Symp. Reliable Distributed Systems, pp. 86-95, Oct. 1995.
[24] Y. Wang, "Consistent Global Checkpoints that Contain a Given Set of Local Checkpoints," IEEE Trans. Computers, vol. 46, no. 4, pp. 456-468, Apr. 1997.
[25] Z. Wojcik and B.E. Wojcik, "Fault Tolerant Distributed Computing Using Atomic Send Receive Checkpoints," Proc. Second IEEE Symp. Parallel and Distributed Processing, pp. 215-222, 1990.

Index Terms:
Distributed system, coordinated checkpointing, causal dependence, nonblocking, consistent checkpoints.
Citation:
Guohong Cao, Mukesh Singhal, "On Coordinated Checkpointing in Distributed Systems," IEEE Transactions on Parallel and Distributed Systems, vol. 9, no. 12, pp. 1213-1225, Dec. 1998, doi:10.1109/71.737697
Usage of this product signifies your acceptance of the Terms of Use.