
Guohong Cao, Mukesh Singhal, "On Coordinated Checkpointing in Distributed Systems," IEEE Transactions on Parallel and Distributed Systems, vol. 9, no. 12, pp. 1213-1225, December 1998. doi: 10.1109/71.737697
Keywords: distributed system, coordinated checkpointing, causal dependence, nonblocking, consistent checkpoints.
Abstract—Coordinated checkpointing simplifies failure recovery and eliminates domino effects in case of failures by preserving a consistent global checkpoint on stable storage. However, the approach suffers from the high overhead associated with the checkpointing process. Two approaches are used to reduce this overhead: the first is to minimize the number of synchronization messages and the number of checkpoints; the second is to make the checkpointing process nonblocking. These two approaches were treated as orthogonal until the Prakash-Singhal algorithm [18] combined them. In other words, the Prakash-Singhal algorithm forces only a minimum number of processes to take checkpoints, and it does not block the underlying computation. However, we found two problems in this algorithm. In this paper, we identify these problems and prove a more general result: there does not exist a nonblocking algorithm that forces only a minimum number of processes to take their checkpoints. Based on this general result, we propose an efficient algorithm that neither forces all processes to take checkpoints nor blocks the underlying computation during checkpointing. We also point out future research directions in designing coordinated checkpointing algorithms for distributed computing systems.
[1] A. Acharya and B.R. Badrinath, "Checkpointing Distributed Applications on Mobile Computers," Proc. Third Int'l Conf. Parallel and Distributed Information Systems, Sept. 1994.
[2] G. Barigazzi and L. Strigini, "Application-Transparent Setting of Recovery Points," Digest of Papers, Proc. 13th Fault Tolerant Computing Symp. (FTCS-13), pp. 48-55, 1983.
[3] B. Bhargava, S.R. Lian, and P.J. Leu, "Experimental Evaluation of Concurrent Checkpointing and Rollback-Recovery Algorithms," Proc. Int'l Conf. Data Eng., pp. 182-189, 1990.
[4] K.M. Chandy and L. Lamport, "Distributed Snapshots: Determining Global States of Distributed Systems," ACM Trans. Computer Systems, Feb. 1985.
[5] F. Cristian and F. Jahanian, "A Timestamp-Based Checkpointing Protocol for Long-Lived Distributed Computations," Proc. IEEE Symp. Reliable Distributed Systems, pp. 12-20, 1991.
[6] Y. Deng and E.K. Park, "Checkpointing and Rollback-Recovery Algorithms in Distributed Systems," J. Systems and Software, pp. 59-71, Apr. 1994.
[7] E.N. Elnozahy, D.B. Johnson, and W. Zwaenepoel, "The Performance of Consistent Checkpointing," Proc. 11th Symp. Reliable Distributed Systems, pp. 86-95, Oct. 1992.
[8] S.T. Huang, "Detecting Termination of Distributed Computations by External Agents," Proc. Ninth Int'l Conf. Distributed Computing Systems, pp. 79-84, 1989.
[9] J.L. Kim and T. Park, "An Efficient Protocol for Checkpointing Recovery in Distributed Systems," IEEE Trans. Parallel and Distributed Systems, vol. 5, no. 8, pp. 955-960, Aug. 1993.
[10] R. Koo and S. Toueg, "Checkpointing and Rollback-Recovery for Distributed Systems," IEEE Trans. Software Eng., vol. 13, no. 1, pp. 23-31, Jan. 1987.
[11] T.H. Lai and T.H. Yang, "On Distributed Snapshots," Information Processing Letters, pp. 153-158, May 1987.
[12] L. Lamport, "Time, Clocks and the Ordering of Events in a Distributed System," Comm. ACM, vol. 21, no. 7, pp. 558-565, July 1978.
[13] P.Y. Leu and B. Bhargava, "Concurrent Robust Checkpointing and Recovery in Distributed Systems," Proc. Fourth IEEE Int'l Conf. Data Eng., pp. 154-163, 1988.
[14] D. Manivannan, R. Netzer, and M. Singhal, "Finding Consistent Global Checkpoints in a Distributed Computation," IEEE Trans. Parallel and Distributed Systems, vol. 8, no. 6, pp. 623-627, June 1997.
[15] G. Muller, M. Hue, and N. Peyrouz, "Performance of Consistent Checkpointing in a Modular Operating System: Results of the FTM Experiment," Lecture Notes in Computer Science: Proc. First European Conf. Dependable Computing (EDCC-1), pp. 491-508, Oct. 1994.
[16] R.H.B. Netzer and J. Xu, "Necessary and Sufficient Conditions for Consistent Global Snapshots," IEEE Trans. Parallel and Distributed Systems, vol. 6, no. 2, pp. 165-169, Feb. 1995.
[17] R. Prakash and M. Singhal, "Maximal Global Snapshot with Concurrent Initiators," Proc. Sixth IEEE Symp. Parallel and Distributed Processing, pp. 344-351, Oct. 1994.
[18] R. Prakash and M. Singhal, "Low-Cost Checkpointing and Failure Recovery in Mobile Computing Systems," IEEE Trans. Parallel and Distributed Systems, vol. 7, no. 10, pp. 1,035-1,048, Oct. 1996.
[19] P. Ramanathan and K.G. Shin, "Use of Common Time Base for Checkpointing and Rollback Recovery in a Distributed System," IEEE Trans. Software Eng., vol. 19, no. 6, pp. 571-583, June 1993.
[20] L.M. Silva and J.G. Silva, "Global Checkpointing for Distributed Programs," Proc. 11th Symp. Reliable Distributed Systems, pp. 155-162, Oct. 1992.
[21] M. Spezialetti and P. Kearns, "Efficient Distributed Snapshots," Proc. Sixth Int'l Conf. Distributed Computing Systems, pp. 382-388, 1986.
[22] R.E. Strom and S.A. Yemini, "Optimistic Recovery in Distributed Systems," ACM Trans. Computer Systems, vol. 3, no. 3, pp. 204-226, Aug. 1985.
[23] Y. Wang, "Maximum and Minimum Consistent Global Checkpoints and Their Application," Proc. 14th IEEE Symp. Reliable Distributed Systems, pp. 86-95, Oct. 1995.
[24] Y. Wang, "Consistent Global Checkpoints that Contain a Given Set of Local Checkpoints," IEEE Trans. Computers, vol. 46, no. 4, pp. 456-468, Apr. 1997.
[25] Z. Wojcik and B.E. Wojcik, "Fault Tolerant Distributed Computing Using Atomic Send Receive Checkpoints," Proc. Second IEEE Symp. Parallel and Distributed Processing, pp. 215-222, 1990.