Finding Consistent Global Checkpoints in a Distributed Computation
June 1997 (vol. 8 no. 6)
pp. 623-627

Abstract—Consistent global checkpoints have many uses in distributed computations. A central question in applications that use consistent global checkpoints is to determine whether a consistent global checkpoint that includes a given set of local checkpoints can exist. Netzer and Xu [16] presented the necessary and sufficient conditions under which such a consistent global checkpoint can exist, but they did not explore what checkpoints could be constructed. In this paper, we prove exactly which local checkpoints can be used for constructing such consistent global checkpoints. We illustrate the use of our results with a simple and elegant algorithm to enumerate all such consistent global checkpoints.
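As a rough illustration of the question posed above, the following sketch checks whether a chosen set of local checkpoints (one per process) is consistent, and enumerates by brute force all consistent global checkpoints that contain a given subset. It is not the algorithm from the paper, which avoids exhaustive search. The names `Message`, `is_consistent`, and `consistent_cuts_containing`, and the encoding of checkpoint intervals as integers, are assumptions made for this example. The sketch relies only on the standard definition: a global checkpoint is consistent when no message is recorded as received but not yet sent.

```python
from itertools import product
from typing import NamedTuple


class Message(NamedTuple):
    """Hypothetical message record. Interval i on a process is the span
    after its checkpoint i and before its checkpoint i + 1."""
    sender: str
    send_interval: int
    receiver: str
    recv_interval: int


def is_consistent(cut, messages):
    """cut maps each process to the index of its chosen local checkpoint.
    The cut is inconsistent if some message is sent after the sender's
    checkpoint but received before the receiver's checkpoint."""
    for m in messages:
        sent_after = m.send_interval >= cut[m.sender]
        received_before = m.recv_interval < cut[m.receiver]
        if sent_after and received_before:
            return False
    return True


def consistent_cuts_containing(fixed, counts, messages):
    """Brute-force enumeration (an assumption of this sketch, not the
    paper's method). fixed maps some processes to required checkpoint
    indices; counts maps every process to its number of checkpoints."""
    procs = sorted(counts)
    choices = [[fixed[p]] if p in fixed else range(counts[p]) for p in procs]
    result = []
    for combo in product(*choices):
        cut = dict(zip(procs, combo))
        if is_consistent(cut, messages):
            result.append(cut)
    return result


# P sends a message after its checkpoint 1; Q receives it before its checkpoint 1.
messages = [Message("P", 1, "Q", 0)]
print(consistent_cuts_containing({"Q": 1}, {"P": 3, "Q": 3}, messages))
# prints [{'P': 2, 'Q': 1}]
```

In this toy run, Q's checkpoint 1 already records the receipt of the message, so only P's checkpoint 2, taken after the send, can be paired with it.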

[1] O. Babaoglu and K. Marzullo, "Consistent Global States of Distributed Systems: Fundamental Concepts and Mechanisms," Distributed Systems, S. J. Mullender, ed., pp. 55-96. Addison-Wesley, 1993.
[2] R. Baldoni, J.M. Helary, A. Mostefaoui, and M. Raynal, "Characterizing Consistent Checkpoints in Large-Scale Distributed Systems," Proc. Fifth IEEE Int'l Conf. Parallel and Distributed Computing, pp. 314-323, Chejiu Islands, South Korea, Aug. 1995.
[3] R. Baldoni, J.M. Helary, A. Mostefaoui, and M. Raynal, "Consistent Checkpoints in Message Passing Distributed Systems," Rapport de Recherche No. 2564, INRIA, France, June 1995.
[4] R. Baldoni, J.M. Helary, A. Mostefaoui, and M. Raynal, "On Modeling Consistent Checkpoints and the Domino Effect in Distributed Systems," Rapport de Recherche No. 2569, INRIA, France, June 1995.
[5] R. Baldoni, J.M. Helary, and M. Raynal, "About Recording in Asynchronous Computations," Proc. 15th ACM Symp. Principles of Distributed Computing, p. 55, May 1996.
[6] K.M. Chandy and L. Lamport, "Distributed Snapshots: Determining Global States of Distributed Systems," ACM Trans. Computer Systems, Feb. 1985.
[7] R. Cooper and K. Marzullo, "Consistent Detection of Global Predicates," Proc. Workshop Parallel and Distributed Debugging, pp. 163-173, ACM Press, New York.
[8] E. Fromentin, N. Plouzeau, and M. Raynal, "An Introduction to the Analysis and Debug of Distributed Computations," Proc. First IEEE Int'l Conf. Algorithms and Architectures for Parallel Processing, pp. 545-554, Brisbane, Australia, Apr. 1995.
[9] K. Geihs and M. Seifert, "Automated Validation of a Co-operation Protocol for Distributed Systems," Proc. Sixth Int'l Conf. Distributed Computing Systems, pp. 436-443, 1986.
[10] R. Koo and S. Toueg, "Checkpointing and Rollback-Recovery for Distributed Systems," IEEE Trans. Software Eng., vol. 13, no. 1, pp. 23-31, Jan. 1987.
[11] A.D. Kshemkalyani, M. Raynal, and M. Singhal, "An Introduction to Snapshot Algorithms in Distributed Computing," Distributed Systems Eng. J., vol. 2, no. 4, pp. 224-233, Dec. 1995.
[12] A.D. Kshemkalyani and M. Singhal, "Efficient Detection and Resolution of Generalized Distributed Deadlocks," IEEE Trans. Software Eng., vol. 20, no. 1, pp. 43-54, Jan. 1994.
[13] L. Lamport, "Time, Clocks, and the Ordering of Events in a Distributed System," Comm. ACM, vol. 21, no. 7, pp. 558-565, July 1978.
[14] L. Lamport, "The Mutual Exclusion Problem: Part I—A Theory of Interprocess Communication," J. ACM, vol. 33, no. 2, pp. 313-326, Apr. 1986.
[15] B.P. Miller and J. Choi, "Breakpoints and Halting in Distributed Programs," Proc. Int'l Conf. Distributed Computing Systems, pp. 316-323, IEEE CS Press, 1988.
[16] R.H.B. Netzer and J. Xu, "Necessary and Sufficient Conditions for Consistent Global Snapshots," IEEE Trans. Parallel and Distributed Systems, vol. 6, no. 2, pp. 165-169, Feb. 1995.
[17] M. Spezialetti and P. Kearns, "Simultaneous Regions: A Framework for the Consistent Monitoring of Distributed Systems," Proc. Ninth Int'l Conf. Distributed Computing Systems, pp. 61-68, 1989.
[18] Y. Wang, "Maximum and Minimum Consistent Global Checkpoints and Their Application," Proc. 14th IEEE Symp. Reliable Distributed Systems, pp. 86-95, Oct. 1995.
[19] Y. Wang, "Consistent Global Checkpoints that Contain a Given Set of Local Checkpoints," IEEE Trans. Computers, vol. 46, no. 4, pp. 456-468, Apr. 1997.
[20] J. Xu and R.H.B. Netzer, "Adaptive Independent Checkpointing for Reducing Rollback Propagation," Proc. IEEE Parallel and Distributed Processing Symp., pp. 754-761, Dec. 1993.

Index Terms:
Causality, distributed checkpointing, consistent global states, failure recovery, fault tolerance.
Citation:
D. Manivannan, Robert H. B. Netzer, Mukesh Singhal, "Finding Consistent Global Checkpoints in a Distributed Computation," IEEE Transactions on Parallel and Distributed Systems, vol. 8, no. 6, pp. 623-627, June 1997, doi:10.1109/71.595580