This Article 
   
 Share 
   
 Bibliographic References 
   
 Add to: 
 
Digg
Furl
Spurl
Blink
Simpy
Google
Del.icio.us
Y!MyWeb
 
 Search 
   
Consistency Issues in Distributed Checkpoints
March/April 1999 (vol. 25 no. 2)
pp. 274-281

Abstract—A global checkpoint is a set of local checkpoints, one per process. The traditional consistency criterion for global checkpoints states that a global checkpoint is consistent if it does not include messages received and not sent. This paper investigates other consistency criteria, transitlessness, and strong consistency. A global checkpoint is transitless if it does not exhibit messages sent and not received. Transitlessness can be seen as a dual of traditional consistency. Strong consistency is the addition of transitlessness to traditional consistency. The main result of this paper is a statement of the necessary and sufficient condition answering the following question: "Given an arbitrary set of local checkpoints, can this set be extended to a global checkpoint that satisfies$\cal P$" (where $\cal P$ is traditional consistency, transitlessness, or strong consistency). From a practical point of view, this condition, when applied to transitlessness, is particularly interesting as it helps characterize which messages do not need to be recorded by checkpointing protocols.

[1] Ö. Babao$\breve{\rm g}$lu, E. Fromentin, and M. Raynal, "Unified Framework for Expressing and Detecting Run-Time Properties of Distributed Computations," J. Systems and Software, special issue on Software Eng. for Distributed Computing, vol. 33, no. 3, pp. 287-298, June 1996.
[2] R. Baldoni, J.M. Helary, A. Mostefaoui, and M. Raynal, "A Communication-Induced Checkpointing Protocol that Ensures Rollback-Dependency Trackability," Proc. IEEE Int'l Symp. Fault Tolerant Computing, pp. 68-77, 1997.
[3] K.M. Chandy and L. Lamport, "Distributed Snapshots: Determining Global States of Distributed Systems," ACM Trans. Computer Systems, Feb. 1985.
[4] E.N. Elnozahy, D.B. Johnson, and Y.M. Wang, "A Survey of Rollback-Recovery Protocols in Message-Passing Systems," Technical Report CMU-CS-96-181, Carnegie Mellon Univ., 1996.
[5] J. Fowler and W. Zwaenepoel, "Causal Distributed Breakpoints," Proc. 10th Int'l Conf. Distributed Computing Systems, pp. 134-141, 1990.
[6] J.M. Hélary, C. Jard, N. Plouzeau, and M. Raynal, "Detection of Stable Properties in Distributed Applications," Proc. Sixth ACM Symp. Principles of Distributed Computing, pp. 125-136,Vancouver, 1987.
[7] J.M. Helary, A. Mostefaoui, and M. Raynal, "Communication-Induced Determination of Consistent Snapshots," Proc. Int'l Symp. Fault-Tolerant Computing, FTCS-28, pp. 208-217,Munich, Germany, June 1998.
[8] D. B. Johnson and W. Zwaenepoel,“Recovery in distributed systems using optimistic message logging and checkpointing,”J. Algorithms, vol. 11, pp. 462–491, 1990.
[9] L. Lamport, "Time, clocks and the ordering of events in a distributed system," Comm. ACM, vol. 21, no. 7, pp. 558-565, July 1978.
[10] T.H. Lai and T.H. Yang, "On Distributed Snapshots," Information Processing Letters, pp. 153-158, May 1987.
[11] D. Manivannan, R. Netzer, and Mukesh Singhal, "Finding Consistent Global Checkpoints in a Distributed Computation," IEEE Trans. Parallel and Distributed Systems, vol. 8, no. 6, pp. 623-627, June 1997.
[12] B.P. Miller and J. Choi, "Breakpoints and Halting in Distributed Programs," in Proc. Int'l Conf. Distributed Computing Systems, IEEE CS Press, 1988, pp. 316-323.
[13] A. Mostefaoui and M. Raynal, "Efficient Message Logging for Uncoordinated Checkpointing Protocols," Proc. Second European Dependable Computing Conf. (EDCC'2), pp. 353-364, Lecture Notes In Computer Science 1150, Taormina, Italy, Springer-Verlag, Oct. 1996.
[14] R.H.B. Netzer and J. Xu, "Necessary and Sufficient Conditions for Consistent Global Snapshots," IEEE Trans. Parallel and Distributed System, vol. 6, no. 2, pp. 165-169, Feb. 1995.
[15] R.H.B. Netzer and Y. Xu, “Replaying Distributed Programs without Message Logging,” Proc Sixth Int'l IEEE Symp. High Performance Distributed Computing (HPDC6), Portland, Ore., Aug. 1997.
[16] R. Schwarz and F. Mattern, "Detecting Causal Relationships in Distributed Computations: In Search of the Holy Grail," Distributed Computing, vol. 7, pp. 149-174, 1994.
[17] Y.M. Wang and W.K. Fuchs, "Optimistic Message Logging for Independent Checkpointing in Message-Passing Systems," Proc. IEEE Symp. Reliable Distributed Systems, Oct. 1992.
[18] Y.-M. Wang and W.K. Fuchs, "Optimal Message Log Reclamation for Uncoordinated Checkpointing," Fault-Tolerant Parallel and Distributed Systems, Pradhan, D.R. Avresky, eds., IEEE CS Press, Los Alamitos Calif., 1995.
[19] Y. Wang, "Consistent Global Checkpoints that Contain a Given Set of Local Checkpoints," IEEE Trans. Computers, vol. 46, no. 4, pp. 456-468, Apr. 1997.

Index Terms:
Checkpointing, consistency, strong consistency, transitlessness, distributed systems, fault-tolerance, rollback recovery.
Citation:
Jean-Michel Hélary, Robert H.B. Netzer, Michel Raynal, "Consistency Issues in Distributed Checkpoints," IEEE Transactions on Software Engineering, vol. 25, no. 2, pp. 274-281, March-April 1999, doi:10.1109/32.761450
Usage of this product signifies your acceptance of the Terms of Use.