Communication-Induced Determination of Consistent Snapshots
September 1999 (vol. 10 no. 9)
pp. 865-877

Abstract: A classical way to determine consistent snapshots is to use the Chandy-Lamport algorithm. This algorithm relies on specific control messages that allow processes to synchronize local checkpoint determination and message recording so that the resulting snapshot is consistent. This paper investigates a communication-induced approach to determining consistent snapshots. In such an approach, control information is carried by application messages rather than by dedicated control messages. Two abstract necessary and sufficient conditions are stated: one associated with global checkpoint consistency, the other associated with message recording. A general protocol is derived from these abstract conditions. This general protocol can be instantiated in distinct ways, giving rise to a family of communication-induced snapshot protocols. It also shows that there is an intrinsic trade-off between the number of forced checkpoints and the number of recorded messages. Finally, a particular instantiation of the general protocol is provided.
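To make the idea of control information carried by application messages concrete, here is a minimal Python sketch of a classic index-based, communication-induced checkpointing scheme. It is not the protocol derived in the paper, and it covers only the checkpoint side, not message recording. The names `Message`, `Process`, `sn`, `basic_checkpoint`, `send`, and `receive` are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    payload: str
    sn: int  # piggybacked control information: sender's checkpoint index


@dataclass
class Process:
    name: str
    sn: int = 0  # index of this process's latest local checkpoint
    checkpoints: list = field(default_factory=list)  # (index, kind) pairs

    def __post_init__(self):
        # Every process starts with an initial checkpoint of index 0.
        self.checkpoints.append((0, "initial"))

    def basic_checkpoint(self):
        # A checkpoint taken on the process's own initiative.
        self.sn += 1
        self.checkpoints.append((self.sn, "basic"))

    def send(self, payload):
        # No separate control message: the index rides on the application message.
        return Message(payload, self.sn)

    def receive(self, msg):
        # If the sender is ahead, take a forced checkpoint BEFORE delivery,
        # so the message is never received "before" a checkpoint whose index
        # is at least that of the checkpoint preceding its send.
        if msg.sn > self.sn:
            self.sn = msg.sn
            self.checkpoints.append((self.sn, "forced"))
        return msg.payload  # deliver to the application


# Illustrative run with two hypothetical processes.
p, q = Process("p"), Process("q")
p.basic_checkpoint()          # p now has index 1
m = p.send("hello")           # message carries sn = 1
q.receive(m)                  # q is behind, so it takes a forced checkpoint
```

In this sketch every message that reveals a higher index forces a checkpoint at the receiver. A protocol that also records messages could, in principle, trade some of those forced checkpoints for recorded messages, which is the kind of trade-off the abstract refers to.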


Index Terms:
Asynchronous distributed computation, checkpointing, communication-induced protocol, consistency, global checkpoint, message recording, snapshot.
Citation:
Jean-Michel Hélary, Achour Mostefaoui, Michel Raynal, "Communication-Induced Determination of Consistent Snapshots," IEEE Transactions on Parallel and Distributed Systems, vol. 10, no. 9, pp. 865-877, Sept. 1999, doi:10.1109/71.798312