On Properties of RDT Communication-Induced Checkpointing Protocols
August 2003 (vol. 14 no. 8)
pp. 755-764

Abstract: Rollback-Dependency Trackability (RDT) is a property stating that all rollback dependencies between local checkpoints are online trackable by using a transitive dependency vector. The most crucial RDT characterizations introduced in the literature can be represented as certain types of RDT-PXCM-paths. Let U-path and V-path denote any two types of RDT-PXCM-paths. In this paper, we investigate several properties of communication-induced checkpointing protocols that ensure the RDT property. First, we prove that if an online RDT protocol encounters a U-path at a point of a checkpoint and communication pattern associated with a distributed computation, it also encounters a V-path there. Moreover, if the encountered U-path is invisibly doubled, the corresponding V-path is invisibly doubled as well. We therefore conclude that, for an online RDT protocol, breaking all invisibly doubled U-paths is equivalent to breaking all invisibly doubled V-paths. Next, we show that a visibly doubled U-path must contain a doubled U-cycle in the causal past. From these results, we further deduce that some different checkpointing protocols actually exhibit the same behavior for all possible patterns. Finally, we present a systematic technique for comparing the performance of online RDT protocols.
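The abstract refers to tracking rollback dependencies online with a transitive dependency vector. The following is a minimal sketch of that bookkeeping only, not the protocol or proofs of the paper. It assumes the usual scheme in which each process piggybacks its vector on every message and merges incoming vectors by component-wise maximum. The class name `Process`, its methods, and the indexing convention (entry `pid` counts the checkpoints taken so far) are hypothetical choices made for illustration. The sketch takes no forced checkpoints, so on its own it does not guarantee RDT.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """Hypothetical process that maintains a transitive dependency vector."""
    pid: int
    n: int
    tdv: list = field(default_factory=list)
    checkpoints: list = field(default_factory=list)

    def __post_init__(self):
        self.tdv = [0] * self.n
        self.take_checkpoint()  # initial checkpoint of this process

    def take_checkpoint(self):
        # Save the current vector with the checkpoint, then open a new interval.
        self.checkpoints.append(list(self.tdv))
        self.tdv[self.pid] += 1

    def send(self):
        # The vector is piggybacked on every outgoing message.
        return list(self.tdv)

    def receive(self, piggybacked):
        # Merge by component-wise maximum, so dependencies propagate transitively.
        self.tdv = [max(a, b) for a, b in zip(self.tdv, piggybacked)]


# Assumed scenario: P0 sends to P1, then P1 sends to P2, then P2 checkpoints.
p0, p1, p2 = (Process(pid=i, n=3) for i in range(3))
p1.receive(p0.send())
p2.receive(p1.send())
p2.take_checkpoint()
print(p2.checkpoints[-1])  # vector saved with P2's new checkpoint
```

In this scenario the vector saved with P2's checkpoint records a dependency on P0 even though P0 never sent a message to P2 directly, which is the transitive part of the tracking. An actual RDT protocol would additionally decide, on each message arrival, whether to take a forced checkpoint before delivery.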


Index Terms:
Distributed systems, fault tolerance, rollback-dependency trackability, communication-induced checkpointing protocols, rollback-recovery.
Citation:
Jichiang Tsai, "On Properties of RDT Communication-Induced Checkpointing Protocols," IEEE Transactions on Parallel and Distributed Systems, vol. 14, no. 8, pp. 755-764, Aug. 2003, doi:10.1109/TPDS.2003.1225055