loading...
 This Article 
   
 Share 
   
 Bibliographic References 
   
 Add to: 
 
Digg
Furl
Spurl
Blink
Simpy
Google
Del.icio.us
Y!MyWeb
 
 Search 
   
An Efficient Index-Based Checkpointing Protocol with Constant-Size Control Information on Messages
October-December 2005 (vol. 2 no. 4)
pp. 287-296
Communication-induced checkpointing (CIC) protocols can be used to prevent the domino effect. Such protocols that belong to the index-based category were shown to have a better performance. In this paper, we propose an efficient index-based CIC protocol. The fully informed (FI) protocol proposed in the literature has been known to be the best index-based CIC protocol that one can achieve since the optimal protocol needs to acquire the future information. We discover that the enhancement adopted by such a protocol rarely takes effect in practice. By discarding this enhancement, we obtain a new protocol, called NMMP. Simulation results show that our protocol is almost as efficient as FI in some typical computational environments. Especially, we demonstrate that the two protocols have the same behavior over a tree communication network. Surprisingly, NMMP only has to piggyback on each message control information of constant size, regardless of the number of processes.

[1] Y.M. Wang, A. Lowry, and W.K. Fuchs, “Consistent Global Checkpoints Based on Direct Dependency Tracking,” Information Processing Letters, vol. 50, no. 4, pp. 223-230, May 1994.
[2] K.M. Chandy and L. Lamport, “Distributed Snapshots: Determining Global States of Distributed Systems,” ACM Trans. Computing Systems, vol. 3, no. 1, pp. 63-75, Feb. 1985.
[3] B. Randell, “System Structure for Software Fault-Tolerance,” IEEE Trans. Software Eng., vol. 1, no. 2, pp. 220-232, June 1975.
[4] E.N. Elnozahy, L. Alvisi, Y.M. Wang, and D.B. Johnson, “A Survey of Rollback-Recovery Protocols in Message-Passing Systems,” ACM Computing Surveys, vol. 34, no. 3, pp. 375-408, Sept. 2002.
[5] A. Mostefaoui, J.M. Helary, R.H.B. Netzer, and M. Raynal, “Communication-Based Prevention of Useless Checkpoints in Distributed Computations,” Distributed Computing, vol. 13, no. 1, pp. 29-43, Jan. 2000.
[6] R. Koo and S. Toueg, “Checkpointing and Rollback-Recovery for Distributed Systems,” IEEE Trans. Software Eng., vol. 13, no. 1, pp. 23-31, Jan. 1987.
[7] B. Janssens and W.K. Fuchs, “Experimental Evaluation of Multiprocessor Cache-Based Error Recovery,” Proc. Int'l Conf. Parallel Processing, pp. 505-508, 1991.
[8] L. Alvisi, E. Elnozahy, S. Rao, S.A. Husain, and A. De Mel, “An Analysis of Communication-Induced Checkpointing,” Proc. IEEE Fault-Tolerant Computing Symp., pp. 242-249, 1999.
[9] D. Briatico, A. Ciufoletti, and L. Simoncini, “A Distributed Domino-Effect Free Recovery Algorithm,” Proc. Fourth IEEE Symp. Reliability in Distributed Software and Database Systems, pp. 207-215, Oct. 1984.
[10] D. Manivannan and M. Singhal, “A Low Overhead Recovery Technique Using Quasi-Synchronous Checkpointing,” Proc. 16th IEEE Int'l Conf. Distributed Computing Systems, pp. 100-107, May 1996.
[11] G.M.D. Vieira, I.C. Garcia, and L.E. Buzato, “Systematic Analysis of Index-Based Checkpointing Algorithms Using Simulation,” Proc. Ninth Brazilian Symp. Fault-Tolerant Computing, pp. 31-41, 2001.
[12] R. Baldoni, F. Quaglia, and P. Fornara, “An Index-Based Checkpointing Algorithm for Autonomous Distributed Systems,” IEEE Trans. Parallel and Distributed Systems, vol. 10, no. 2, pp. 181-192, Feb. 1999.
[13] I.C. Garcia and L.E. Buzato, “Checkpointing Using Local Knowledge about Recovery Lines,” Technical Report, TR-IC-99-22, Univ. of Campinas, Brazil, 1999.
[14] F. Quaglia, R. Baldoni, and B. Ciciani, “On the No-Z-Cycle Property in Distributed Executions,” J. Computer and Systems Sciences, vol. 61, no. 3, pp. 400-427, Dec. 2000.
[15] D.L. Russell, “State Restoration in Systems of Communicating Processes,” IEEE Trans. Software Eng., vol. 6, no. 2, pp. 183-194, Mar. 1980.
[16] F.T. Leighton, Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes. Morgan Kaufmann, 1991.
[17] D. Doval and D. O'Mahony, “Overlay Networks: A Scalable Alternative for P2P,” IEEE Internet Computing, vol. 7, no. 7, pp. 79-82, July 2003.
[18] D. Comer and D. Stevens, Internetworking with TCP/IP, Vol. III: Client-Server Programming and Applications, Linux/Posix Sockets Version. Prentice Hall, 2001.
[19] L. Lamport, “Time, Clocks and the Ordering of Events in a Distributed System,” Comm. ACM, vol. 21, no. 7, pp. 558-565, July 1978.
[20] R.H.B. Netzer and J. Xu, “Necessary and Sufficient Conditions for Consistent Global Snapshots,” IEEE Trans. Parallel and Distributed System, vol. 6, no. 2, pp. 165-169, Feb. 1995.
[21] R. Baldoni, J.M. Helary, and M. Raynal, “Rollback-Dependency Trackability: Visible Characterizations,” Proc. 18th ACM Symp. Principles of Distributed Computers, pp. 33-42, May 1999.
[22] J.M. Helary, A. Mostefaoui, and M. Raynal, “Virtual Precedence in Asynchronous Systems: Concept and Applications,” Proc. 11th Int'l Workshop Distributed Algorithms, pp. 170-184, Sept. 1997.
[23] Y.M. Wang, “Consistent Global Checkpoints that Contain a Given Set of Local Checkpoints,” IEEE Trans. Computers, vol. 46, no. 4, pp. 456-468, Apr. 1997.
[24] C.J. Fidge, “Logical Time in Distributed Computing Systems,” Computer, vol. 24, no. 8, pp. 11-76, 1991.
[25] J. Tsai, Y.M. Wang, and S.Y. Kuo, “Evaluations of Domino-Free Communication-Induced Checkpointing Protocols,” Information Processing Letters, vol. 69, pp. 31-37, Jan. 1999.
[26] NASA Ames, Research Center, NAS Parallel Benchmarks 2.3, http://science.nas.nasa.gov/SoftwareNPB/, 1997.

Index Terms:
Index Terms- Distributed systems, fault tolerance, domino effect, communication-induced checkpointing, index-based protocols.
Citation:
Jichiang Tsai, "An Efficient Index-Based Checkpointing Protocol with Constant-Size Control Information on Messages," IEEE Transactions on Dependable and Secure Computing, vol. 2, no. 4, pp. 287-296, Oct.-Dec. 2005, doi:10.1109/TDSC.2005.42
Usage of this product signifies your acceptance of the Terms of Use.