The Cost of Recovery in Message Logging Protocols
March/April 2000 (vol. 12 no. 2)
pp. 160-173

Abstract—Past research in message logging has focused on studying the relative overhead imposed by pessimistic, optimistic, and causal protocols during failure-free executions. In this paper, we give the first experimental evaluation of the performance of these protocols during recovery. Our results suggest that applications face a complex trade-off when choosing a message logging protocol for fault tolerance. On the one hand, optimistic protocols can provide fast failure-free execution and good performance during recovery, but are complex to implement and can create orphan processes. On the other hand, orphan-free protocols either risk being slow during recovery, e.g., sender-based pessimistic and causal protocols, or incur a substantial overhead during failure-free execution, e.g., receiver-based pessimistic protocols. To address this trade-off, we propose hybrid logging protocols, a new class of orphan-free protocols. We show that hybrid protocols perform within two percent of causal logging during failure-free execution and within two percent of receiver-based logging during recovery.

Index Terms:
Distributed computing, fault tolerance, log-based rollback recovery, pessimistic protocols, optimistic protocols, causal protocols, hybrid protocols.
Citation:
Sriram Rao, Lorenzo Alvisi, Harrick M. Vin, "The Cost of Recovery in Message Logging Protocols," IEEE Transactions on Knowledge and Data Engineering, vol. 12, no. 2, pp. 160-173, March-April 2000, doi:10.1109/69.842260