The Cost of Recovery in Message Logging Protocols
March/April 2000 (vol. 12 no. 2)
pp. 160-173

Abstract—Past research in message logging has focused on studying the relative overhead imposed by pessimistic, optimistic, and causal protocols during failure-free executions. In this paper, we give the first experimental evaluation of the performance of these protocols during recovery. Our results suggest that applications face a complex trade-off when choosing a message logging protocol for fault tolerance. On the one hand, optimistic protocols can provide fast failure-free execution and good performance during recovery, but are complex to implement and can create orphan processes. On the other hand, orphan-free protocols either risk being slow during recovery, e.g., sender-based pessimistic and causal protocols, or incur a substantial overhead during failure-free execution, e.g., receiver-based pessimistic protocols. To address this trade-off, we propose hybrid logging protocols, a new class of orphan-free protocols. We show that hybrid protocols perform within two percent of causal logging during failure-free execution and within two percent of receiver-based logging during recovery.

Index Terms:
Distributed computing, fault tolerance, log-based rollback recovery, pessimistic protocols, optimistic protocols, causal protocols, hybrid protocols.
Sriram Rao, Lorenzo Alvisi, Harrick M. Vin, "The Cost of Recovery in Message Logging Protocols," IEEE Transactions on Knowledge and Data Engineering, vol. 12, no. 2, pp. 160-173, March-April 2000, doi:10.1109/69.842260