Message Logging: Pessimistic, Optimistic, Causal, and Optimal
February 1998 (vol. 24 no. 2)
pp. 149-159

Abstract: Message-logging protocols are an integral part of a popular technique for implementing processes that can recover from crash failures. All message-logging protocols require that, when recovery is complete, there be no orphan processes, which are surviving processes whose states are inconsistent with the recovered state of a crashed process. We give a precise specification of the consistency property "no orphan processes." From this specification, we describe how different existing classes of message-logging protocols (namely optimistic, pessimistic, and a class that we call causal) implement this property. We then propose a set of metrics to evaluate the performance of message-logging protocols, and characterize the protocols that are optimal with respect to these metrics. Finally, starting from a protocol that relies on causal delivery order, we show how to derive optimal causal protocols that tolerate f overlapping failures and recoveries for a parameter f, 1 ≤ f ≤ n.
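
The orphan condition described in the abstract can be illustrated with a small Python sketch. This is a simplified model under stated assumptions, not the paper's formal specification: the `Message` record, its `depend`, `log`, and `stable` fields, and the `orphans` function are hypothetical names introduced here. The sketch treats a surviving process as an orphan when its state depends on a message delivery whose record was held only by crashed processes and was never written to stable storage.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    """Hypothetical record for one delivered message (illustrative names only)."""
    name: str
    depend: set = field(default_factory=set)  # processes whose state depends on the delivery
    log: set = field(default_factory=set)     # processes holding the delivery record in volatile memory
    stable: bool = False                      # True once the record is on stable storage


def orphans(messages, crashed):
    """Return surviving processes that depend on a delivery whose record is lost."""
    result = set()
    for m in messages:
        # The record is lost if it is not stable and every holder has crashed.
        lost = not m.stable and m.log <= crashed
        if lost:
            result |= m.depend - crashed
    return result


msgs = [
    Message("m1", depend={"p", "q"}, log={"p"}),       # q depends on m1, but only p logged it
    Message("m2", depend={"q", "r"}, log={"q", "r"}),  # logged by every process that depends on it
    Message("m3", depend={"p", "r"}, log={"p"}, stable=True),
]
print(sorted(orphans(msgs, crashed={"p"})))  # prints ['q']
```

In this model, a protocol that never lets a process depend on a delivery it has not logged (or that forces the record to stable storage first) leaves `orphans` empty for any crash set, which is the flavor of guarantee the abstract attributes to the different protocol classes.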

Index Terms:
Message logging, optimistic protocols, pessimistic protocols, checkpoint-restart protocols, resilient processes, specification of fault-tolerance techniques.
Lorenzo Alvisi, Keith Marzullo, "Message Logging: Pessimistic, Optimistic, Causal, and Optimal," IEEE Transactions on Software Engineering, vol. 24, no. 2, pp. 149-159, Feb. 1998, doi:10.1109/32.666828