Efficient Rollback-Recovery Technique in Distributed Computing Systems
June 1996 (vol. 7 no. 6)
pp. 565-577

Abstract: In this paper we propose a new approach for implementing rollback-recovery in a distributed computing system. The concept of a logical ring is introduced for maintaining the information required for consistent recovery from a system crash. The message processing order of a process is kept by all other processes on its logical ring. Transmission of data messages is accompanied by the circulation of the associated order messages on the ring. The order messages are small. In addition, redundant transmission of order information is avoided, thereby reducing the communication overhead incurred during failure-free operation. Furthermore, the updating of order information and the garbage collection task are simplified in the proposed mechanism. Our approach does not require that information about message processing order be written to stable storage; in fact, the time-consuming operations of saving information in stable storage are confined to the checkpointing activities. When failures occur, a surviving process need roll back only if some preceding order information is totally lost, which is relatively unlikely considering the ever-growing speed of communication networks. It is shown that a system can recover correctly as long as there exists at least one surviving process.
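
The idea of keeping a process's message processing order in the volatile memory of the other processes on its logical ring can be illustrated with a toy simulation. This is only a minimal sketch under stated assumptions, not the protocol from the paper: the names `Process`, `LogicalRing`, `process_message`, `crash`, and `surviving_order` are hypothetical, a single ring holds all processes, each order message makes one full trip around the ring, and checkpointing, data messages, and garbage collection are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    pid: int
    # Volatile copies of OTHER processes' processing order, keyed by owner pid
    # (assumption: nothing here is written to stable storage).
    order_copies: dict = field(default_factory=dict)
    alive: bool = True


class LogicalRing:
    """Toy model: every process is a member of one logical ring."""

    def __init__(self, size):
        self.procs = [Process(pid) for pid in range(size)]

    def successor(self, pid):
        return (pid + 1) % len(self.procs)

    def process_message(self, pid, msg_id):
        # Process `pid` consumes `msg_id`; a small order message then
        # travels once around the ring and each live member records it.
        hop = self.successor(pid)
        while hop != pid:
            member = self.procs[hop]
            if member.alive:
                member.order_copies.setdefault(pid, []).append(msg_id)
            hop = self.successor(hop)

    def crash(self, pid):
        # A crash wipes the volatile state held by that process.
        member = self.procs[pid]
        member.alive = False
        member.order_copies.clear()

    def surviving_order(self, pid):
        # Return the processing order of `pid` from any surviving member,
        # or None if every copy has been lost.
        for member in self.procs:
            if member.alive and member.pid != pid and pid in member.order_copies:
                return list(member.order_copies[pid])
        return None


ring = LogicalRing(4)
ring.process_message(0, "m1")
ring.process_message(0, "m2")
ring.crash(0)
ring.crash(1)
order_after_two_crashes = ring.surviving_order(0)
ring.crash(2)
ring.crash(3)
order_after_all_crashes = ring.surviving_order(0)
```

In this sketch the order information for process 0 remains available after processes 0 and 1 crash, because processes 2 and 3 still hold copies; it is lost only once every holder has failed, which corresponds to the case in which the abstract says rollback becomes necessary.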

Index Terms:
Checkpoint, crash recovery, distributed computing systems, fault tolerance, logical ring, rollback.
Ge-Ming Chiu, Cheng-Ru Young, "Efficient Rollback-Recovery Technique in Distributed Computing Systems," IEEE Transactions on Parallel and Distributed Systems, vol. 7, no. 6, pp. 565-577, June 1996, doi:10.1109/71.506695