This Article 
   
 Share 
   
 Bibliographic References 
   
 Add to: 
 
Digg
Furl
Spurl
Blink
Simpy
Google
Del.icio.us
Y!MyWeb
 
 Search 
   
Manetho: Transparent Roll Back-Recovery with Low Overhead, Limited Rollback, and Fast Output Commit
May 1992 (vol. 41 no. 5)
pp. 526-531

Manetho is a new transparent rollback-recovery protocol for long-running distributed computations. It uses a novel combination of antecedence graph maintenance, uncoordinated checkpointing, and sender-based message logging. Manetho simultaneously achieves the advantages of pessimistic message logging, namely limited rollback and, fast output commit, and the advantage of optimistic message logging, namely low failure-free overhead. These advantages come at the expense of a complex recovery scheme.

[1] A. Borg, W. Blau, W. Graetsch, F. Herrmann, and W. Oberle, "Fault tolerance under UNIX,"ACM Trans. Comput. Syst., vol. 7, no. 1, pp. 1-24, Feb. 1989.
[2] K. M. Chandy and L. Lamport, "Distributed snapshots: Determining global states of distributed systems,"ACM Trans. Comput. Syst., vol. 3, no. 1, pp. 63-75, Feb. 1985.
[3] E. N. Elnozahy and W. Zwaenepoel, "Manetho: A low overhead rollback-recovery system with fast output commit," Tech. Rep. TR91- 152, Rice Univ., Mar. 1991.
[4] F. Jahanian and F. Cristian, "A timestamp-based checkpointing protocol for long-lived distributed computations," inProc. 10th Symp. Reliable Distributed Syst., Bologna, Italy, Sept. 1991, pp. 12-20.
[5] D. B. Johnson and W. Zwaenepoel, "Sender-based message logging," inProc. 17th Int. Symp. Fault-Tolerant Comput., June 1987, pp. 14-19.
[6] D. B. Johnson and W. Zwaenepoel, "Recovery in distributed systems using optimistic message logging and checkpointing,"J. Algorithms, vol. 11, no. 3, pp. 462-491, Sept. 1990.
[7] T. Juang and S. Venkatesan, "Crash recovery with little overhead," inProc. 11th Int. Conf. Distributed Comput. Syst., May 1991, pp. 454-461.
[8] R. Koo and S. Toueg, "Checkpointing and rollback-recovery for distributed systems,"IEEE Trans. Software Eng., vol. SE-13, pp. 23-31, Jan. 1987.
[9] L. Lamport, "Time, clocks, and the ordering of events in a distributed system,"Commun. ACM, vol. 21, no. 7, pp. 558-565, July 1978.
[10] K. Li, J. F. Naughton, and J. S. Plank, "Checkpointing multicomputer applications," inProc. 10th Symp. Reliable Distributed Syst., Oct. 1991, pp. 2-11.
[11] L. L. Peterson, N. Buchholz, and R. D. Schlichting, "Preserving and using context information in interprocess communication,"ACM Trans. Comput. Syst., vol. 7, no. 3, pp. 217-246, Aug. 1989.
[12] M. L. Powell and D. L. Presotto, "Publishing: A reliable broadcast communication mechanism," inProc. 9th ACM Symp. Operat. Syst. Principles, Oct. 1983, pp. 100-109.
[13] B. Randell, "System structure for software fault tolerance,"IEEE Trans. Software Eng., vol. SE-1, pp. 220-232, June 1975.
[14] R. D. Schlichting and F.B. Schneider, "Fail-stop processors: An approach to designing fault-tolerant computing systems,"ACM Trans. Comput. Syst., vol. 1, no. 3, pp. 222-238, Aug. 1983.
[15] A. P. Sistla and J. L. Welch, "Efficient distributed recovery using message logging," inProc. 8th Annu. ACM Symp. Principles Distributed Comput., Aug. 1989, pp. 223-238.
[16] R. E. Strom and S. Yemini, "Optimistic recovery in distributed systems,"ACM Trans. Comput. Syst., vol. 3, no. 3, pp. 204-226, Aug. 1985.
[17] K.-L. Wu and W. K. Fuchs, "Recoverable distributed shared memory,"IEEE Trans. Comput., vol. 39, pp. 460-469, Apr. 1990.
[18] K.-L. Wu, W. K. Fuchs, and J. H. Patel, "Error recovery in shared memory multiprocessors using private caches,"IEEE Trans. Parallel Distributed Syst., vol. 1, pp. 231-240, Apr. 1990.

Index Terms:
Manetho; transparent rollback-recovery protocol; distributed computations; antecedence graph maintenance; uncoordinated checkpointing; sender-based message logging; pessimistic message logging; output commit; optimistic message logging; failure-free overhead; fault tolerant computing; graph theory.
Citation:
E.N. Elnozahy, W. Zwaenepoel, "Manetho: Transparent Roll Back-Recovery with Low Overhead, Limited Rollback, and Fast Output Commit," IEEE Transactions on Computers, vol. 41, no. 5, pp. 526-531, May 1992, doi:10.1109/12.142678
Usage of this product signifies your acceptance of the Terms of Use.