Manetho: Transparent Roll Back-Recovery with Low Overhead, Limited Rollback, and Fast Output Commit
May 1992 (vol. 41 no. 5)
pp. 526-531
DOI Bookmark:
http://doi.ieeecomputersociety.org/10.1109/12.142678
Manetho is a new transparent rollback-recovery protocol for long-running distributed computations. It uses a novel combination of antecedence graph maintenance, uncoordinated checkpointing, and sender-based message logging. Manetho simultaneously achieves the advantages of pessimistic message logging, namely limited rollback and, fast output commit, and the advantage of optimistic message logging, namely low failure-free overhead. These advantages come at the expense of a complex recovery scheme. [1] A. Borg, W. Blau, W. Graetsch, F. Herrmann, and W. Oberle, "Fault tolerance under UNIX,"ACM Trans. Comput. Syst., vol. 7, no. 1, pp. 1-24, Feb. 1989.[2] K. M. Chandy and L. Lamport, "Distributed snapshots: Determining global states of distributed systems,"ACM Trans. Comput. Syst., vol. 3, no. 1, pp. 63-75, Feb. 1985.[3] E. N. Elnozahy and W. Zwaenepoel, "Manetho: A low overhead rollback-recovery system with fast output commit," Tech. Rep. TR91- 152, Rice Univ., Mar. 1991.[4] F. Jahanian and F. Cristian, "A timestamp-based checkpointing protocol for long-lived distributed computations," inProc. 10th Symp. Reliable Distributed Syst., Bologna, Italy, Sept. 1991, pp. 12-20.[5] D. B. Johnson and W. Zwaenepoel, "Sender-based message logging," inProc. 17th Int. Symp. Fault-Tolerant Comput., June 1987, pp. 14-19.[6] D. B. Johnson and W. Zwaenepoel, "Recovery in distributed systems using optimistic message logging and checkpointing,"J. Algorithms, vol. 11, no. 3, pp. 462-491, Sept. 1990.[7] T. Juang and S. Venkatesan, "Crash recovery with little overhead," inProc. 11th Int. Conf. Distributed Comput. Syst., May 1991, pp. 454-461.[8] R. Koo and S. Toueg, "Checkpointing and rollback-recovery for distributed systems,"IEEE Trans. Software Eng., vol. SE-13, pp. 23-31, Jan. 1987.[9] L. Lamport, "Time, clocks, and the ordering of events in a distributed system,"Commun. ACM, vol. 21, no. 7, pp. 558-565, July 1978.[10] K. Li, J. F. Naughton, and J. S. Plank, "Checkpointing multicomputer applications," inProc. 10th Symp. Reliable Distributed Syst., Oct. 1991, pp. 2-11.[11] L. L. Peterson, N. Buchholz, and R. D. Schlichting, "Preserving and using context information in interprocess communication,"ACM Trans. Comput. Syst., vol. 7, no. 3, pp. 217-246, Aug. 1989.[12] M. L. Powell and D. L. Presotto, "Publishing: A reliable broadcast communication mechanism," inProc. 9th ACM Symp. Operat. Syst. Principles, Oct. 1983, pp. 100-109.[13] B. Randell, "System structure for software fault tolerance,"IEEE Trans. Software Eng., vol. SE-1, pp. 220-232, June 1975.[14] R. D. Schlichting and F.B. Schneider, "Fail-stop processors: An approach to designing fault-tolerant computing systems,"ACM Trans. Comput. Syst., vol. 1, no. 3, pp. 222-238, Aug. 1983.[15] A. P. Sistla and J. L. Welch, "Efficient distributed recovery using message logging," inProc. 8th Annu. ACM Symp. Principles Distributed Comput., Aug. 1989, pp. 223-238.[16] R. E. Strom and S. Yemini, "Optimistic recovery in distributed systems,"ACM Trans. Comput. Syst., vol. 3, no. 3, pp. 204-226, Aug. 1985.[17] K.-L. Wu and W. K. Fuchs, "Recoverable distributed shared memory,"IEEE Trans. Comput., vol. 39, pp. 460-469, Apr. 1990.[18] K.-L. Wu, W. K. Fuchs, and J. H. Patel, "Error recovery in shared memory multiprocessors using private caches,"IEEE Trans. Parallel Distributed Syst., vol. 1, pp. 231-240, Apr. 1990.
Index Terms:
Manetho; transparent rollback-recovery protocol; distributed computations; antecedence graph maintenance; uncoordinated checkpointing; sender-based message logging; pessimistic message logging; output commit; optimistic message logging; failure-free overhead; fault tolerant computing; graph theory.
Citation:
E.N. Elnozahy, W. Zwaenepoel, "Manetho: Transparent Roll Back-Recovery with Low Overhead, Limited Rollback, and Fast Output Commit," IEEE Transactions on Computers, vol. 41, no. 5, pp. 526-531, May 1992, doi:10.1109/12.142678
Usage of this product signifies your acceptance of the
Terms of Use.
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||