Yi-Min Wang, Yennun Huang, W. Kent Fuchs, Chandra Kintala, Gaurav Suri, "Progressive Retry for Software Failure Recovery in Message-Passing Applications," IEEE Transactions on Computers, vol. 46, no. 10, pp. 1137-1141, October 1997. doi: 10.1109/12.628398
Keywords: fault tolerance, distributed systems, protocols, checkpointing, logging, rollback recovery, message reordering, recovery escalation, telecommunication systems.
Abstract—A method of execution retry for bypassing software faults in message-passing applications is described in this paper. Based on the techniques of checkpointing and message logging, we demonstrate the use of message replaying and message reordering as two mechanisms for achieving localized and fast recovery. The approach gradually increases the rollback distance and the number of affected processes when a previous retry fails, and is therefore named progressive retry.
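
The escalation idea in the abstract can be illustrated with a minimal Python sketch. Everything named here is hypothetical and chosen only for illustration: the `RetryStep` fields, the `LADDER` contents, and the `attempt` callback are assumptions, not the paper's protocol or its actual step definitions. The sketch only shows the control flow of trying the most local, cheapest retry first and widening the rollback when a retry fails.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class RetryStep:
    """One escalation level (hypothetical structure, not taken from the paper)."""
    name: str
    rollback_distance: int  # how many checkpoints back the rollback goes
    processes: int          # how many processes take part in the rollback
    reorder: bool           # False: replay logged messages; True: reorder them


def progressive_retry(steps: Sequence[RetryStep],
                      attempt: Callable[[RetryStep], bool]) -> Optional[RetryStep]:
    """Try each step in order; return the first step whose retry succeeds."""
    for step in steps:
        if attempt(step):
            return step
    return None  # every retry failed; the caller must fall back to something else


# Illustrative ladder: each level rolls back at least as far, and involves
# at least as many processes, as the one before it.
LADDER = [
    RetryStep("local replay", 1, 1, False),
    RetryStep("local reorder", 1, 1, True),
    RetryStep("wider replay", 1, 2, False),
    RetryStep("wider reorder", 1, 2, True),
    RetryStep("large-scope rollback", 2, 3, True),
]

if __name__ == "__main__":
    # Toy fault that is bypassed only by reordering across two or more processes.
    winner = progressive_retry(LADDER, lambda s: s.reorder and s.processes >= 2)
    print(winner.name if winner else "no retry succeeded")
```

In this toy run the first three levels fail and the fourth succeeds, so the wider and more expensive final level is never attempted, which is the point of escalating gradually.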

