Progressive Retry for Software Failure Recovery in Message-Passing Applications
October 1997 (vol. 46 no. 10)
pp. 1137-1141

Abstract—A method of execution retry for bypassing software faults in message-passing applications is described in this paper. Based on the techniques of checkpointing and message logging, we demonstrate the use of message replaying and message reordering as two mechanisms for achieving localized and fast recovery. The approach gradually increases the rollback distance and the number of affected processes when a previous retry fails, and is therefore named progressive retry. Examples from telecommunications software systems and performance measurements from an application-level implementation are described to illustrate the benefits of the scheme.
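
The escalation idea in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: it models a single process only, so it shows the growing rollback distance but not the growing number of affected processes. The names `progressive_retry`, `run`, and `max_reorders` are hypothetical, and `run` stands in for whatever hook restores a checkpoint, delivers logged messages in a given order, and reports success. For each rollback distance the sketch first replays the undone messages in their logged order, then tries a bounded number of reorderings, and only then rolls back further.

```python
from itertools import islice, permutations
from typing import Callable, List, Optional, Sequence, Tuple

Message = str


def progressive_retry(
    log: Sequence[Message],
    run: Callable[[Sequence[Message]], bool],
    max_reorders: int = 10,
) -> Optional[Tuple[int, List[Message]]]:
    """Return (rollback_distance, delivery_order) of the first successful retry.

    `log` holds the messages delivered since the last checkpoint, oldest first.
    `run` is a hypothetical hook that restores the checkpoint, delivers the
    given messages in order, and reports whether execution succeeded.
    """
    for distance in range(1, len(log) + 1):
        # Undo only the most recent `distance` messages; keep the rest as logged.
        kept, undone = list(log[:-distance]), list(log[-distance:])

        # Cheapest retry first: replay the undone messages in logged order.
        if run(kept + undone):
            return distance, kept + undone

        # Then try a bounded number of alternative orders of the same messages.
        for candidate in islice(permutations(undone), max_reorders):
            order = kept + list(candidate)
            if list(candidate) != undone and run(order):
                return distance, order

    return None  # every retry failed; a real system would escalate further
```

With a toy `run` that fails whenever message "b" is delivered immediately before "c", the log `["a", "b", "c"]` recovers at rollback distance 2 with delivery order `["a", "c", "b"]`: replay and distance 1 both fail, and the first reordering at distance 2 bypasses the fault.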

Index Terms:
Fault tolerance, distributed systems, protocols, checkpointing, logging, rollback recovery, message reordering, recovery escalation, telecommunication systems.
Yi-Min Wang, Yennun Huang, W. Kent Fuchs, Chandra Kintala, Gaurav Suri, "Progressive Retry for Software Failure Recovery in Message-Passing Applications," IEEE Transactions on Computers, vol. 46, no. 10, pp. 1137-1141, Oct. 1997, doi:10.1109/12.628398