An Optimal Retry Policy Based on Fault Classification
September 1994 (vol. 43 no. 9)
pp. 1014-1025

An optimal retry policy in a computer system is usually derived under the unrealistic assumption that fault characteristics are known a priori and remain unchanged throughout the mission lifetime. In such a case, the optimal retry period depends only upon the system's status at the time of fault detection. We propose to remedy this deficiency by formulating the optimal retry problem as a Bayesian decision problem in which not only the time of fault detection but also the results of earlier retries are used to estimate the current fault characteristics. Previous knowledge about fault characteristics is represented by the prior distributions of fault-related parameters, which are updated whenever new samples are obtained from the retry and detection mechanisms. A new fault classification scheme is proposed to assign a temporal fault type (i.e., permanent, intermittent, or transient) to each detected fault so that the corresponding fault parameters can be estimated. The estimated fault parameters are then used to derive the optimal retry period that minimizes the mean task completion time. Efficient algorithms are developed to determine the optimal retry period online upon detection of each fault. The proposed retry policy is compared with a number of fixed-retry-period policies and is always found to outperform them.
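
The idea in the abstract (update a belief about the fault type from retry outcomes, then pick the retry period that minimizes expected time lost) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the model or algorithm of the paper: the exponential disappearance rates in `RATES`, the fixed `fallback_cost` charged when retry is abandoned, the grid search, and all function names are hypothetical.

```python
import math

# Hypothetical rates at which an active fault disappears (per unit time).
# A permanent fault never disappears, so its rate is 0. These values are
# illustrative assumptions, not parameters taken from the paper.
RATES = {"permanent": 0.0, "intermittent": 0.5, "transient": 5.0}


def update_posterior(prior, failed_retry_time):
    """Bayes update of the fault-type belief after a retry of the given
    length failed, i.e. the fault was still active at the end of it."""
    # Likelihood of "still active after t" under each type is exp(-rate * t).
    unnorm = {k: prior[k] * math.exp(-RATES[k] * failed_retry_time) for k in prior}
    total = sum(unnorm.values())
    return {k: v / total for k, v in unnorm.items()}


def expected_cost(belief, r, fallback_cost):
    """Expected time lost when retrying for at most r time units and then
    falling back to a costlier recovery (assumed to take fallback_cost)."""
    cost = 0.0
    for k, p in belief.items():
        mu = RATES[k]
        if mu == 0.0:
            # The retry can never succeed: the full period plus the fallback.
            cost += p * (r + fallback_cost)
        else:
            survive = math.exp(-mu * r)  # P(fault still active after r)
            # E[min(X, r)] for exponential X, plus fallback if the retry fails.
            cost += p * ((1.0 - survive) / mu + fallback_cost * survive)
    return cost


def best_retry_period(belief, fallback_cost, grid):
    """Pick the retry period on the grid with the smallest expected cost."""
    return min(grid, key=lambda r: expected_cost(belief, r, fallback_cost))


if __name__ == "__main__":
    prior = {"permanent": 0.1, "intermittent": 0.3, "transient": 0.6}
    grid = [i * 0.1 for i in range(101)]
    posterior = update_posterior(prior, failed_retry_time=1.0)
    print(posterior)
    print(best_retry_period(prior, 10.0, grid))
    print(best_retry_period(posterior, 10.0, grid))
```

Under these assumptions, a failed retry shifts belief away from the transient type and toward the permanent type, and the retry period is then recomputed from the updated belief instead of being fixed in advance. A belief concentrated entirely on permanent faults yields a retry period of zero, since retrying can only add delay.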

Index Terms:
minimisation; system recovery; failure analysis; Bayes methods; decision theory; parameter estimation; fault tolerant computing; error detection; optimal retry policy; fault classification; fault characteristics; mission lifetime; system status; fault detection; Bayesian decision problem; prior distributions; fault-related parameter updating; temporal fault type; permanent faults; intermittent faults; transient faults; fault parameter estimation; optimal retry period; mean task completion time minimization; error recovery.
Tein-Hsiang Lin, K.G. Shin, "An Optimal Retry Policy Based on Fault Classification," IEEE Transactions on Computers, vol. 43, no. 9, pp. 1014-1025, Sept. 1994, doi:10.1109/12.312112