Determination of an Optimal Retry Time in Multiple-Module Computing Systems
March 1996 (vol. 45 no. 3)
pp. 374-379

Abstract: The amount of time (or the number of attempts) that is optimal, in some sense, for retrying an instruction after an error is detected in a computing system is usually determined under the assumption that the system consists of a single module, within which all fault activities are confined until some module-replacement action is taken. However, a computing system usually comprises at least three modules (CPU, memory, and I/O), and executing an instruction often requires the cooperation of two or more of them. It is therefore more realistic to consider fault activities in multiple-module systems.

In this paper, we first relax the single-module assumption and model the fault activities in a multiple-module system as a Markov process. We apply the randomization method to decompose the continuous-time Markov chain into a discrete-time Markov chain subordinated to a Poisson process. Using this decomposition, we derive several measures of interest: 1) the conditional probability of a successful retry, given a retry period and given that a non-permanent fault has occurred, 2) the mean time to system recovery, and 3) the distribution of the time until all modules in the system are in a fault-free state. Together with the parameters characterizing fault activities and the costs of recovery techniques, these measures are used to determine a) whether retry should be used as the first-step recovery means upon detection of an error, and b) the best retry period or number of retries that satisfies a given criterion, e.g., a specified probability of successful retry.
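
To make the randomization step concrete, the following minimal sketch computes transient state probabilities of a small continuous-time Markov chain by summing Poisson-weighted powers of a discrete-time transition matrix. It is an illustration only, not the model from the paper: the three-state chain, the rates PROPAGATE and VANISH, and the function names transient_distribution and success_probability are all assumptions introduced here. State 2 (all modules fault-free) is treated as absorbing, so its probability at time t plays the role of a probability of successful retry within a retry period t.

```python
import math


def transient_distribution(Q, p0, t, tol=1e-12, max_terms=100000):
    """Randomization (uniformization): p(t) = sum_k Poisson(k; rate*t) * p0 * P^k,
    where P = I + Q/rate and rate bounds every state's exit rate."""
    n = len(Q)
    rate = max(-Q[i][i] for i in range(n))
    if rate == 0.0 or t == 0.0:
        return list(p0)
    # Discrete-time chain subordinated to a Poisson process with this rate.
    P = [[(1.0 if i == j else 0.0) + Q[i][j] / rate for j in range(n)]
         for i in range(n)]
    weight = math.exp(-rate * t)      # Poisson probability of zero jumps
    v = list(p0)                      # p0 * P^k, starting at k = 0
    result = [weight * x for x in v]
    total = weight                    # accumulated Poisson mass
    k = 0
    while 1.0 - total > tol and k < max_terms:
        k += 1
        v = [sum(v[i] * P[i][j] for i in range(n)) for j in range(n)]
        weight *= rate * t / k        # Poisson probability of k jumps
        total += weight
        result = [r + weight * x for r, x in zip(result, v)]
    return result


# Hypothetical toy model (rates are illustrative, not taken from the paper):
#   state 0: non-permanent fault active in one module
#   state 1: fault has propagated to a second module
#   state 2: all modules fault-free (absorbing)
PROPAGATE, VANISH = 0.5, 2.0
Q = [
    [-(PROPAGATE + VANISH), PROPAGATE, VANISH],
    [VANISH, -VANISH, 0.0],
    [0.0, 0.0, 0.0],
]


def success_probability(retry_period):
    """Probability that all modules are fault-free by the end of the retry period."""
    return transient_distribution(Q, [1.0, 0.0, 0.0], retry_period)[2]


if __name__ == "__main__":
    for period in (0.5, 1.0, 2.0):
        print(period, round(success_probability(period), 4))
```

Under these assumptions, a retry period meeting a target success probability could be found by increasing the period until success_probability reaches the target. The sketch starts the Poisson weights from exp(-rate * t), which underflows when rate * t is very large, so a production implementation would need a more careful weight computation.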

Index Terms:
Fault-tolerance, error recovery, instruction retry, Markov models, randomization.
Chao-Ju Hou, Kang G. Shin, "Determination of an Optimal Retry Time in Multiple-Module Computing Systems," IEEE Transactions on Computers, vol. 45, no. 3, pp. 374-379, March 1996, doi:10.1109/12.485576