Maximizing Mean-Time to Failure in k-Resilient Systems with Repair
February 1997 (vol. 46 no. 2)
pp. 229-234

Abstract: A k-resilient system with N components can tolerate up to k component failures and still function correctly. We consider k-resilient systems in which the number of tolerable component failures is a constant fraction of the total number of components, that is, $k = N/c$, where c is a constant such that 2 ≤ c < ∞. Under a Markovian assumption of constant failure and repair rates, we compute the system size $N_{max}$ at which the mean time to failure (MTTF) of such a system is maximized. Our results indicate that $N_{max}$ can be expressed in terms of the constant c and the parameter ρ as $N_{max} = K(c,\rho)/\rho$, where $\rho = \lambda/\mu$ and K(c, ρ) is a function of c and ρ. In addition, we have found that the variation of $N_{max}$ over the whole range of c is remarkably small. As a result, even if the resilience k of a system as a function of N varies widely, the system size at which the MTTF is maximized lies within the range

$${{0.36} \over \rho} \le N_{max} \le {{0.5} \over \rho}.$$

We validate our results through event-driven simulation and, in addition, examine the behavior of systems with Weibull-distributed failure times.
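The idea of an MTTF-maximizing system size can be explored numerically with a small birth-death Markov chain. The sketch below is only an illustration under assumptions that the abstract does not state: λ is taken to be the per-component failure rate, μ the rate of a single shared repair facility, the system starts with all components working, and it fails when k + 1 components are down. The function names `mttf` and `best_size` are hypothetical. Whether the maximizer found this way falls within the range quoted above depends on how closely these assumptions match the paper's model.

```python
def mttf(n, c, lam, mu):
    """Mean time to failure of an assumed birth-death model.

    State i = number of failed components. Assumptions (not from the
    abstract): failures occur at rate (n - i) * lam, one repair facility
    works at rate mu, and the system fails on reaching state k + 1,
    where k = n // c.
    """
    k = n // c
    total = 0.0   # running sum of expected passage times
    t_prev = 0.0  # expected time to move from state i-1 to state i
    for i in range(k + 1):
        up = (n - i) * lam              # failure rate in state i
        down = mu if i > 0 else 0.0     # no repair when nothing is broken
        # First-passage recursion: T_i = (1 + down * T_{i-1}) / up
        t_i = (1.0 + down * t_prev) / up
        total += t_i
        t_prev = t_i
    return total


def best_size(c, lam, mu, n_limit):
    """System size (a multiple of c, so k = n / c is an integer) with the
    largest MTTF among sizes up to n_limit, found by exhaustive search."""
    return max(range(c, n_limit + 1, c), key=lambda n: mttf(n, c, lam, mu))


if __name__ == "__main__":
    lam, mu = 0.01, 1.0   # illustrative rates, so rho = lam / mu = 0.01
    for c in (2, 4, 8):
        n_best = best_size(c, lam, mu, 400)
        print(c, n_best, n_best * lam / mu)  # last column estimates K(c, rho)
```

Exhaustive search is used here only because the search space is small; the recursion itself runs in time linear in k for each candidate size.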


Index Terms:
Mean time to failure, k-resilient systems, Weibull distribution, Markov chains.
José Fridman, Sampath Rangarajan, "Maximizing Mean-Time to Failure in k-Resilient Systems with Repair," IEEE Transactions on Computers, vol. 46, no. 2, pp. 229-234, Feb. 1997, doi:10.1109/12.565606