Probabilistic Evaluation of Online Checks in Fault-Tolerant Multiprocessor Systems
May 1992 (vol. 41 no. 5)
pp. 532-541

The analysis of fault-tolerant multiprocessor systems that use concurrent error detection (CED) schemes is much more difficult than the analysis of conventional fault-tolerant architectures. Various analytical techniques have been proposed to evaluate CED schemes deterministically. However, these approaches rest on worst-case assumptions about the failure of system components, so the evaluation results often do not reflect the actual fault tolerance capabilities of the system. A probabilistic approach is developed to evaluate the fault-detecting and fault-locating capabilities of online checks in a system. The various probabilities associated with the checking schemes are identified and used in the framework of the matrix-based model. Based on these probabilistic matrices, estimates for the fault tolerance capabilities of various systems are derived analytically.
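
The contrast between worst-case and probabilistic evaluation can be illustrated with a small sketch. It is not the model from the paper: the independence of processor failures, the single `coverage` value shared by all checks, and the names `detection_probability`, `checks`, `p_fail`, and `coverage` are all assumptions made here for illustration. Each check is given as the set of processors whose output it examines, and the function enumerates every fault pattern to estimate how likely it is that at least one check flags a fault.

```python
from itertools import product


def detection_probability(checks, n_processors, p_fail, coverage):
    """Probability that at least one check fires, given at least one fault.

    Illustrative assumptions (not taken from the paper):
      - each processor fails independently with probability p_fail;
      - a check that examines at least one faulty processor fires
        with probability `coverage`, independently of other checks.
    """
    detected = 0.0   # probability mass of fault patterns that are detected
    any_fault = 0.0  # probability mass of all patterns with >= 1 fault
    for pattern in product([0, 1], repeat=n_processors):
        n_faulty = sum(pattern)
        if n_faulty == 0:
            continue  # fault-free case is excluded by the conditioning
        prob = p_fail ** n_faulty * (1 - p_fail) ** (n_processors - n_faulty)
        # Count the checks that examine at least one faulty processor.
        active = sum(1 for chk in checks if any(pattern[i] for i in chk))
        p_detect = 1 - (1 - coverage) ** active
        detected += prob * p_detect
        any_fault += prob
    return detected / any_fault


if __name__ == "__main__":
    # Three overlapping checks over four processors (hypothetical layout).
    checks = [{0, 1}, {1, 2}, {2, 3}]
    print(detection_probability(checks, 4, p_fail=0.01, coverage=0.95))
```

A deterministic analysis of the same layout would only report the largest number of faults guaranteed to be caught under worst-case behavior, whereas a calculation of this kind weights each fault pattern by how likely it is. The exhaustive enumeration is exponential in the number of processors and is meant only for small examples.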

Index Terms:
probabilistic evaluation; fault detection; fault location; fault-tolerant multiprocessor systems; concurrent error detection; online checks; matrix-based model; probabilistic matrices; fault tolerant computing; multiprocessing systems; probability.
V.S.S. Nair, Y.V. Hoskote, J.A. Abraham, "Probabilistic Evaluation of Online Checks in Fault-Tolerant Multiprocessor Systems," IEEE Transactions on Computers, vol. 41, no. 5, pp. 532-541, May 1992, doi:10.1109/12.142679