Almost Certain Fault Diagnosis Through Algorithm-Based Fault Tolerance
May 1994 (vol. 5 no. 5)
pp. 532-539

Algorithm-based fault tolerance has been proposed as a technique to detect incorrect computations in multiprocessor systems. In algorithm-based fault tolerance, processors produce data elements that are checked by concurrent error detection mechanisms. We investigate the efficacy of this approach for diagnosis of processor faults. Because checks are performed on data elements, the problem of location of data errors must first be solved. We propose a probabilistic model for the faults and errors in a multiprocessor system and use it to evaluate the probabilities of correct error location and fault diagnosis. We investigate the number of checks that are necessary to guarantee error location with high probability. We also give specific check assignments that accomplish this goal. We then consider the problem of fault diagnosis when the locations of erroneous data elements are known. Previous work on fault diagnosis required that the data sets produced by different processors be disjoint. We show, for the first time, that fault diagnosis is possible with high probability, even in systems where processors combine to produce individual data elements.


Index Terms:
error detection; multiprocessing systems; fault tolerant computing; reliability; failure analysis; almost certain fault diagnosis; algorithm-based fault tolerance; multiprocessor systems; incorrect computations; concurrent error detection mechanisms; erroneous data elements; concurrent error detection; probabilistic analysis
D.M. Blough, A. Pelc, "Almost Certain Fault Diagnosis Through Algorithm-Based Fault Tolerance," IEEE Transactions on Parallel and Distributed Systems, vol. 5, no. 5, pp. 532-539, May 1994, doi:10.1109/71.282563