Diagnosability and Diagnosis of Algorithm-Based Fault-Tolerant Systems
August 1993 (vol. 42 no. 8)
pp. 924-937

Parallel processing architectures are commonly used for signal processing and other computationally intensive applications. These applications are characterized by high throughput and long processing periods. Such characteristics decrease the reliability of high-performance architectures. The erroneous data produced by faulty processors could have damaging consequences, particularly in critical real-time applications. It is therefore desirable that any erroneous data produced by the system be detected and located as quickly as possible. Algorithm-based fault tolerance (ABFT) is a low-cost system-level concurrent error detection and fault location scheme. Methods used in the analysis of multiprocessor systems using system-level diagnosis are applied to the analysis of ABFT systems. A new algorithm for analyzing an ABFT system for its fault diagnosability is developed using these methods. Based on this work, a fault diagnosis algorithm is developed for ABFT systems.
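
As an illustration of what an ABFT check looks like in practice, the following is a minimal sketch of the well-known checksum scheme for matrix multiplication: the inputs are encoded with an extra checksum row and column, and the product is then checked against its own checksums. It is not the diagnosability or diagnosis algorithm developed in this paper. The function names (`matmul`, `column_checksum`, `row_checksum`, `locate_error`) and the tolerance `tol` are hypothetical, chosen only for this example, and the locator assumes at most one corrupted data element.

```python
def matmul(a, b):
    """Plain matrix product of two lists-of-lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def column_checksum(a):
    """Append a row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]


def row_checksum(b):
    """Append a column holding the sum of each row of b."""
    return [row + [sum(row)] for row in b]


def locate_error(c_full, tol=1e-9):
    """Check a full-checksum product matrix.

    Returns None if every row and column check passes, or (row, col)
    of the single inconsistent data element. Assumption for this sketch:
    at most one data element is corrupted; anything else raises.
    """
    n = len(c_full) - 1      # number of data rows
    m = len(c_full[0]) - 1   # number of data columns
    bad_rows = [i for i in range(n)
                if abs(sum(c_full[i][:m]) - c_full[i][m]) > tol]
    bad_cols = [j for j in range(m)
                if abs(sum(c_full[i][j] for i in range(n)) - c_full[n][j]) > tol]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return (bad_rows[0], bad_cols[0])
    raise ValueError("error pattern is not a single data-element fault")


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    c_full = matmul(column_checksum(a), row_checksum(b))
    print(locate_error(c_full))   # fault-free product
    c_full[1][0] += 5             # inject an error into one data element
    print(locate_error(c_full))   # the failing row and column checks intersect at the fault
```

Each row and column sum acts as one check over a set of data elements. The intersection of the failing checks identifies the faulty element, which is the kind of check-to-data relationship that system-level diagnosis methods analyze.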

Index Terms:
diagnosability; parallel processing architectures; diagnosis; algorithm-based fault-tolerant systems; signal processing; faulty processors; concurrent error detection; fault location scheme; multiprocessor systems; system-level diagnosis; fault tolerant computing; parallel processing.
Citation:
B. Vinnakota, N.K. Jha, "Diagnosability and Diagnosis of Algorithm-Based Fault-Tolerant Systems," IEEE Transactions on Computers, vol. 42, no. 8, pp. 924-937, Aug. 1993, doi:10.1109/12.238483