Design of Algorithm-Based Fault-Tolerant Multiprocessor Systems for Concurrent Error Detection and Fault Diagnosis
October 1994 (vol. 5 no. 10)
pp. 1099-1106

Algorithm-based fault tolerance (ABFT) is a low-overhead system-level concurrent error detection and fault location scheme for multiprocessor systems. We present new methods for the design of ABFT systems. Our design procedure is applicable to a wide range of systems in which processors share data elements. A feature of our design approach is that the type of checks to be used in the final system can be controlled by the system designer. We also present some new bounds on the number of checks needed in ABFT system design.
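To make the idea of a "check" concrete, here is a minimal sketch of the classic row/column checksum scheme for matrix multiplication, a common textbook instance of ABFT. It is not the design procedure proposed in this paper. The function names (`with_column_checksum`, `with_row_checksum`, `matmul`, `locate_error`) are hypothetical, and integer data is assumed so that checksums can be compared exactly.

```python
def with_column_checksum(a):
    """Append a row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]


def with_row_checksum(b):
    """Append to each row of b the sum of that row."""
    return [row + [sum(row)] for row in b]


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def locate_error(c):
    """Check a full-checksum product matrix c.

    Returns None if every row and column check passes, or (i, j) if
    exactly one row check and one column check fail, which points at a
    single faulty data element. Any other failure pattern is detected
    but not located by this simple sketch, so it raises ValueError.
    """
    n, m = len(c) - 1, len(c[0]) - 1  # size of the data part
    bad_rows = [i for i in range(n) if sum(c[i][:m]) != c[i][m]]
    bad_cols = [j for j in range(m)
                if sum(c[i][j] for i in range(n)) != c[n][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return (bad_rows[0], bad_cols[0])
    raise ValueError("error detected but not locatable with these checks")


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c = matmul(with_column_checksum(a), with_row_checksum(b))
print(locate_error(c))   # None: all checks pass
c[1][0] += 5             # inject a single faulty result element
print(locate_error(c))   # (1, 0)
```

Each row or column sum acts as one check over a set of data elements. Choosing which elements each check covers, and how many checks are needed, is the kind of design question the abstract refers to.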


Index Terms:
fault tolerant computing; reliability; multiprocessing systems; fault location; parallel architectures; system recovery; fault-tolerant multiprocessor systems; algorithm-based multiprocessor systems; concurrent error detection; fault diagnosis; algorithm-based fault tolerance; low-overhead system-level error detection; fault location scheme; ABFT systems; design procedure; data element sharing; ABFT system design
Citation:
B. Vinnakota, N.K. Jha, "Design of Algorithm-Based Fault-Tolerant Multiprocessor Systems for Concurrent Error Detection and Fault Diagnosis," IEEE Transactions on Parallel and Distributed Systems, vol. 5, no. 10, pp. 1099-1106, Oct. 1994, doi:10.1109/71.313125