Construction of Check Sets for Algorithm-Based Fault Tolerance
June 1994 (vol. 43 no. 6)
pp. 641-650

Algorithm-based fault tolerance (ABFT) is a popular approach to achieve fault and error detection in multiprocessor systems. The design problem for ABFT is concerned with the construction of a check set of minimum cardinality that detects a specified number of errors or faults. Previous work on this problem has assumed an a priori bound on the size of a check. We motivate and carry out an investigation of the problem without the bounded check size assumption. We establish upper and lower bounds on the number of checks needed to detect a given number of errors. The upper bounds are obtained through new schemes which are easy to implement, and the lower bounds are established using new types of arguments. These bounds are sharply different from those previously established under the bounded check size model. We also show that unlike error detection, the design problem for fault detection is NP-hard even for detecting only one fault.
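
To make the design problem concrete, the following is a minimal sketch of a brute-force verifier for a candidate check set. It rests on assumptions that are not stated in the abstract: data elements are numbered 0 to n-1, a check is a set of data elements, and a check raises an alarm only when between 1 and h of its elements are erroneous (h is a hypothetical parameter here, and the function names are illustrative). The paper's actual model and constructions may differ.

```python
from itertools import combinations


def fires(check, errors, h):
    """Assumed check model: a check raises an alarm iff it covers
    at least 1 and at most h erroneous data elements."""
    return 1 <= len(check & errors) <= h


def detects_up_to(checks, n, t, h=1):
    """Return True if every nonempty pattern of at most t erroneous
    data elements (out of n) makes at least one check fire.

    Exhaustive search, so only practical for small n and t."""
    for k in range(1, t + 1):
        for pattern in combinations(range(n), k):
            errors = frozenset(pattern)
            if not any(fires(c, errors, h) for c in checks):
                return False
    return True


# One large check over all four elements catches any single error,
# but under the assumed model (h = 1) it misses double errors.
single = [frozenset({0, 1, 2, 3})]
print(detects_up_to(single, n=4, t=1))
print(detects_up_to(single, n=4, t=2))

# Four overlapping checks of size two arranged in a ring.
ring = [frozenset({i, (i + 1) % 4}) for i in range(4)]
print(detects_up_to(ring, n=4, t=3))
```

Under these assumptions, finding a check set of minimum cardinality amounts to finding the fewest sets for which `detects_up_to` returns True for the required number of errors; the sketch only verifies a given candidate and does not construct one.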

Index Terms:
multiprocessing systems; computational complexity; error detection; fault tolerant computing; check sets; algorithm-based fault tolerance; multiprocessor systems; ABFT; minimum cardinality; bounded check size assumption; bounded check size model; fault detection; design problem; NP-hard.
Citation:
D. Gu, D.J. Rosenkrantz, S.S. Ravi, "Construction of Check Sets for Algorithm-Based Fault Tolerance," IEEE Transactions on Computers, vol. 43, no. 6, pp. 641-650, June 1994, doi:10.1109/12.286298