New Encoding/Decoding Methods for Designing Fault-Tolerant Matrix Operations
September 1996 (vol. 7 no. 9)
pp. 931-938

Abstract: Algorithm-based fault tolerance (ABFT) can provide low-cost error protection for array processors and multiprocessor systems. Several ABFT techniques based on weighted checksums have been proposed for designing fault-tolerant matrix operations. In these schemes, encoding and decoding use multiplications or divisions, so the overhead is high. This paper proposes new encoding/decoding methods for designing fault-tolerant matrix operations. Their distinguishing feature is that encoding and decoding use only additions and subtractions. New algorithms are also proposed to construct error detecting/correcting codes with minimum Hamming distance 3 and 4. We show that these new coding schemes drastically reduce the overhead introduced by incorporating fault tolerance.
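
To make the general idea of ABFT for matrix operations concrete, the following is a minimal sketch of the classical unweighted row/column checksum scheme for matrix multiplication, in which a single erroneous product element is located and corrected using only additions and subtractions. It is not the coding scheme proposed in this paper, whose construction is not given in the abstract. The function names are hypothetical, and the sketch assumes integer matrices and at most one error, located in a data element rather than in a checksum element.

```python
def column_checksum(a):
    """Append a row holding the sum of each column of a (encoding by additions)."""
    return a + [[sum(col) for col in zip(*a)]]


def row_checksum(b):
    """Append a column holding the sum of each row of b (encoding by additions)."""
    return [row + [sum(row)] for row in b]


def matmul(a, b):
    """Plain matrix product of two lists-of-lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def locate_and_correct(c):
    """Check a full checksum matrix c and repair one bad data element in place.

    Assumption: at most one error, and it lies in the data part of c.
    Returns the (row, col) that was corrected, or None if c is consistent.
    """
    n, m = len(c) - 1, len(c[0]) - 1
    bad_rows = [i for i in range(n) if sum(c[i][:m]) != c[i][m]]
    bad_cols = [j for j in range(m)
                if sum(c[i][j] for i in range(n)) != c[n][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        # Row checksum minus the other row elements: subtraction only.
        c[i][j] = c[i][m] - (sum(c[i][:m]) - c[i][j])
        return (i, j)
    raise ValueError("error pattern outside the single-error assumption")


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c = matmul(column_checksum(a), row_checksum(b))  # full checksum matrix
c[0][1] = 99                                     # inject a single fault
fixed_at = locate_and_correct(c)
```

With floating-point data, the exact equality tests would need to be replaced by comparisons against a tolerance, since rounding alone can make the checksums disagree.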

Index Terms:
Array processors, concurrent error detection/correction, error detecting/correcting codes, fault tolerance, multiprocessor systems.
Citation:
D.L. Tao, C.R.P. Hartmann, Yunghsing S. (Sam) Han, "New Encoding/Decoding Methods for Designing Fault-Tolerant Matrix Operations," IEEE Transactions on Parallel and Distributed Systems, vol. 7, no. 9, pp. 931-938, Sept. 1996, doi:10.1109/71.536937