An Efficient Algorithm-Based Fault Tolerance Design Using the Weighted Data-Check Relationship
April 2001 (vol. 50 no. 4)
pp. 371-383

Abstract: VLSI-based processor arrays have been widely used for computation-intensive applications such as matrix and graph algorithms. Algorithm-based fault tolerance designs employing various encoding/decoding schemes have been proposed for such systems to effectively tolerate operation-time faults. In this paper, we propose an efficient algorithm-based fault tolerance design using the weighted data-check relationship, in which the checks are computed from weighted data. The relationship is systematically defined as a new $(n,k,N_w)$ Hamming checksum code, where $n$ is the size of the code word, $k$ is the number of information elements in the code word, and $N_w$ is the number of weights employed. The proposed design is evaluated for various weights in terms of time and hardware overhead, overflow probability, and round-off error. Two schemes, employing the $(n,k,2)$ and $(n,k,3)$ Hamming checksum codes, are illustrated using important matrix computations. Comparison with other schemes reveals that the $(n,k,3)$ Hamming checksum scheme is very efficient, while its hardware overhead is small.
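
The general idea of computing checks from weighted data can be illustrated with a minimal sketch. This is not the paper's $(n,k,N_w)$ Hamming checksum code, whose construction the abstract does not give; it is a generic two-check weighted checksum, assuming integer data, at most one faulty element per code word, and the hypothetical weights 1 and $i+1$ for element $i$. The function names `encode` and `check_and_correct` are introduced only for this illustration.

```python
def encode(data):
    """Append two checks: a plain sum and an index-weighted sum (weights 1..k)."""
    plain = sum(data)
    weighted = sum((i + 1) * x for i, x in enumerate(data))
    return list(data) + [plain, weighted]


def check_and_correct(word):
    """Return (data, faulty_index), assuming at most one faulty element.

    faulty_index is None when no error is found; an index >= k means one of
    the two check elements is faulty and the data is intact.
    """
    k = len(word) - 2
    data = list(word[:k])
    # Syndromes: recomputed checks minus stored checks.
    d1 = sum(data) - word[k]
    d2 = sum((i + 1) * x for i, x in enumerate(data)) - word[k + 1]
    if d1 == 0 and d2 == 0:
        return data, None
    if d1 != 0 and d2 == 0:
        return data, k          # stored plain check is faulty
    if d1 == 0 and d2 != 0:
        return data, k + 1      # stored weighted check is faulty
    # A single data error e at position j gives d1 = e and d2 = (j + 1) * e,
    # so the ratio of the syndromes locates the faulty element.
    if d2 % d1 == 0 and 1 <= d2 // d1 <= k:
        j = d2 // d1 - 1
        data[j] -= d1
        return data, j
    raise ValueError("syndromes inconsistent with a single error")


word = encode([3, 1, 4, 1, 5])
word[2] = 9                      # inject a single fault
print(check_and_correct(word))
```

The sketch uses integers to sidestep round-off; with floating-point data the zero tests on the syndromes would need a tolerance, and large weights increase the risk of overflow, which are the two costs the abstract says the paper evaluates.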

Index Terms:
Algorithm-based fault tolerance, Hamming correcting code, matrix computations, overflow, round-off error, VLSI processor array.
Citation:
Hee Yong Youn, Choong Gun Oh, Hyunseung Choo, Jin-Wook Chung, Dongman Lee, "An Efficient Algorithm-Based Fault Tolerance Design Using the Weighted Data-Check Relationship," IEEE Transactions on Computers, vol. 50, no. 4, pp. 371-383, April 2001, doi:10.1109/12.919281