Algorithm-Based Fault Location and Recovery for Matrix Computations on Multiprocessor Systems
November 1996 (vol. 45 no. 11)
pp. 1239-1247

Abstract—Algorithm-based fault-tolerance (ABFT) is an inexpensive method of incorporating fault-tolerance into existing applications. Applications are modified to operate on encoded data and produce encoded results which may then be checked for correctness. An attractive feature of the scheme is that it requires little or no modification to the underlying hardware or system software. Previous algorithm-based methods for developing reliable versions of numerical programs for general-purpose multicomputers have mostly concerned themselves with error detection. A truly fault-tolerant algorithm, however, needs to locate errors and recover from them once they are located. In a parallel processing environment, this corresponds to locating the faulty processors and recovering the data corrupted by the faulty processors. In this paper, we first present a general scheme for performing fault-location and recovery under the ABFT framework. Our fault model assumes that a faulty processor can corrupt all the data it possesses. The fault-location scheme is an application of system-level diagnosis theory to the ABFT framework, while the fault-recovery scheme uses ideas from coding theory to maintain redundant data and uses this to recover corrupted data in the event of processor failures. Results are presented on implementations of three numerical algorithms on a 16-processor Intel iPSC/2 hypercube multicomputer, which demonstrate acceptably low overheads for the single and double fault location and recovery cases.
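To make the encode, check, locate, and recover cycle concrete, the following is a minimal sketch of the classic row/column checksum encoding for matrix multiplication associated with Huang and Abraham. It is not the processor-level scheme of this paper: it assumes a single corrupted data element rather than a processor that corrupts all the data it holds, and it uses small integer matrices so that the checks can be exact. With floating-point data the equality tests would need a tolerance. All function names are illustrative.

```python
def matmul(a, b):
    """Plain matrix product on lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def add_column_checksum(a):
    """Encode A: append a row holding the sum of each column."""
    return a + [[sum(col) for col in zip(*a)]]


def add_row_checksum(b):
    """Encode B: append a column holding the sum of each row."""
    return [row + [sum(row)] for row in b]


def locate_and_correct(c):
    """Locate and repair one corrupted data element of a full-checksum matrix.

    Returns the repaired (row, col), or None if every check passes.
    Assumes at most one faulty data element (an assumption of this sketch).
    """
    n, m = len(c) - 1, len(c[0]) - 1  # size of the data part
    bad_rows = [i for i in range(n) if sum(c[i][:m]) != c[i][m]]
    bad_cols = [j for j in range(m)
                if sum(c[i][j] for i in range(n)) != c[n][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) != 1 or len(bad_cols) != 1:
        raise ValueError("error pattern is not a single data-element fault")
    i, j = bad_rows[0], bad_cols[0]
    # Recovery: rebuild the element from its row checksum and the good entries.
    c[i][j] = c[i][m] - sum(c[i][k] for k in range(m) if k != j)
    return (i, j)


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]

# The product of the encoded operands carries both row and column checksums.
c_full = matmul(add_column_checksum(a), add_row_checksum(b))
expected = [row[:] for row in c_full]

c_full[0][1] += 9                      # inject a fault into one result element
repaired = locate_and_correct(c_full)  # intersect the failing row and column
```

The failing row check and the failing column check intersect at the corrupted element, which is the location step; the surviving checksum then supplies enough redundancy to rebuild the value, which is the recovery step.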

Index Terms:
Algorithm-based fault-tolerance, parallel numerical algorithms, fault location, fault recovery, system-level diagnosis, coding theory.
Citation:
Amber Roy-Chowdhury, Prithviraj Banerjee, "Algorithm-Based Fault Location and Recovery for Matrix Computations on Multiprocessor Systems," IEEE Transactions on Computers, vol. 45, no. 11, pp. 1239-1247, Nov. 1996, doi:10.1109/12.544480