Algorithm-Based Error-Detection Schemes for Iterative Solution of Partial Differential Equations
April 1996 (vol. 45 no. 4)
pp. 394-407

Abstract—Algorithm-based fault tolerance is an inexpensive method of achieving fault tolerance without requiring any hardware modifications. Algorithm-based schemes have been proposed for a wide variety of numerical applications. However, for a particular class of numerical applications, namely those involving the iterative solution of linear systems arising from discretization of various PDEs, there exist almost no fault-tolerant algorithms in the literature. In this paper, we first describe an error-detecting version of a parallel algorithm for iteratively solving the Laplace equation over a rectangular grid. This error-detecting algorithm is based on the popular successive overrelaxation scheme with red-black ordering. We use the Laplace equation merely as a vehicle for discussion; later in the paper we show how to modify the algorithm to devise error-detecting iterative schemes for solving linear systems arising from discretizations of other PDEs, such as the Poisson equation and a variant of the Laplace equation with a mixed derivative term. We also discuss a modification of the basic scheme to handle situations where the underlying solution domain is not rectangular. We then discuss a somewhat different error-detecting algorithm for iterative solution of PDEs which can be expected to yield better error coverage.

We also present a new way of dealing with the roundoff errors which complicate the check phase of algorithm-based schemes. Our approach is based on error analysis incorporating some simplifications and gives high fault coverage and no false alarms for a large variety of data sets. We report experimental results on the error coverage and performance overhead of our algorithm-based error-detection schemes on an Intel iPSC/2 hypercube multiprocessor.

The timing overheads of our error-detecting algorithms over the basic iterative algorithms involving no error detection decrease with increasing problem dimension and become small for large data sizes.
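
As an illustration of the kind of check the abstract describes, the following is a minimal sketch of one checksum test on a red-black SOR half-sweep for the Laplace equation. It is not the scheme from the paper, whose details the abstract does not give. The function names, the checksum invariant (the sum of updated values of one color is predicted from pre-sweep values), and the tolerance constant are all assumptions made for this example. The tolerance is a crude worst-case roundoff bound, not the simplified error analysis the authors propose.

```python
import sys

EPS = sys.float_info.epsilon  # double-precision unit roundoff scale


def make_grid(n, top=1.0):
    """Hypothetical test problem: n x n grid, zero except a fixed top boundary."""
    u = [[0.0] * n for _ in range(n)]
    u[0] = [top] * n
    return u


def checked_half_sweep(u, color, omega, fault=None):
    """SOR-update interior points with (i + j) % 2 == color, then check the result.

    Returns True if the checksum test passes, False if an error is flagged.
    `fault` is an optional ((i, j), delta) pair simulating a transient error.
    """
    n = len(u)
    pts = [(i, j) for i in range(1, n - 1) for j in range(1, n - 1)
           if (i + j) % 2 == color]

    # Predict the post-sweep checksum from pre-sweep values. With red-black
    # ordering, the neighbours of a point all have the opposite color, so
    # they do not change during this half-sweep and the update is linear.
    old = nbrs = mag = 0.0
    for i, j in pts:
        nb = (u[i - 1][j], u[i + 1][j], u[i][j - 1], u[i][j + 1])
        old += u[i][j]
        nbrs += sum(nb)
        mag += abs(u[i][j]) + sum(abs(v) for v in nb)
    predicted = (1.0 - omega) * old + 0.25 * omega * nbrs

    # The half-sweep itself.
    for i, j in pts:
        s = u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1]
        u[i][j] = (1.0 - omega) * u[i][j] + 0.25 * omega * s
    if fault is not None:
        (fi, fj), delta = fault
        u[fi][fj] += delta  # simulated error in one updated value

    # Compare against the observed checksum, allowing for roundoff.
    observed = sum(u[i][j] for i, j in pts)
    tol = 4.0 * len(pts) * EPS * mag  # assumed, deliberately loose bound
    return abs(observed - predicted) <= tol


if __name__ == "__main__":
    u = make_grid(10)
    for _ in range(20):
        assert checked_half_sweep(u, 0, 1.5) and checked_half_sweep(u, 1, 1.5)
    print(checked_half_sweep(u, 0, 1.5, fault=((3, 3), 1e-3)))
```

In this sketch a fault is only caught if it perturbs a point of the color being updated in that half-sweep. An exact equality test in place of the tolerance would raise false alarms from roundoff alone, which is the difficulty the abstract refers to.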

Index Terms:
Algorithm-based fault-tolerance, parallel algorithms, partial differential equations, error analysis, fault injection.
Amber Roy-Chowdhury, Nikolas Bellas, Prithviraj Banerjee, "Algorithm-Based Error-Detection Schemes for Iterative Solution of Partial Differential Equations," IEEE Transactions on Computers, vol. 45, no. 4, pp. 394-407, April 1996, doi:10.1109/12.494098