Issue No. 04 - April (1996 vol. 45)
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/12.494098
<p><b>Abstract</b>—Algorithm-based fault tolerance is an inexpensive method of achieving fault tolerance without requiring any hardware modifications. Algorithm-based schemes have been proposed for a wide variety of numerical applications. However, for a particular class of numerical applications, namely those involving the iterative solution of linear systems arising from discretization of various PDEs, there exist almost no fault-tolerant algorithms in the literature. In this paper, we first describe an error-detecting version of a parallel algorithm for iteratively solving the Laplace equation over a rectangular grid. This error-detecting algorithm is based on the popular successive overrelaxation scheme with red-black ordering. We use the Laplace equation merely as a vehicle for discussion; later in the paper we show how to modify the algorithm to devise error-detecting iterative schemes for solving linear systems arising from discretizations of other PDEs, such as the Poisson equation and a variant of the Laplace equation with a mixed derivative term. We also discuss a modification of the basic scheme to handle situations where the underlying solution domain is not rectangular. We then discuss a somewhat different error-detecting algorithm for iterative solution of PDEs which can be expected to yield better error coverage.</p><p>We also present a new way of dealing with the roundoff errors which complicate the check phase of algorithm-based schemes. Our approach is based on error analysis incorporating some simplifications and gives high fault coverage and no false alarms for a large variety of data sets. 
We report experimental results on the error coverage and performance overhead of our algorithm-based error-detection schemes on an Intel iPSC/2 hypercube multiprocessor.</p><p>The timing overheads of our error-detecting algorithms over the basic iterative algorithms involving no error detection decrease with increasing problem dimension and become small for large data sizes.</p>
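To make the idea of an error-detecting red-black SOR sweep concrete, here is a minimal single-process sketch for the Laplace equation on a rectangular grid. It is an illustration under stated assumptions, not the paper's algorithm: the abstract does not specify the check, so the checksum invariant used here (comparing the sum of the updated points against a prediction computed from the old values by a separate route) is one plausible choice. The fixed relative tolerance `rel_tol` is a simple stand-in for the paper's roundoff error analysis, and the names `color_mask`, `checked_sweep`, and the `fault` parameter are hypothetical.

```python
import numpy as np

def color_mask(shape, parity):
    """Boolean mask of interior grid points whose (i + j) % 2 equals parity."""
    i, j = np.indices(shape)
    mask = (i + j) % 2 == parity
    mask[0, :] = mask[-1, :] = False   # boundary rows are never updated
    mask[:, 0] = mask[:, -1] = False   # boundary columns are never updated
    return mask

def checked_sweep(u, omega, parity, fault=None, rel_tol=1e-10):
    """Run one SOR half-sweep over one color in place.

    Returns True if the checksum of the updated points matches the
    prediction within the (assumed) relative tolerance.
    """
    mask = color_mask(u.shape, parity)

    # Predicted checksum from the OLD values, computed by a separate route:
    # each grid point is weighted by how many updated points it neighbors.
    weight = np.zeros(u.shape)
    weight[:-1, :] += mask[1:, :]
    weight[1:, :] += mask[:-1, :]
    weight[:, :-1] += mask[:, 1:]
    weight[:, 1:] += mask[:, :-1]
    predicted = (1 - omega) * u[mask].sum() + omega / 4 * (weight * u).sum()
    scale = np.abs(u).sum()

    # The actual update. Every neighbor of an updated point has the other
    # color or lies on the boundary, so all updates use old values only.
    nbr = np.zeros(u.shape)
    nbr[1:-1, 1:-1] = u[:-2, 1:-1] + u[2:, 1:-1] + u[1:-1, :-2] + u[1:-1, 2:]
    u[mask] = (1 - omega) * u[mask] + omega / 4 * nbr[mask]

    # Optional fault injection (hypothetical): corrupt one updated value.
    if fault is not None:
        i, j, delta = fault
        u[i, j] += delta

    # Check phase: tolerate roundoff, flag anything larger.
    return abs(u[mask].sum() - predicted) <= rel_tol * max(scale, 1.0)
```

A full iteration calls `checked_sweep(u, omega, 0)` and then `checked_sweep(u, omega, 1)`. A single sum per color cannot locate the faulty point, and errors that cancel in the sum would go undetected; in a parallel setting one might keep one such checksum per processor subgrid instead.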
Algorithm-based fault-tolerance, parallel algorithms, partial differential equations, error analysis, fault injection.
A. Roy-Chowdhury, P. Banerjee and N. Bellas, "Algorithm-Based Error-Detection Schemes for Iterative Solution of Partial Differential Equations," in IEEE Transactions on Computers, vol. 45, no. 4, pp. 394-407, 1996.