Issue No.11 - November (1996 vol.45)
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/12.544480
<p><b>Abstract</b>—Algorithm-based fault-tolerance (ABFT) is an inexpensive method of incorporating fault-tolerance into existing applications. Applications are modified to operate on encoded data and produce encoded results, which may then be checked for correctness. An attractive feature of the scheme is that it requires little or no modification to the underlying hardware or system software. Previous algorithm-based methods for developing reliable versions of numerical programs for general-purpose multicomputers have mostly been concerned with error detection. A truly fault-tolerant algorithm, however, needs to locate errors and then recover from them. In a parallel processing environment, this corresponds to locating the faulty processors and recovering the data they corrupted. In this paper, we first present a general scheme for performing fault location and recovery under the ABFT framework. Our fault model assumes that a faulty processor can corrupt all the data it possesses. The fault-location scheme is an application of system-level diagnosis theory to the ABFT framework, while the fault-recovery scheme uses ideas from coding theory to maintain redundant data, which is then used to recover corrupted data in the event of processor failures. Results are presented on implementations of three numerical algorithms on a 16-processor Intel iPSC/2 hypercube multicomputer, which demonstrate acceptably low overheads for the single- and double-fault location and recovery cases.</p>
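As a rough illustration of what "operating on encoded data" and "recovering corrupted data" can mean, the following minimal sketch uses the classic row/column checksum encoding for matrix multiplication. It is not the scheme of this paper, which locates faulty processors and assumes a faulty processor can corrupt all of its data. The sketch only handles a single corrupted element, uses integer entries to avoid floating-point tolerances, and all function names are hypothetical.

```python
# Illustrative sketch only: classic checksum-encoded matrix multiplication.
# Assumptions: integer matrices, at most one corrupted data element.
# All names here are hypothetical and do not come from the paper.

def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def add_column_checksum(a):
    """Append a row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]

def add_row_checksum(b):
    """Append to each row of b the sum of that row."""
    return [row + [sum(row)] for row in b]

def locate_and_correct(c_full):
    """Check a full-checksum matrix; fix one bad data element in place.

    Returns None if all checks pass, or the (row, col) of the corrected
    element. Raises ValueError for patterns this sketch cannot handle.
    """
    n = len(c_full) - 1       # number of data rows
    m = len(c_full[0]) - 1    # number of data columns
    bad_rows = [i for i in range(n)
                if sum(c_full[i][:m]) != c_full[i][m]]
    bad_cols = [j for j in range(m)
                if sum(c_full[i][j] for i in range(n)) != c_full[n][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        # Rebuild the element from the row checksum and the other entries.
        others = sum(c_full[i][k] for k in range(m) if k != j)
        c_full[i][j] = c_full[i][m] - others
        return (i, j)
    raise ValueError("error pattern not correctable by this simple sketch")

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
# The product of the encoded operands is itself a full-checksum matrix.
C_full = matmul(add_column_checksum(A), add_row_checksum(B))
```

Corrupting one data element of `C_full` and calling `locate_and_correct(C_full)` returns that element's position and restores its value. Handling whole-processor failures, as the abstract describes, would require redundancy organized per processor rather than per element.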
Index Terms: Algorithm-based fault-tolerance, parallel numerical algorithms, fault location, fault recovery, system-level diagnosis, coding theory.
Amber Roy-Chowdhury, "Algorithm-Based Fault Location and Recovery for Matrix Computations on Multiprocessor Systems", IEEE Transactions on Computers, vol. 45, no. 11, pp. 1239-1247, November 1996, doi:10.1109/12.544480