Issue No.05 - May (2003 vol.52)
Michael Turmon, IEEE
Daniel S. Katz, IEEE
<p><b>Abstract</b>—We describe and test a software approach to fault detection in common numerical algorithms. Such result checking or algorithm-based fault tolerance (ABFT) methods may be used, for example, to overcome single-event upsets in computational hardware or to detect errors in complex, high-efficiency implementations of the algorithms. Following earlier work, we use checksum methods to validate results returned by a numerical subroutine operating subject to unpredictable errors in data. We consider common matrix and Fourier algorithms which return results satisfying a necessary condition having a linear form; the checksum tests compliance with this condition. We discuss the theory and practice of setting numerical tolerances to separate errors caused by a fault from those inherent in finite-precision floating-point calculations. We concentrate on comprehensively defining and evaluating tests having various accuracy/computational burden tradeoffs, and we emphasize average-case algorithm behavior rather than using worst-case upper bounds on error.</p>
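As a rough illustration of the checksum idea in the abstract, the sketch below checks a matrix product C = AB against the linear necessary condition Cw = A(Bw) for a weight vector w, and compares the residual with a tolerance proportional to machine epsilon. The function names, the all-ones choice of w, and the `scale` factor are assumptions made for this example, not the tests or tolerances defined in the paper. The tolerance is a simple norm-based heuristic, not a derived bound.

```python
import sys


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def matmul(A, B):
    """Plain triple-loop product, standing in for the routine under test."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def norm2(v):
    """Euclidean norm of a vector."""
    return sum(x * x for x in v) ** 0.5


def frob(M):
    """Frobenius norm of a matrix."""
    return sum(x * x for row in M for x in row) ** 0.5


def checksum_ok(A, B, C, w=None, scale=100.0):
    """Return True if C passes the checksum test for C = A B.

    The test verifies the linear necessary condition C w == A (B w).
    `scale` is a hypothetical safety factor chosen for this sketch.
    """
    if w is None:
        w = [1.0] * len(B[0])  # all-ones weights: plain row sums of C
    lhs = matvec(C, w)               # checksum of the returned result
    rhs = matvec(A, matvec(B, w))    # same checksum computed from the inputs
    residual = norm2([l - r for l, r in zip(lhs, rhs)])
    # Tolerance meant to absorb ordinary floating-point roundoff.
    tol = scale * sys.float_info.epsilon * frob(A) * frob(B) * norm2(w)
    return residual <= tol
```

Under these assumptions, a correctly computed product passes the test, while a single corrupted entry of C (for example, adding 1e-3 to one element) shifts the checksum well beyond the roundoff-level tolerance and is flagged. The check costs three matrix-vector products, which is small next to the product itself.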
Algorithm-based fault tolerance, result checking, error analysis, aerospace, parallel numerical algorithms.
Michael Turmon, Daniel S. Katz, John Z. Lou, "Tests and Tolerances for High-Performance Software-Implemented Fault Detection", IEEE Transactions on Computers, vol. 52, no. 5, pp. 579-591, May 2003, doi:10.1109/TC.2003.1197125