Mantissa-Preserving Operations and Robust Algorithm-Based Fault Tolerance for Matrix Computations
April 1996 (vol. 45 no. 4)
pp. 408-424

Abstract—A system-level method for achieving fault tolerance called algorithm-based fault tolerance (ABFT) has been proposed by a number of researchers. Many ABFT schemes use a floating-point checksum test to detect computation errors resulting from hardware faults. This makes the tests susceptible to roundoff inaccuracies in floating-point operations, which either cause false alarms or lead to undetected errors. Thresholding of the equality test has been commonly used to avoid false alarms; however, a good threshold that minimizes false alarms without reducing the error coverage significantly is difficult to find, especially when not much is known about the input data. Furthermore, thresholded checksums will inevitably miss lower-bit errors, which can get magnified as a computation such as LU decomposition progresses. Here we develop a theory for applying integer mantissa checksum tests to "mantissa-preserving" floating-point computations. This test is not susceptible to roundoff problems and yields 100% error coverage without false alarms. For computations that are not fully mantissa-preserving, we show how to apply the mantissa checksum test to the mantissa-preserving components of the computation and the floating-point test to the rest of the computation. We apply this general methodology to matrix-matrix multiplication and LU decomposition (using the Gaussian elimination (GE) algorithm), and find that the accuracy of this new "hybrid" testing scheme is substantially higher than the floating-point test with thresholding, and also that its time overhead with respect to the floating-point test is nominal (15% and 9.5% on the average for matrix multiplication and LU decomposition, respectively). The hybrid test can also be easily applied to other computations like matrix inversion that use the GE algorithm. 
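The thresholded floating-point checksum test discussed above can be illustrated with a minimal Python sketch of a column-checksum check for a product C = AB, in the style of classic ABFT schemes. The function names and the tolerance parameter `tol` are assumptions made for illustration; they are not taken from the paper.

```python
def matmul(a, b):
    """Plain triple-loop matrix product of two lists of lists."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][t] * b[t][j] for t in range(k)) for j in range(m)]
            for i in range(n)]


def column_checksum_ok(a, b, c, tol):
    """Thresholded floating-point column-checksum test for c == a @ b.

    For each column j, the sum of column j of c should equal the
    checksum row of a (its column sums) multiplied by column j of b.
    Roundoff makes exact equality unreliable, so the two sides are
    compared within the tolerance `tol` (an assumed, hand-picked value).
    """
    k, m = len(b), len(b[0])
    a_checksum = [sum(row[t] for row in a) for t in range(k)]
    for j in range(m):
        expected = sum(a_checksum[t] * b[t][j] for t in range(k))
        actual = sum(row[j] for row in c)
        if abs(expected - actual) > tol:
            return False  # mismatch larger than the threshold
    return True
```

With a tolerance such as `1e-9`, a correct product passes and an error of magnitude 1.0 in one element is flagged, but an error of `1e-12` in one element also passes. This is the trade-off the abstract describes: a tolerance loose enough to absorb roundoff lets low-order-bit errors through, while a tolerance of zero risks false alarms.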
We prove that the mantissa-based integer checksum test for both matrix multiplication and LU decomposition is able to detect at least three errors in the floating-point multiplication component of these computations. For LU decomposition, it is also able to correct a single error in the floating-point multiplies.
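The abstract does not give the construction of the mantissa checksum test, so the following is only a sketch of the underlying idea under stated assumptions: operands are IEEE 754 doubles, each is split into an exact integer mantissa and an exponent, and the integer products of the mantissas are checked with a residue checksum. The helper names and the modulus `2**31 - 1` are hypothetical choices for illustration, not the paper's scheme.

```python
import math

MANT_BITS = 53  # significand width of an IEEE 754 double


def split(x):
    """Return (M, e) with x == M * 2**e exactly, where M is a Python int."""
    m, e = math.frexp(x)  # x == m * 2**e with 0.5 <= |m| < 1, or m == 0
    return int(math.ldexp(m, MANT_BITS)), e - MANT_BITS


def mantissa_products(xs, ys):
    """Integer mantissa product for each pair (x, y); exact, no rounding."""
    return [split(x)[0] * split(y)[0] for x, y in zip(xs, ys)]


def mantissa_checksum_ok(xs, ys, products, modulus=(1 << 31) - 1):
    """Integer residue checksum over the mantissa products.

    The checksum predicted from the operand mantissas is compared with
    the checksum of the products actually delivered. All arithmetic is
    on integers, so the comparison is an exact equality with no threshold.
    """
    predicted = sum((split(x)[0] % modulus) * (split(y)[0] % modulus)
                    for x, y in zip(xs, ys)) % modulus
    observed = sum(products) % modulus
    return predicted == observed
```

Because the modulus in this sketch is odd, flipping any single bit of one product changes it by plus or minus a power of two, which is never a multiple of the modulus, so the mismatch is always caught, including in the lowest-order bit that a thresholded floating-point test would let through. The sketch ignores exponents and rounding of the final result, which a complete scheme would have to handle separately.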

Index Terms:
Algorithm-based fault tolerance, floating-point checksum test, hybrid checksum test, mantissa checksum test, mantissa-preserving operations, matrix multiplication, LU decomposition, roundoff errors, thresholding.
Shantanu Dutt, Fikri T. Assaad, "Mantissa-Preserving Operations and Robust Algorithm-Based Fault Tolerance for Matrix Computations," IEEE Transactions on Computers, vol. 45, no. 4, pp. 408-424, April 1996, doi:10.1109/12.494099