
Shantanu Dutt and Fikri T. Assaad, "Mantissa-Preserving Operations and Robust Algorithm-Based Fault Tolerance for Matrix Computations," IEEE Transactions on Computers, vol. 45, no. 4, pp. 408-424, April 1996. IEEE Computer Society, Los Alamitos, CA, USA. DOI: http://doi.ieeecomputersociety.org/10.1109/12.494099

Index Terms: Algorithm-based fault tolerance, floating-point checksum test, hybrid checksum test, mantissa checksum test, mantissa-preserving operations, matrix multiplication, LU decomposition, roundoff errors, thresholding.

Abstract: A system-level method for achieving fault tolerance called algorithm-based fault tolerance (ABFT) has been proposed by a number of researchers. Many ABFT schemes use a floating-point checksum test to detect computation errors resulting from hardware faults. This makes the tests susceptible to roundoff inaccuracies in floating-point operations, which either cause false alarms or lead to undetected errors. Thresholding of the equality test has been commonly used to avoid false alarms; however, a good threshold that minimizes false alarms without significantly reducing the error coverage is difficult to find, especially when not much is known about the input data. Furthermore, thresholded checksums will inevitably miss lower-bit errors, which can be magnified as a computation such as LU decomposition progresses. Here we develop a theory for applying integer mantissa checksum tests to "mantissa-preserving" floating-point computations. This test is not susceptible to roundoff problems and yields 100% error coverage without false alarms. For computations that are not fully mantissa-preserving, we show how to apply the mantissa checksum test to the mantissa-preserving components of the computation and the floating-point test to the rest of the computation. We apply this general methodology to matrix-matrix multiplication and LU decomposition (using the Gaussian elimination (GE) algorithm), and find that the accuracy of this new "hybrid" testing scheme is substantially higher than that of the floating-point test with thresholding, and also that its time overhead with respect to the floating-point test is nominal (15% and 9.5% on average for matrix multiplication and LU decomposition, respectively). The hybrid test can also be easily applied to other computations, such as matrix inversion, that use the GE algorithm.
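To make the thresholding trade-off concrete, the following is a minimal Python sketch of a column-checksum test for matrix multiplication with a tolerance. It is an illustration under stated assumptions, not the scheme evaluated in the paper: the function names, the sample matrices, the injected errors, and the threshold value 1e-9 are all hypothetical choices.

```python
def matmul(a, b):
    """Plain triple-loop product C = A * B on lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def column_checksum_residuals(a, b, c):
    """For each column j, compare the column sum of C with the value
    predicted from the column sums of A: sum_k (sum_i A[i][k]) * B[k][j]."""
    col_sums_a = [sum(row[k] for row in a) for k in range(len(a[0]))]
    residuals = []
    for j in range(len(b[0])):
        predicted = sum(col_sums_a[k] * b[k][j] for k in range(len(b)))
        observed = sum(row[j] for row in c)
        residuals.append(abs(predicted - observed))
    return residuals


def checksum_test(a, b, c, threshold):
    """Thresholded floating-point checksum test: pass if every residual
    is at most `threshold` (an assumed, hand-picked tolerance)."""
    return all(r <= threshold for r in column_checksum_residuals(a, b, c))


# Hypothetical inputs and tolerance, chosen only for illustration.
A = [[0.1, 0.2], [0.3, 0.4]]
B = [[0.5, 0.6], [0.7, 0.8]]
THRESHOLD = 1e-9

C = matmul(A, B)
fault_free_passes = checksum_test(A, B, C, THRESHOLD)

# Inject a large error into one element: the test should flag it.
C_large = [row[:] for row in C]
C_large[0][0] += 1.0
large_error_detected = not checksum_test(A, B, C_large, THRESHOLD)

# Inject an error far below the threshold: the test lets it through.
C_small = [row[:] for row in C]
C_small[0][0] += 1e-12
small_error_missed = checksum_test(A, B, C_small, THRESHOLD)
```

With this threshold the fault-free product passes and the large injected error is caught, but the 1e-12 perturbation passes as well, which is the kind of lower-bit error the abstract says thresholded checksums miss. Tightening the threshold toward zero would catch it, at the risk of flagging ordinary roundoff as a fault.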
We prove that the mantissa-based integer checksum test for both matrix multiplication and LU decomposition is able to detect at least three errors in the floating-point multiplication component of these computations. For LU decomposition, it is also able to correct a single error in the floating-point multiplies.
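The abstract does not spell out how the mantissa checksum is constructed, so the sketch below only illustrates the general idea of an exact integer checksum over mantissa products. It assumes IEEE 754 doubles, ignores exponents, signs of zero, and rounding of the final floating-point result, and uses hypothetical names and data throughout. It checks the products of one multiplier with a row of values (the shape of the multiplies in a Gaussian elimination row update) through the integer identity sum_k (m_k * m) = (sum_k m_k) * m.

```python
import math

MANTISSA_BITS = 53  # significand width of an IEEE 754 double (assumed format)


def integer_mantissa(x):
    """Return (m, e) with x == m * 2**e exactly, where m is an integer."""
    frac, exp = math.frexp(x)  # x == frac * 2**exp, 0.5 <= |frac| < 1
    return int(math.ldexp(frac, MANTISSA_BITS)), exp - MANTISSA_BITS


def mantissa_products(row, multiplier):
    """Exact integer product of each row mantissa with the multiplier mantissa."""
    m_mult, _ = integer_mantissa(multiplier)
    return [integer_mantissa(x)[0] * m_mult for x in row]


def mantissa_checksum_ok(row, multiplier, products):
    """Integer checksum test: the sum of the products must equal
    (sum of row mantissas) * (multiplier mantissa), with exact equality."""
    m_mult, _ = integer_mantissa(multiplier)
    expected = sum(integer_mantissa(x)[0] for x in row) * m_mult
    return sum(products) == expected


# Hypothetical data, chosen only for illustration.
row = [0.1, 2.5, -3.75, 1e-3]
multiplier = 0.3

products = mantissa_products(row, multiplier)
fault_free = mantissa_checksum_ok(row, multiplier, products)

# Flip the least significant bit of one product to mimic a hardware fault.
faulty = list(products)
faulty[2] ^= 1
lsb_fault_detected = not mantissa_checksum_ok(row, multiplier, faulty)
```

Because Python integers are arbitrary precision, both sides of the comparison are computed without roundoff, so the test needs no threshold and a single flipped low-order bit changes the checksum. A real implementation would also have to account for exponents and for the rounding applied when each product is stored back as a floating-point number, which this sketch leaves out.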