Tests and Tolerances for High-Performance Software-Implemented Fault Detection
May 2003 (vol. 52 no. 5)
pp. 579-591

Abstract—We describe and test a software approach to fault detection in common numerical algorithms. Such result checking or algorithm-based fault tolerance (ABFT) methods may be used, for example, to overcome single-event upsets in computational hardware or to detect errors in complex, high-efficiency implementations of the algorithms. Following earlier work, we use checksum methods to validate results returned by a numerical subroutine operating subject to unpredictable errors in data. We consider common matrix and Fourier algorithms which return results satisfying a necessary condition having a linear form; the checksum tests compliance with this condition. We discuss the theory and practice of setting numerical tolerances to separate errors caused by a fault from those inherent in finite-precision floating-point calculations. We concentrate on comprehensively defining and evaluating tests having various accuracy/computational burden tradeoffs, and we emphasize average-case algorithm behavior rather than using worst-case upper bounds on error.
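As a minimal sketch of the kind of checksum test the abstract describes, consider a matrix product C = AB. A correct result satisfies the linear condition Cw = A(Bw) for any checksum vector w, so comparing the two sides costs only a few matrix-vector products. The function names, the all-ones checksum vector, and in particular the norm-based tolerance with its `scale` factor are illustrative assumptions, not the tests or tolerances defined in the paper.

```python
import random
import sys


def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def matmul(A, B):
    """Plain triple-loop matrix product; stands in for the routine under test."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def inf_norm(M):
    """Infinity norm: the largest absolute row sum."""
    return max(sum(abs(x) for x in row) for row in M)


def checksum_ok(A, B, C, scale=10.0):
    """Return True if C passes the checksum test C w == A (B w) within tolerance.

    The tolerance is a simple heuristic (an assumption of this sketch):
    scale * dimension * machine epsilon * ||A|| * ||B||.
    """
    w = [1.0] * len(B[0])                      # all-ones checksum vector
    lhs = matvec(C, w)                         # checksum of the returned result
    rhs = matvec(A, matvec(B, w))              # same quantity, computed cheaply
    residual = max(abs(l - r) for l, r in zip(lhs, rhs))
    dim = max(len(A), len(B), len(B[0]))
    tol = scale * dim * sys.float_info.epsilon * inf_norm(A) * inf_norm(B)
    return residual <= tol


# Demo: a fault-free product passes; a single corrupted entry is flagged.
random.seed(0)
n = 8
A = [[random.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
B = [[random.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
C = matmul(A, B)

C_faulty = [row[:] for row in C]
C_faulty[2][3] += 1e-3                         # simulated fault in one entry

print("fault-free passes:", checksum_ok(A, B, C))
print("faulty passes:    ", checksum_ok(A, B, C_faulty))
```

The tolerance is what separates rounding error from a fault: set it too tight and correct results are rejected, too loose and small faults slip through. Note also that a single checksum vector yields only a necessary condition, so errors that happen to cancel in the weighted row sum would go undetected by this sketch.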

Index Terms:
Algorithm-based fault tolerance, result checking, error analysis, aerospace, parallel numerical algorithms.
Citation:
Michael Turmon, Robert Granat, Daniel S. Katz, John Z. Lou, "Tests and Tolerances for High-Performance Software-Implemented Fault Detection," IEEE Transactions on Computers, vol. 52, no. 5, pp. 579-591, May 2003, doi:10.1109/TC.2003.1197125