
Michael Turmon, Robert Granat, Daniel S. Katz, John Z. Lou, "Tests and Tolerances for High-Performance Software-Implemented Fault Detection," IEEE Transactions on Computers, vol. 52, no. 5, pp. 579-591, May 2003.
Abstract: We describe and test a software approach to fault detection in common numerical algorithms. Such result-checking or algorithm-based fault tolerance (ABFT) methods may be used, for example, to overcome single-event upsets in computational hardware or to detect errors in complex, high-efficiency implementations of the algorithms. Following earlier work, we use checksum methods to validate results returned by a numerical subroutine operating subject to unpredictable errors in data. We consider common matrix and Fourier algorithms which return results satisfying a necessary condition having a linear form; the checksum tests compliance with this condition. We discuss the theory and practice of setting numerical tolerances to separate errors caused by a fault from those inherent in finite-precision floating-point calculations. We concentrate on comprehensively defining and evaluating tests having various accuracy/computational burden tradeoffs, and we emphasize average-case algorithm behavior rather than using worst-case upper bounds on error.
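As a rough illustration of the checksum idea described in the abstract, the sketch below checks a matrix product C = AB against the linear condition Cw = A(Bw) for a weight vector w, and compares the discrepancy with a tolerance scaled by machine epsilon. The function names, the all-ones weight vector, the `safety` factor, and the norm-based tolerance are assumptions chosen for this example; they are not the tests or tolerances defined in the paper, which emphasizes average-case behavior rather than a bound of this kind.

```python
import random
import sys


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def matmul(A, B):
    """Plain triple-loop matrix product, standing in for the routine under test."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def norm_inf_mat(M):
    """Infinity norm of a matrix: the largest absolute row sum."""
    return max(sum(abs(x) for x in row) for row in M)


def checksum_test(A, B, C, w, safety=10.0):
    """Return True if C is consistent with C w == A (B w) up to rounding error.

    The tolerance is a hypothetical, conservative choice for this sketch:
    safety * n * eps * ||A|| * ||B|| * ||w||, using infinity norms.
    """
    lhs = matvec(C, w)             # checksum of the returned result
    rhs = matvec(A, matvec(B, w))  # the same checksum computed cheaply in O(n^2)
    discrepancy = max(abs(l - r) for l, r in zip(lhs, rhs))
    n = max(len(B), len(w))
    eps = sys.float_info.epsilon
    w_norm = max(abs(x) for x in w)
    tol = safety * n * eps * norm_inf_mat(A) * norm_inf_mat(B) * w_norm
    return discrepancy <= tol


if __name__ == "__main__":
    random.seed(0)
    n = 8
    A = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
    B = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
    w = [1.0] * n                  # all-ones weights give a plain row-sum checksum

    C = matmul(A, B)
    print("fault-free result passes:", checksum_test(A, B, C, w))

    C[2][3] += 1e-3                # simulate a fault corrupting one output element
    print("corrupted result passes:", checksum_test(A, B, C, w))
```

The check costs three matrix-vector products, which is small next to the product being verified. The tolerance is the delicate part: set too tight, ordinary rounding error raises false alarms; set too loose, small faults go undetected.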