Analysis and Randomized Design of Algorithm-Based Fault Tolerant Multiprocessor Systems Under an Extended Model
July 1997 (vol. 8 no. 7)
pp. 757-768

Abstract: The reliability of compute-intensive applications can be improved by introducing fault tolerance into the system. Algorithm-based fault tolerance (ABFT) is a low-cost scheme that provides the required fault tolerance through system-level encoding. In this paper, we propose randomized construction techniques, under an extended model, for the design of ABFT systems with the required fault tolerance capability. The extended model also considers failures in the processors that perform the checking operations.
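
To make "system-level encoding" concrete, the following is a minimal sketch of the classic checksum-matrix form of ABFT for matrix multiplication. It is an illustration only, not the randomized construction proposed in this paper. The function names (`column_checksum`, `row_checksum`, `matmul`, `locate_error`) are hypothetical, integer data is assumed so that the checks can use exact equality, and the checks themselves are assumed to be fault-free, which is exactly the assumption the paper's extended model relaxes.

```python
def column_checksum(a):
    """Return a copy of matrix a with an extra row holding each column's sum."""
    return a + [[sum(col) for col in zip(*a)]]


def row_checksum(b):
    """Return a copy of matrix b with an extra column holding each row's sum."""
    return [row + [sum(row)] for row in b]


def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def locate_error(c_full):
    """Check a full-checksum product matrix.

    Returns None if every row and column check passes, or the (row, col)
    position of a single erroneous data element. Assumes the checking
    step itself is fault-free (an assumption of this sketch).
    """
    n = len(c_full) - 1       # number of data rows
    m = len(c_full[0]) - 1    # number of data columns
    bad_rows = [i for i in range(n)
                if sum(c_full[i][:m]) != c_full[i][m]]
    bad_cols = [j for j in range(m)
                if sum(c_full[i][j] for i in range(n)) != c_full[n][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return (bad_rows[0], bad_cols[0])
    raise ValueError("error pattern is not locatable by these checks")


# Encode the inputs, multiply, then verify the encoded result.
a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c_full = matmul(column_checksum(a), row_checksum(b))
clean_result = locate_error(c_full)

# Inject a single error into one data element and locate it.
c_full[1][0] += 7
faulty_result = locate_error(c_full)
```

In this sketch a single corrupted data element violates exactly one row check and one column check, and their intersection locates it. Errors in the checksum entries themselves, or in the processor evaluating the checks, fall outside what this simple version handles.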

Index Terms:
Algorithm-based fault tolerance, concurrent error detection, concurrent fault location, randomized algorithms, fault diagnosis, transient faults.
Citation:
Shalini Yajnik, Niraj K. Jha, "Analysis and Randomized Design of Algorithm-Based Fault Tolerant Multiprocessor Systems Under an Extended Model," IEEE Transactions on Parallel and Distributed Systems, vol. 8, no. 7, pp. 757-768, July 1997, doi:10.1109/71.598349