Partitioned Encoding Schemes for Algorithm-Based Fault Tolerance in Massively Parallel Systems
June 1994 (vol. 5 no. 6)
pp. 649-653

Considers the applicability of algorithm-based fault tolerance (ABFT) to massively parallel scientific computation. Existing ABFT schemes can provide effective fault tolerance at low cost for computation on matrices of moderate size; however, the methods do not scale well to floating-point operations on large systems. This short note proposes a partitioned linear encoding scheme to provide scalability. Matrix algorithms employing this scheme are presented and compared to current ABFT schemes. The partitioned scheme is shown to provide scalable linear codes with improved numerical properties at only a small increase in hardware and time overhead.
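As a rough illustration of the idea behind a partitioned checksum code, the sketch below checks a matrix product using one checksum row per group of rows rather than a single checksum over the whole matrix. It is a minimal sketch under stated assumptions, not the encoding defined in the paper: the function names, the unweighted sum as the checksum, the block size, and the relative tolerance are all hypothetical choices made for this example.

```python
def matmul(a, b):
    """Plain matrix product of two lists-of-lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def partition_checksums(matrix, block):
    """Return one checksum row (column-wise sum) per group of `block` rows."""
    sums = []
    for start in range(0, len(matrix), block):
        group = matrix[start:start + block]
        sums.append([sum(col) for col in zip(*group)])
    return sums


def faulty_partitions(a, b, c, block, tol=1e-9):
    """Return indices of row partitions of c whose checksum disagrees.

    Relies on the identity: (checksum rows of a) * b equals the checksum
    rows of a * b. The tolerance is an assumed relative bound that
    absorbs floating-point round-off.
    """
    expected = matmul(partition_checksums(a, block), b)
    actual = partition_checksums(c, block)
    bad = []
    for i, (e_row, a_row) in enumerate(zip(expected, actual)):
        if any(abs(e - x) > tol * max(1.0, abs(e))
               for e, x in zip(e_row, a_row)):
            bad.append(i)
    return bad


a = [[1, 2], [3, 4], [5, 6], [7, 8]]
b = [[1, 0], [2, 1]]
c = matmul(a, b)
clean = faulty_partitions(a, b, c, block=2)

c[2][1] += 5  # simulate a transient error in one result element
flagged = faulty_partitions(a, b, c, block=2)
```

Because each checksum sums only `block` rows, the magnitude of the values being compared stays bounded as the matrix grows, which is one plausible reading of why partitioning helps the numerical behaviour of floating-point checks; the price is one extra checksum row per partition.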

Index Terms:
fault tolerant computing; software reliability; error correction codes; error detection codes; parallel architectures; matrix algebra; algorithm based fault tolerance; massively parallel systems; partitioned encoding; ABFT; scalability; matrix algorithms; partitioned scheme; checksum code; error detection; error correction; transient errors
Citation:
J. Rexford, N.K. Jha, "Partitioned Encoding Schemes for Algorithm-Based Fault Tolerance in Massively Parallel Systems," IEEE Transactions on Parallel and Distributed Systems, vol. 5, no. 6, pp. 649-653, June 1994, doi:10.1109/71.285610