
ASCII Text
J. Rexford, N.K. Jha, "Partitioned Encoding Schemes for Algorithm-Based Fault Tolerance in Massively Parallel Systems," IEEE Transactions on Parallel and Distributed Systems, vol. 5, no. 6, pp. 649-653, June 1994.
BibTeX
@article{10.1109/71.285610,
  author = {J. Rexford and N.K. Jha},
  title = {Partitioned Encoding Schemes for Algorithm-Based Fault Tolerance in Massively Parallel Systems},
  journal = {IEEE Transactions on Parallel and Distributed Systems},
  volume = {5},
  number = {6},
  issn = {1045-9219},
  year = {1994},
  pages = {649-653},
  doi = {http://doi.ieeecomputersociety.org/10.1109/71.285610},
  publisher = {IEEE Computer Society},
  address = {Los Alamitos, CA, USA},
}
RefWorks Procite/RefMan/Endnote
TY  - JOUR
JO  - IEEE Transactions on Parallel and Distributed Systems
TI  - Partitioned Encoding Schemes for Algorithm-Based Fault Tolerance in Massively Parallel Systems
IS  - 6
SN  - 1045-9219
SP  - 649
EP  - 653
EPD - 649-653
A1  - J. Rexford
A1  - N.K. Jha
PY  - 1994
KW  - Index Terms: fault tolerant computing; software reliability; error correction codes; error detection codes; parallel architectures; matrix algebra; algorithm based fault tolerance; massively parallel systems; partitioned encoding; ABFT; scalability; matrix algorithms; partitioned scheme; checksum code; error detection; error correction; transient errors
VL  - 5
JA  - IEEE Transactions on Parallel and Distributed Systems
ER  -
Considers the applicability of algorithm-based fault tolerance (ABFT) to massively parallel scientific computation. Existing ABFT schemes can provide effective fault tolerance at low cost for computation on matrices of moderate size; however, the methods do not scale well to floating-point operations on large systems. This short note proposes the use of a partitioned linear encoding scheme to provide scalability. Matrix algorithms employing this scheme are presented and compared to current ABFT schemes. It is shown that the partitioned scheme provides scalable linear codes with improved numerical properties, at the cost of only a small increase in hardware and time overhead.
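The abstract does not spell out the encoding itself, so the following is only an illustrative sketch of the general idea behind a partitioned checksum code, not the authors' scheme. It assumes the classic column-checksum construction for matrix multiplication, applied separately to each group of `block` rows. The function names, the `block` parameter, and the tolerance `tol` are all hypothetical choices made for this example.

```python
def matmul(a, b):
    """Plain matrix product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def add_block_checksums(a, block):
    """Append one column-checksum row after every group of `block` rows.

    Assumption for this sketch: each partition carries its own checksum
    row, instead of one checksum row covering the whole matrix.
    """
    out = []
    for start in range(0, len(a), block):
        rows = a[start:start + block]
        out.extend(rows)
        out.append([sum(col) for col in zip(*rows)])
    return out


def find_bad_blocks(c_full, block, tol=1e-9):
    """Return indices of partitions whose checksum row no longer matches."""
    bad = []
    stride = block + 1  # data rows plus one checksum row per partition
    for idx, start in enumerate(range(0, len(c_full), stride)):
        group = c_full[start:start + stride]
        rows, check = group[:-1], group[-1]
        sums = [sum(col) for col in zip(*rows)]
        # Relative tolerance, since floating-point sums are not exact.
        if any(abs(s - c) > tol * max(1.0, abs(c))
               for s, c in zip(sums, check)):
            bad.append(idx)
    return bad


a = [[1, 2], [3, 4], [5, 6], [7, 8]]
b = [[1, 0, 2], [0, 1, 3]]

# Multiplying the encoded matrix by b yields the product with its
# per-partition checksum rows already in place.
c_full = matmul(add_block_checksums(a, 2), b)
print(find_bad_blocks(c_full, 2))

c_full[4][1] += 5  # simulate a transient error in the second partition
print(find_bad_blocks(c_full, 2))
```

In this sketch each checksum sums only `block` elements, so an error is localized to one partition and the round-off accumulated in each comparison stays bounded as the matrix grows. That is one plausible reading of the scalability and numerical claims above; the paper itself should be consulted for the actual codes and overhead analysis.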