
P. Banerjee, J.T. Rahmeh, C. Stunkel, V.S. Nair, K. Roy, V. Balasubramanian, J.A. Abraham, "Algorithm-Based Fault Tolerance on a Hypercube Multiprocessor," IEEE Transactions on Computers, vol. 39, no. 9, pp. 1132-1145, September 1990. ISSN 0018-9340. DOI: 10.1109/12.57055. IEEE Computer Society, Los Alamitos, CA, USA.
Keywords: fault tolerance; hypercube multiprocessor; multiprocessor architecture; faulty processors; error detection; Intel iPSC hypercube; matrix multiplication; Gaussian elimination; fast Fourier transform; fault tolerant computing; multiprocessing systems; parallel architectures.
The design of a fault-tolerant hypercube multiprocessor architecture is discussed. The authors propose detecting and locating faulty processors concurrently with the execution of parallel applications on the hypercube, using a novel scheme of algorithm-based error detection. System-level error detection mechanisms have been implemented for three parallel applications on a 16-processor Intel iPSC hypercube multiprocessor: matrix multiplication, Gaussian elimination, and fast Fourier transform. Schemes for other applications are under development. The error coverage of the system-level error detection schemes has been studied extensively in the presence of finite-precision arithmetic, which affects the system-level encodings. Two reconfiguration schemes are proposed that allow the authors to isolate faulty processors and replace them with spare processors.
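
To illustrate the kind of encoding the abstract refers to, the following is a minimal single-process sketch of checksum-based error detection for matrix multiplication, in the style of the checksum scheme of Huang and Abraham (1984). It is not the authors' hypercube implementation: the function names, the tolerance `tol`, and the relative-error test are assumptions made for this example. The tolerance stands in for the finite-precision issue the abstract mentions, since rounding can make a fault-free checksum differ slightly from the recomputed sum.

```python
def matmul(a, b):
    """Plain triple-loop product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def column_checksum(a):
    """Append one extra row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]


def row_checksum(b):
    """Append to each row of b one extra element holding that row's sum."""
    return [row + [sum(row)] for row in b]


def find_inconsistencies(c, tol=1e-9):
    """Check a full-checksum matrix c (last row and last column are checksums).

    Returns (bad_rows, bad_cols): indices of data rows and data columns whose
    recomputed sum disagrees with the stored checksum by more than a relative
    tolerance. The tolerance is an assumption of this sketch; it absorbs
    floating-point rounding so that fault-free results are not flagged.
    """
    n, m = len(c) - 1, len(c[0]) - 1

    def differs(actual, expected):
        return abs(actual - expected) > tol * max(1.0, abs(expected))

    bad_rows = [i for i in range(n) if differs(sum(c[i][:m]), c[i][m])]
    bad_cols = [j for j in range(m)
                if differs(sum(c[i][j] for i in range(n)), c[n][j])]
    return bad_rows, bad_cols


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[0.5, 1.5], [2.5, 3.5]]

# The product of a column-checksum matrix and a row-checksum matrix is a
# full-checksum matrix whose data part equals the ordinary product a * b.
c_full = matmul(column_checksum(a), row_checksum(b))
print(find_inconsistencies(c_full))

# Simulate a faulty result element; the row and column checks intersect at it.
c_full[1][0] += 1.0
print(find_inconsistencies(c_full))
```

In this sketch a single corrupted data element shows up as exactly one inconsistent row and one inconsistent column, which is what makes location (and not only detection) possible. In a multiprocessor setting the rows or blocks would be distributed across processors, so an inconsistent check points to the processor that computed that part of the result.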