Algorithm-Based Fault Tolerance on a Hypercube Multiprocessor
September 1990 (vol. 39 no. 9)
pp. 1132-1145

The design of a fault-tolerant hypercube multiprocessor architecture is discussed. The authors propose detecting and locating faulty processors concurrently with the execution of parallel applications on the hypercube, using a novel scheme of algorithm-based error detection. System-level error detection mechanisms have been implemented for three parallel applications on a 16-processor Intel iPSC hypercube multiprocessor: matrix multiplication, Gaussian elimination, and fast Fourier transform. Schemes for other applications are under development. The error coverage of the system-level error detection schemes has been studied extensively in the presence of finite-precision arithmetic, which affects the system-level encodings. Two reconfiguration schemes are proposed that isolate faulty processors and replace them with spare processors.
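
As a rough illustration of what algorithm-based error detection can look like for matrix multiplication, the sketch below uses a row/column checksum encoding. The abstract does not describe the encoding actually used on the iPSC, so the encoding, the relative tolerance `rel_tol`, and every function name here are assumptions chosen for illustration. The tolerance stands in for the finite-precision effects mentioned above: with floating-point arithmetic, a recomputed sum rarely matches its checksum exactly, so the comparison has to allow a small margin.

```python
# Illustrative sketch only. The checksum encoding, the tolerance, and all
# names below are assumptions; none of them are taken from the paper.

def matmul(a, b):
    """Plain matrix product of two lists-of-lists."""
    inner = len(b)
    return [[sum(a[i][k] * b[k][j] for k in range(inner))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def with_column_checksums(a):
    """Append one row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]

def with_row_checksums(b):
    """Append one column holding the sum of each row of b."""
    return [row + [sum(row)] for row in b]

def check_product(c_full, rel_tol=1e-9):
    """Return (bad_rows, bad_cols) for a product carrying both checksums.

    A data row is flagged when its elements do not add up to the entry in
    the last column; a data column is flagged when its elements do not add
    up to the entry in the last row. rel_tol absorbs rounding error.
    """
    n_rows = len(c_full) - 1
    n_cols = len(c_full[0]) - 1

    def mismatch(total, checksum):
        return abs(total - checksum) > rel_tol * max(1.0, abs(checksum))

    bad_rows = [i for i in range(n_rows)
                if mismatch(sum(c_full[i][:n_cols]), c_full[i][n_cols])]
    bad_cols = [j for j in range(n_cols)
                if mismatch(sum(c_full[i][j] for i in range(n_rows)),
                            c_full[n_rows][j])]
    return bad_rows, bad_cols

# Usage: encode the operands, multiply, then verify the result.
a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c_full = matmul(with_column_checksums(a), with_row_checksums(b))
clean_report = check_product(c_full)

# Simulate a faulty processor corrupting a single element of the product.
c_full[1][0] += 1.0
faulty_report = check_product(c_full)
```

In this sketch a single corrupted element fails exactly one row check and one column check, and the intersection of the two identifies the element. In a distributed setting that position could then be mapped back to the processor that computed it. Choosing `rel_tol` is a trade-off: too tight and rounding error raises false alarms, too loose and small errors escape detection.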

Index Terms:
fault tolerance; hypercube multiprocessor; multiprocessor architecture; faulty processors; error detection; Intel iPSC hypercube; matrix multiplication; Gaussian elimination; fast Fourier transform; fault tolerant computing; multiprocessing systems; parallel architectures.
Citation:
P. Banerjee, J.T. Rahmeh, C. Stunkel, V.S. Nair, K. Roy, V. Balasubramanian, J.A. Abraham, "Algorithm-Based Fault Tolerance on a Hypercube Multiprocessor," IEEE Transactions on Computers, vol. 39, no. 9, pp. 1132-1145, Sept. 1990, doi:10.1109/12.57055