Issue No.09 - September (1990 vol.39)
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/12.57055
The design of a fault-tolerant hypercube multiprocessor architecture is discussed. The authors propose detecting and locating faulty processors concurrently with the execution of parallel applications on the hypercube, using a novel scheme of algorithm-based error detection. System-level error detection mechanisms have been implemented for three parallel applications on a 16-processor Intel iPSC hypercube multiprocessor: matrix multiplication, Gaussian elimination, and fast Fourier transform. Schemes for other applications are under development. The error coverage of the system-level error detection schemes has been studied extensively in the presence of finite-precision arithmetic, which affects the system-level encodings. Two reconfiguration schemes are proposed that allow the authors to isolate faulty processors and replace them with spare processors.
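The abstract does not describe the encodings themselves. As an illustration only, the following is a minimal single-process sketch of algorithm-based error detection for matrix multiplication, assuming the widely used row and column checksum encoding; it is not necessarily the scheme implemented on the iPSC. The function names and the relative tolerance `tol` are hypothetical. The tolerance stands in for the finite-precision issue the abstract mentions: with floating-point arithmetic the checksums rarely match exactly, so the comparison must allow for rounding.

```python
def matmul(a, b):
    """Plain product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def add_column_checksum(a):
    """Append one row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]

def add_row_checksum(b):
    """Append one column holding the sum of each row of b."""
    return [row + [sum(row)] for row in b]

def find_inconsistencies(c_full, tol=1e-9):
    """Return (bad_rows, bad_cols) for the data part of a full-checksum product.

    tol is an assumed relative tolerance that absorbs rounding error.
    """
    n, m = len(c_full) - 1, len(c_full[0]) - 1

    def differs(x, y):
        return abs(x - y) > tol * max(1.0, abs(x), abs(y))

    bad_rows = [i for i in range(n)
                if differs(sum(c_full[i][:m]), c_full[i][m])]
    bad_cols = [j for j in range(m)
                if differs(sum(c_full[i][j] for i in range(n)), c_full[n][j])]
    return bad_rows, bad_cols

# Encode the operands, multiply, then check the product.
a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c_full = matmul(add_column_checksum(a), add_row_checksum(b))
clean = find_inconsistencies(c_full)

# Simulate a faulty processor corrupting one element of the result.
c_full[0][1] += 1.0
faulty = find_inconsistencies(c_full)
```

In this sketch a single corrupted element shows up as exactly one inconsistent row and one inconsistent column, and their intersection locates the error. On a multiprocessor, mapping that location back to the processor that computed the element is what would allow a faulty node to be identified.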
fault tolerance; hypercube multiprocessor; multiprocessor architecture; faulty processors; error detection; Intel iPSC hypercube; matrix multiplication; Gaussian elimination; fast Fourier transform; fault tolerant computing; multiprocessing systems; parallel architectures.
J.T. Rahmeh, C. Stunkel, V.S. Nair, K. Roy, V. Balasubramanian, J.A. Abraham, "Algorithm-Based Fault Tolerance on a Hypercube Multiprocessor", IEEE Transactions on Computers, vol. 39, no. 9, pp. 1132-1145, September 1990, doi:10.1109/12.57055