Issue No. 09 - September (1990 vol. 39)
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/12.57055
<p>The design of a fault-tolerant hypercube multiprocessor architecture is discussed. The authors propose detecting and locating faulty processors concurrently with the execution of parallel applications on the hypercube, using a novel scheme of algorithm-based error detection. System-level error detection mechanisms have been implemented for three parallel applications on a 16-processor Intel iPSC hypercube multiprocessor: matrix multiplication, Gaussian elimination, and fast Fourier transform. Schemes for other applications are under development. The error coverage of the system-level error detection schemes has been studied extensively in the presence of finite-precision arithmetic, which affects the system-level encodings. Two reconfiguration schemes are proposed that allow faulty processors to be isolated and replaced with spare processors.</p>
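To give a flavor of what algorithm-based error detection for matrix multiplication can look like, the sketch below uses the classic row/column checksum encoding. This is an assumption for illustration only: the abstract does not describe the authors' encoding, and the function names (`with_column_checksum`, `with_row_checksum`, `find_inconsistencies`) and the tolerance `tol` are hypothetical. The tolerance stands in for the finite-precision concern mentioned above, since rounding makes exact checksum equality unreliable for floating-point data.

```python
# Illustrative checksum-encoded matrix multiplication (single process).
# NOT the paper's scheme; all names and the tolerance are assumptions.

def matmul(a, b):
    """Plain matrix product of two lists-of-lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def with_column_checksum(a):
    """Append one row that holds the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]

def with_row_checksum(b):
    """Append one column that holds the sum of each row of b."""
    return [row + [sum(row)] for row in b]

def find_inconsistencies(c_full, tol=1e-9):
    """Return (bad_rows, bad_cols) where data and checksum disagree.

    c_full is the product of a column-checksum matrix and a
    row-checksum matrix, so its last row and last column are checksums.
    The comparison is relative to allow for rounding error.
    """
    n = len(c_full) - 1      # number of data rows
    m = len(c_full[0]) - 1   # number of data columns

    def differs(actual, expected):
        return abs(actual - expected) > tol * max(1.0, abs(expected))

    bad_rows = [i for i in range(n)
                if differs(sum(c_full[i][:m]), c_full[i][m])]
    bad_cols = [j for j in range(m)
                if differs(sum(c_full[i][j] for i in range(n)), c_full[n][j])]
    return bad_rows, bad_cols

a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c_full = matmul(with_column_checksum(a), with_row_checksum(b))
clean_result = find_inconsistencies(c_full)

# Simulate a faulty computation of one element of the product.
c_full[0][1] += 1.0
faulty_result = find_inconsistencies(c_full)
```

In this toy run the intersection of the inconsistent row and column points at the corrupted element. On a multiprocessor, the same idea would let a checking node infer which processor produced a bad block, but how the paper maps checks to processors is not stated in the abstract.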
fault tolerance; hypercube multiprocessor; multiprocessor architecture; faulty processors; error detection; Intel iPSC hypercube; matrix multiplication; Gaussian elimination; fast Fourier transform; fault tolerant computing; multiprocessing systems; parallel architectures.
K. Roy et al., "Algorithm-Based Fault Tolerance on a Hypercube Multiprocessor," in IEEE Transactions on Computers, vol. 39, no. 9, pp. 1132-1145, 1990.