P. Banerjee, J.T. Rahmeh, C. Stunkel, V.S. Nair, K. Roy, V. Balasubramanian, J.A. Abraham, "Algorithm-Based Fault Tolerance on a Hypercube Multiprocessor," IEEE Transactions on Computers, vol. 39, no. 9, pp. 1132-1145, September 1990.  
The design of a fault-tolerant hypercube multiprocessor architecture is discussed. The authors propose detecting and locating faulty processors concurrently with the actual execution of parallel applications on the hypercube, using a novel scheme of algorithm-based error detection. System-level error detection mechanisms have been implemented for three parallel applications on a 16-processor Intel iPSC hypercube multiprocessor: matrix multiplication, Gaussian elimination, and fast Fourier transform. Schemes for other applications are under development. The error coverage of the system-level error detection schemes has been studied extensively in the presence of finite-precision arithmetic, which affects the system-level encodings. Two reconfiguration schemes are proposed that allow the authors to isolate faulty processors and replace them with spare processors.
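
To make the idea of algorithm-based error detection for matrix multiplication more concrete, here is a minimal sketch of the well-known checksum encoding, in which a checksum row and a checksum column are carried through the product and then re-verified. It is an illustration only, not the authors' hypercube implementation: the function names, the single-process setting, and the tolerance `tol` are assumptions made for this example. The tolerance stands in for the finite-precision issue mentioned in the abstract, since rounding means that floating-point checksums rarely match exactly.

```python
def matmul(a, b):
    """Plain triple-loop matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def add_column_checksum_row(a):
    """Append a row that holds the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]


def add_row_checksum_column(b):
    """Append a column that holds the sum of each row of b."""
    return [row + [sum(row)] for row in b]


def find_inconsistencies(c_full, tol=1e-9):
    """Return (bad_rows, bad_cols) of a full-checksum product matrix.

    The last row and last column of c_full are checksums. `tol` is a
    hypothetical relative tolerance chosen for this sketch; a real
    system would derive it from an error analysis of the arithmetic.
    """
    n = len(c_full) - 1        # number of data rows
    m = len(c_full[0]) - 1     # number of data columns
    scale = max(1.0, max(abs(x) for row in c_full for x in row))
    limit = tol * scale
    bad_rows = [i for i in range(n)
                if abs(sum(c_full[i][:m]) - c_full[i][m]) > limit]
    bad_cols = [j for j in range(m)
                if abs(sum(c_full[i][j] for i in range(n)) - c_full[n][j]) > limit]
    return bad_rows, bad_cols


if __name__ == "__main__":
    a = [[1.0, 2.0], [3.0, 4.0]]
    b = [[5.0, 6.0], [7.0, 8.0]]
    c_full = matmul(add_column_checksum_row(a), add_row_checksum_column(b))
    print(find_inconsistencies(c_full))   # fault-free product
    c_full[1][0] += 0.5                   # simulate one corrupted element
    print(find_inconsistencies(c_full))   # row and column of the error
```

In this sketch a single corrupted element shows up as exactly one inconsistent row and one inconsistent column, whose intersection locates the error. In a multiprocessor setting one could, in principle, map that location back to the processor that computed the element, but that mapping is outside the scope of this example.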
