Tradeoffs in the Design of Efficient Algorithm-Based Error Detection Schemes for Hypercube Multiprocessors
February 1990 (vol. 16 no. 2)
pp. 183-196

The authors provide an in-depth study of the issues and tradeoffs involved in algorithm-based error detection, along with a general methodology for evaluating such schemes. They illustrate the approach on QR factorization, a widely used computation in numerical linear algebra. They have implemented and investigated numerous ways of applying algorithm-based error detection to QR factorization using different system-level encoding strategies; specifically, they have developed schemes based on the checksum and sum-of-squares (SOS) encoding techniques. Results of studies performed on a 16-processor Intel iPSC-2/D4/MX hypercube multiprocessor are reported. They show that, in general, the SOS approach gives much better error coverage (85-100%) for QR factorization while keeping overheads low (below 10%).
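
The abstract names the two encodings but does not spell them out, so the following is only a minimal serial sketch of the invariants such checks usually rely on, not the authors' hypercube implementation. It assumes Givens rotations as the orthogonal transformation. The helper names (`givens_triangularize`, `checksum_check`, `sos_check`), the test matrix, and the tolerance are all hypothetical choices made for illustration. The checksum check uses the fact that row operations applied to [A | Ae] yield [R | Re], so an appended row-sum column should still equal the row sums of R. The SOS check uses the fact that orthogonal transformations preserve the sum of squares of each column.

```python
import math


def givens_triangularize(rows, pivots):
    """Zero the entries below the diagonal in the first `pivots` columns
    with Givens rotations; any extra columns are transformed along with them."""
    r = [[float(x) for x in row] for row in rows]
    width = len(r[0])
    for j in range(min(pivots, len(r))):
        for i in range(len(r) - 1, j, -1):
            h = math.hypot(r[j][j], r[i][j])
            if h == 0.0:
                continue  # nothing to eliminate
            c, s = r[j][j] / h, r[i][j] / h
            for k in range(width):
                u, v = r[j][k], r[i][k]
                r[j][k], r[i][k] = c * u + s * v, c * v - s * u
    return r


def checksum_check(r_aug, tol=1e-9):
    """Checksum encoding: the last column must equal the sum of the others."""
    return all(
        math.isclose(sum(row[:-1]), row[-1], rel_tol=tol, abs_tol=tol)
        for row in r_aug
    )


def column_sos(rows, ncols):
    """Sum of squares of each of the first `ncols` columns."""
    return [sum(row[k] ** 2 for row in rows) for k in range(ncols)]


def sos_check(a, r, tol=1e-9):
    """SOS encoding: column sums of squares are unchanged by rotations."""
    n = len(a[0])
    return all(
        math.isclose(x, y, rel_tol=tol, abs_tol=tol)
        for x, y in zip(column_sos(a, n), column_sos(r, n))
    )


# Hypothetical 4x3 test matrix.
a = [[4.0, 1.0, 2.0], [2.0, 3.0, 1.0], [1.0, 2.0, 5.0], [3.0, 1.0, 1.0]]
n = len(a[0])

# Encode: append the row-sum (checksum) column, then triangularize.
a_aug = [row + [sum(row)] for row in a]
r_aug = givens_triangularize(a_aug, pivots=n)
r = [row[:n] for row in r_aug]
clean_ok = checksum_check(r_aug) and sos_check(a, r)

# Simulate a fault by corrupting one element of the result.
faulty_aug = [row[:] for row in r_aug]
faulty_aug[0][1] += 0.5
faulty = [row[:n] for row in faulty_aug]
checksum_detects = not checksum_check(faulty_aug)
sos_detects = not sos_check(a, faulty)

print(clean_ok, checksum_detects, sos_detects)
```

Both checks compare floating-point quantities, so a tolerance is unavoidable. How tight it can be set without raising false alarms from roundoff is one of the tradeoffs a scheme like this has to balance against coverage.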

Index Terms:
hypercube multiprocessors; algorithm-based error detection; numerical linear algebra; QR factorization; encoding; checksum; sum-of-squares; 16-processor Intel iPSC-2/D4/MX; error detection; linear algebra; multiprocessing systems; software engineering.
V. Balasubramanian, P. Banerjee, "Tradeoffs in the Design of Efficient Algorithm-Based Error Detection Schemes for Hypercube Multiprocessors," IEEE Transactions on Software Engineering, vol. 16, no. 2, pp. 183-196, Feb. 1990, doi:10.1109/32.44381