Real-Number Codes for Fault-Tolerant Matrix Operations on Processor Arrays
April 1990 (vol. 39 no. 4)
pp. 426-435

A generalization of existing real-number codes is proposed. It is proven that linearity is a necessary and sufficient condition for codes used in fault-tolerant matrix operations such as matrix addition, multiplication, transposition, and LU decomposition. It is also proven that for every linear code defined over a finite field, there exists a corresponding linear real-number code with similar error-detecting capabilities. Encoding schemes are given for several example codes that fall within the general set of real-number codes. A rule for selecting a particular code for a given application is derived from experiments. Simulation experiments show that the performance overhead of fault-tolerance schemes using the generalized encoding schemes is very low.
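
As an illustration of why linearity matters, the following minimal sketch applies a weighted checksum encoding to matrix multiplication. It is not the paper's encoding scheme: it is a weighted variant of the well-known checksum approach to algorithm-based fault tolerance, and the function names, the weight vectors, and the tolerance `tol` are all assumptions made for this example. Because each check element is a linear function of the data, the product of the encoded operands is itself a valid encoded matrix, so a faulty element shows up as a checksum mismatch.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain triple-loop matrix product (no fault tolerance by itself)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def encode_columns(a: Matrix, weights: List[float]) -> Matrix:
    """Append one check row: the weighted sum of the rows of `a`."""
    check = [sum(w * row[j] for w, row in zip(weights, a))
             for j in range(len(a[0]))]
    return [row[:] for row in a] + [check]


def encode_rows(b: Matrix, weights: List[float]) -> Matrix:
    """Append one check column: the weighted sum of each row of `b`."""
    return [row[:] + [sum(w * x for w, x in zip(weights, row))] for row in b]


def _close(x: float, y: float, tol: float) -> bool:
    # Relative comparison, since real-number checks are subject to rounding.
    return abs(x - y) <= tol * max(1.0, abs(x), abs(y))


def is_consistent(c: Matrix, row_weights: List[float],
                  col_weights: List[float], tol: float = 1e-9) -> bool:
    """Check that the last row and last column of `c` match the weighted
    sums of the data rows and data columns (hypothetical helper)."""
    n, m = len(c) - 1, len(c[0]) - 1
    for j in range(m + 1):  # verify the check row, column by column
        expected = sum(w * c[i][j] for i, w in enumerate(row_weights))
        if not _close(expected, c[n][j], tol):
            return False
    for i in range(n + 1):  # verify the check column, row by row
        expected = sum(w * c[i][j] for j, w in enumerate(col_weights))
        if not _close(expected, c[i][m], tol):
            return False
    return True


# Usage: encode both operands, multiply, then verify the encoded product.
A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
row_w = [1.0, 2.0]  # assumed weights for the check row
col_w = [1.0, 1.0]  # assumed weights for the check column

C_full = matmul(encode_columns(A, row_w), encode_rows(B, col_w))
fault_free = is_consistent(C_full, row_w, col_w)

C_faulty = [row[:] for row in C_full]
C_faulty[0][1] += 0.5  # simulate a single erroneous element
fault_detected = not is_consistent(C_faulty, row_w, col_w)
```

The tolerance is needed because floating-point rounding makes exact equality of the check elements unreliable; choosing it well is an application-dependent decision that this sketch does not address.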

Index Terms:
real-number codes; encoding; fault-tolerant matrix operations; processor arrays; linearity; necessary and sufficient condition; multiplication; transposition; LU decomposition; error detecting; performance overhead; simulation experiments; error detection codes; fault-tolerant computing.
V.S.S. Nair, J.A. Abraham, "Real-Number Codes for Fault-Tolerant Matrix Operations on Processor Arrays," IEEE Transactions on Computers, vol. 39, no. 4, pp. 426-435, April 1990, doi:10.1109/12.54836