Optimal Design of Checks for Error Detection and Location in Fault-Tolerant Multiprocessor Systems
July 1993 (vol. 42 no. 7)
pp. 780-793

This paper proposes RANDGEN, a simple and efficient general-purpose algorithm for generating arbitrary data-check (DC) graphs that use a small number of checks and satisfy a variety of properties found useful in algorithm-based fault tolerance (ABFT) designs. The concept of majority diagnosability is introduced in an attempt to explicitly redesign DC graphs for easy diagnosis. UNIFGEN, a variation of RANDGEN that produces DC graphs with uniform checks, is also examined.

Index Terms:
error location; optimal design of checks; error detection; fault-tolerant multiprocessor systems; RANDGEN; arbitrary data-check; algorithm-based fault tolerance; majority diagnosability; UNIFGEN; uniform checks; fault tolerant computing; multiprocessing systems.
Citation:
R.K. Sitaraman, N.K. Jha, "Optimal Design of Checks for Error Detection and Location in Fault-Tolerant Multiprocessor Systems," IEEE Transactions on Computers, vol. 42, no. 7, pp. 780-793, July 1993, doi:10.1109/12.237719