Node-covering, Error-correcting Codes and Multiprocessors with Very High Average Fault Tolerance
September 1997 (vol. 46 no. 9)
pp. 997-1015

Abstract:
Structural fault tolerance (SFT) is the ability of a multiprocessor to reconfigure around faulty processors or links in order to preserve its original processor interconnection structure. In this paper, we focus on the design of SFT multiprocessors that have low switch and link overheads, but can tolerate a very large number of processor faults on average. Most previous work has concentrated on deterministic k-fault-tolerant (k-FT) designs in which exactly k spare processors and some spare switches and links are added to construct multiprocessors that can tolerate any k processor faults. However, after k faults are reconfigured around, many of the extra links and switches can remain unused. It is possible within the basic node-covering framework, which was introduced by Dutt and Hayes as an efficient k-FT design method, to design FT multiprocessors that have the same number of switches and links as, say, a two-FT deterministic design, but have s spare processors, where $s \gg 2$, so that, on average, $k = \Theta(s)$ ($k < s$) processor failures can be reconfigured around. Such designs utilize the spare link and switch capacity very efficiently, and are called probabilistic FT designs. An elegant and powerful method to construct covering graphs (CGs), which are key to obtaining the probabilistic FT designs, is to use linear error-correcting codes (ECCs). We show how to construct probabilistic designs with very high average fault tolerance but low wiring and switch overhead using ECCs like the 2D-parity, full-two, 3D-parity, and full-three codes. This design methodology is applicable to any multiprocessor interconnection topology, and the resulting FT designs have the same node degree as the non-FT target topology. We also analyze the deterministic fault tolerance for these designs and develop efficient layout strategies for them.
Finally, we compare the proposed probabilistic designs to some of the best deterministic and probabilistic designs proposed in the past, and show that our designs can meet a given mean-time-to-failure (MTTF) specification at much lower hardware costs (switch complexity, redundant wiring area, and spare-processor overhead) than previous designs. Further, for a given number of spare processors, our designs have close-to-optimal reconfigurabilities that are much better than those of previous probabilistic designs.
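
As a rough illustration of why a parity-code arrangement of spares can tolerate many faults on average, the sketch below checks whether a set of faulty processors can each be assigned a distinct spare. It is not the construction from the paper. It assumes a deliberately simplified, hypothetical model: primary processors sit in a grid, there is one spare per row and one per column (loosely mirroring the check bits of a 2D-parity code), spares never fail, and a faulty processor is replaced directly by its row spare or its column spare. Covering sequences, links, and switches are ignored. The names `spares_for`, `reconfigurable`, and `all_reconfigurable` are made up for this example. The check itself is ordinary bipartite matching by augmenting paths.

```python
from itertools import combinations


def spares_for(node):
    """Spares allowed to replace primary `node` = (row, col) in this toy model."""
    r, c = node
    return [("row", r), ("col", c)]


def reconfigurable(faulty):
    """Return True if every faulty primary can be given its own distinct spare.

    This is a bipartite matching between faulty primaries and spares,
    found with the classic augmenting-path method.
    """
    match = {}  # spare -> faulty primary currently assigned to it

    def augment(node, seen):
        for spare in spares_for(node):
            if spare in seen:
                continue
            seen.add(spare)
            # Take a free spare, or try to move its current user elsewhere.
            if spare not in match or augment(match[spare], seen):
                match[spare] = node
                return True
        return False

    return all(augment(node, set()) for node in faulty)


def all_reconfigurable(k, rows, cols):
    """Return True if every k-subset of the rows x cols grid is reconfigurable."""
    nodes = [(r, c) for r in range(rows) for c in range(cols)]
    return all(reconfigurable(list(f)) for f in combinations(nodes, k))
```

In this toy model a 3 x 3 grid has six spares. `all_reconfigurable(5, 3, 3)` holds, while a fully faulty 2 x 3 block fails because its six faulty processors share only five spares (two row spares and three column spares). Many other six-fault patterns still succeed, which is the gap between deterministic and average fault tolerance that the abstract describes.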

[1] B. Arazi, A Commonsense Approach to the Theory of Error-correcting Codes. MIT Press, 1988.
[2] P. Banerjee, S.Y. Kuo, and W.K. Fuchs, "Reconfigurable Cube-Connected Cycles Architecture," Proc. 16th Fault Tolerant Computing Symp., pp. 286-291, June 1986.
[3] R.A. Brualdi, Introductory Combinatorics, pp. 155-158. New York: North-Holland, 1977.
[4] J. Bruck, R. Cypher, and C.-T. Ho, "Wildcard Dimensions, Coding Theory and Fault-Tolerant Meshes and Hypercubes," Proc. 23rd Int'l Symp. Fault-Tolerant Computing, pp. 260-267, June 1993.
[5] J. Bruck, R. Cypher, and C.-T. Ho, "Fault-Tolerant Meshes and Hypercubes with Minimal Numbers of Spares," IEEE Trans. Computers, vol. 42, no. 9, pp. 1089-1104, Sept. 1993.
[6] J. Bruck, R. Cypher, and C.-T. Ho, "Fault-Tolerant Meshes with Small Degree," Proc. ACM Symp. Parallel Algorithms and Architectures, pp. 1-10, 1993.
[7] S.-C. Chau and A.L. Liestman, "A Proposal for a Fault-Tolerant Binary Hypercube Architecture," Proc. 19th Int'l Symp. Fault-Tolerant Computing, pp. 323-330, June 1989.
[8] F.R.K. Chung, F.T. Leighton, and A.L. Rosenberg, "Diogenes: A Methodology for Designing Fault-Tolerant VLSI Processor Arrays," Proc. 13th Fault Tolerant Computing Symp., pp. 26-31, June 1983.
[9] T.H. Cormen, C.E. Leiserson, and R.L. Rivest, Introduction to Algorithms. Cambridge, Mass.: MIT Press/McGraw-Hill, 1990.
[10] S. Dutt and J.P. Hayes, "An Automorphic Approach to the Design of Fault-Tolerant Multiprocessors," Proc. 19th Fault Tolerant Computing Symp., pp. 496-503, Chicago, June 1989.
[11] S. Dutt and J.P. Hayes, "On Designing and Reconfiguring K-Fault-Tolerant Tree Architectures," IEEE Trans. Computers, vol. 39, no. 4, pp. 490-503, Apr. 1990.
[12] S. Dutt and J.P. Hayes, "Designing Fault-Tolerant Systems Using Automorphisms," J. Parallel and Distributed Computing, vol. 12, no. 3, pp. 249-268, 1991.
[13] S. Dutt and J.P. Hayes, "A Local-Sparing Design Methodology for Fault-Tolerant Multiprocessors," to appear in the Special Issue on Graph Theory in Computer Science, Chemistry, and Other Fields of Computers and Mathematics with Applications, Elsevier Science.
[14] S. Dutt and J.P. Hayes, "Some Practical Issues in the Design of Fault-Tolerant Multiprocessors," IEEE Trans. Computers, vol. 41, no. 5, pp. 588-598, May 1992.
[15] S. Dutt, "Fast Polylog-Time Reconfiguration of Structurally Fault-Tolerant Multiprocessors," Proc. Fifth IEEE Symp. Parallel and Distributed Processing, pp. 762-770, Dec. 1993.
[16] S. Dutt and N.R. Mahapatra, "Node-covering, Error-correcting Codes and Multiprocessors with Very High Average Fault Tolerance," technical report, Univ. of Minnesota, Minneapolis, 1996—accessible at ftp site:
[17] S.H. Friedberg, A.J. Insel, and L.E. Spence, Linear Algebra. Englewood Cliffs, N.J.: Prentice Hall, 1979.
[18] G.A. Gibson, L. Hellerstein, R.M. Karp, R.H. Katz, and D.A. Patterson, "Failure Correction Techniques for Large Disk Arrays," Proc. ASPLOS '89, pp. 123-132, 1989.
[19] J.P. Hayes, "A Graph Model for Fault Tolerant Computing Systems," IEEE Trans. Computers, vol. 25, no. 9, pp. 875-883, Sept. 1976.
[20] J.E. Hopcroft and R.M. Karp, "An $n^{5/2}$ Algorithm for Maximum Matching in Bipartite Graphs," SIAM J. Computing, vol. 2, pp. 225-231, 1973.
[21] S.Y. Kung, VLSI Array Processors. Prentice Hall, 1988.
[22] F. Lombardi, M.G. Sami, and R. Stefanelli, "Reconfiguration of VLSI Arrays by Covering," IEEE Trans. Computer-Aided Design, vol. 8, no. 9, pp. 952-965, Sept. 1989.
[23] C.S. Raghavendra, A. Avizienis, and M.D. Ercegovac, "Fault Tolerance in Binary Tree Architectures," IEEE Trans. Computers, vol. 33, no. 6, pp. 568-572, June 1984.
[24] D.A. Rennels, "On Implementing Fault-Tolerance in Binary Hypercubes," Proc. 16th Fault Tolerant Computing Symp., pp. 344-349, June 1986.
[25] V.P. Roychowdhury, J. Bruck, and T. Kailath, "Efficient Algorithms for Reconfiguration in VLSI/WSI Arrays," IEEE Trans. Computers, vol. 39, no. 4, pp. 480-489, Apr. 1990.
[26] M. Sami and R. Stefanelli, "Reconfigurable Architectures for VLSI Processing Arrays," Proc. IEEE, vol. 74, no. 5, pp. 712-722, May 1986.
[27] H. Tamaki, "Construction of the Mesh and the Torus Tolerating a Large Number of Faults," Proc. ACM Symp. Parallel Algorithms and Architectures, pp. 268-277, 1994.
[28] T.A. Varvarigou, V.P. Roychowdhury, and T. Kailath, "Reconfiguring Arrays Using Multiple-Track Models: The 3-Track-1-Spare Approach," IEEE Trans. Computers, vol. 42, no. 11, Nov. 1993.
[29] L. Weil, M. Pecht, and E. Hakim, "Reliability Evaluation of Plastic Encapsulated Parts," IEEE Trans. Reliability, vol. 42, no. 4, pp. 536-540, Dec. 1993.

Index Terms:
Average fault tolerance, deterministic fault tolerance, fault-tolerant multiprocessors, linear error-correcting codes, matching, network flow, node-covering, VLSI layout, reconfiguration.
Shantanu Dutt, Nihar R. Mahapatra, "Node-covering, Error-correcting Codes and Multiprocessors with Very High Average Fault Tolerance," IEEE Transactions on Computers, vol. 46, no. 9, pp. 997-1015, Sept. 1997, doi:10.1109/12.620481