Adaptive System-Level Diagnosis for Hypercube Multiprocessors
October 1996 (vol. 45 no. 10)
pp. 1157-1170

Abstract: System-level diagnosis is an important technique for fault detection and location in multiprocessor computing systems. Efficient diagnosis is highly desirable for sustaining the original system power. Moreover, effective diagnosis is particularly important for a multiprocessor system with high scalability but low connectivity. Most existing results are not applicable in practice because of two issues: high diagnosis cost and limited diagnosability. Over-d fault diagnosis, where d is the diagnosability, has only been addressed in the literature using a probabilistic method. To address these two issues, we propose a hierarchical adaptive system-level diagnosis approach for hypercube systems using a divide-and-conquer strategy. We first propose a conceptual algorithm, HADA, to support a rigorous analysis. Then we present its practical variant, IHADA. In HADA and IHADA, the over-d fault problem is inherently tackled through a deterministic method. Three measures of diagnosis cost (diagnosis time, number of tests, and number of test links) are analyzed for the proposed algorithms. It is proved that the diagnosis cost required by our approach is lower than that of previous diagnosis algorithms. It is shown that the diagnosis cost of the proposed algorithms depends on the number and location of faulty units in the system, and that the cost is extremely low when only a small number of faulty units exist. It is also shown that our algorithms have lower costs than a pessimistic diagnosis algorithm, which accepts a lower degree of accuracy in exchange for lower diagnosis cost. Experimental results on the nCUBE are provided to substantiate the practicality of the proposed approach.

Index Terms:
Reliable system design, system-level diagnosis, adaptive diagnosis, hypercube.
Chao Feng, Laxmi N. Bhuyan, Fabrizio Lombardi, "Adaptive System-Level Diagnosis for Hypercube Multiprocessors," IEEE Transactions on Computers, vol. 45, no. 10, pp. 1157-1170, Oct. 1996, doi:10.1109/12.543709