Issue No. 10 - October (1996 vol. 45)
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/12.543709
<p><b>Abstract</b>: System-level diagnosis is an important technique for fault detection and location in multiprocessor computing systems. Efficient diagnosis is highly desirable for sustaining the system's original computing power, and effective diagnosis is particularly important for a multiprocessor system with high scalability but low connectivity. Most existing results are not applicable in practice because of their high diagnosis cost and limited diagnosability. Over-<it>d</it> fault diagnosis, where <it>d</it> is the diagnosability, has been addressed in the literature only with a probabilistic method. To address both issues, high diagnosis cost and limited diagnosability, we propose a hierarchical adaptive system-level diagnosis approach for hypercube systems using a divide-and-conquer strategy. We first propose a conceptual algorithm, HADA, to support a rigorous analysis, and then present its practical variant, IHADA. In HADA and IHADA, the over-<it>d</it> fault problem is inherently tackled through a deterministic method. Three measures of diagnosis cost (diagnosis time, number of tests, and number of test links) are analyzed for the proposed algorithms. We prove that the diagnosis cost required by our approach is lower than that of previous diagnosis algorithms. We show that the diagnosis cost of the proposed algorithms depends on the number and location of faulty units in the system, and that the cost is extremely low when only a small number of faulty units exist. We also show that our algorithms have lower costs than a pessimistic diagnosis algorithm, which trades a lower degree of accuracy for lower diagnosis cost. Experimental results on the nCUBE are provided to substantiate the practicality of the proposed approach.</p>
Index Terms: Reliable system design, system-level diagnosis, adaptive diagnosis, hypercube.
Laxmi N. Bhuyan, Chao Feng, "Adaptive System-Level Diagnosis for Hypercube Multiprocessors", IEEE Transactions on Computers, vol. 45, no. 10, pp. 1157-1170, October 1996, doi:10.1109/12.543709