Issue No.06 - June (2012 vol.23)
Mourad Elhadef, Abu Dhabi University, Abu Dhabi
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TPDS.2011.248
We consider the fault identification problem, also known as system-level self-diagnosis, in multiprocessor and multicomputer systems using the comparison approach. In this diagnosis model, a set of tasks is assigned to pairs of nodes and their outcomes are compared by neighboring nodes. Because the comparisons are performed by the nodes themselves, faulty nodes can incorrectly claim that fault-free nodes are faulty or that faulty ones are fault-free. The collection of all agreements and disagreements among the nodes, i.e., the comparison outcomes, is used to identify the set of permanently faulty nodes. Since the introduction of the comparison model, significant progress has been made in both theory and practice associated with the original model and its offshoots. Nevertheless, efficiently identifying the set of faulty nodes when not all the comparison outcomes are available to the diagnosis algorithm at the beginning of the diagnosis phase, i.e., when only partial syndromes are available, remains an outstanding research issue. In this paper, we introduce a novel diagnosis approach that uses neural networks to solve this fault identification problem from partial syndromes. Results from a thorough simulation study demonstrate the effectiveness of the neural-network-based self-diagnosis algorithm for randomly generated diagnosable systems of different sizes and under various fault scenarios. We then conducted extensive simulations using partial syndromes and nondiagnosable systems. These simulations showed that the neural-network-based diagnosis approach provided good results, making it a viable addition or alternative to existing diagnosis algorithms.
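To make the notions of a comparison syndrome and a partial syndrome concrete, the following is a minimal Python sketch, not the paper's algorithm. It assumes a comparator-based rule in the style of the Maeng-Malek model: a fault-free comparator reports 0 (agreement) only when both compared neighbors are fault-free, and a faulty comparator reports an arbitrary outcome. The function names, the adjacency-dictionary representation, and the `keep_fraction` parameter are hypothetical and chosen only for illustration.

```python
import random
from itertools import combinations


def comparison_syndrome(adjacency, faulty, rng):
    """Build a full syndrome {(k, i, j): outcome} (assumed MM-style rule).

    Comparator k compares every pair (i, j) of its neighbors.
    Outcome 0 means agreement, 1 means disagreement.
    """
    syndrome = {}
    for k, neighbors in adjacency.items():
        for i, j in combinations(sorted(neighbors), 2):
            if k in faulty:
                # A faulty comparator is unreliable: its report is arbitrary.
                outcome = rng.randint(0, 1)
            else:
                # A fault-free comparator reports a disagreement exactly
                # when at least one compared node is faulty.
                outcome = int(i in faulty or j in faulty)
            syndrome[(k, i, j)] = outcome
    return syndrome


def partial_syndrome(syndrome, keep_fraction, rng):
    """Keep a random subset of the comparison outcomes (hypothetical helper)."""
    keys = sorted(syndrome)
    kept = rng.sample(keys, int(len(keys) * keep_fraction))
    return {key: syndrome[key] for key in kept}


if __name__ == "__main__":
    # Illustrative 5-node ring; node 2 is permanently faulty.
    ring = {n: {(n - 1) % 5, (n + 1) % 5} for n in range(5)}
    rng = random.Random(0)
    full = comparison_syndrome(ring, faulty={2}, rng=rng)
    print(full)
    print(partial_syndrome(full, keep_fraction=0.5, rng=rng))
```

A diagnosis algorithm, neural or otherwise, would take such a (possibly partial) syndrome as input and output the set of nodes it believes to be faulty; that step is deliberately left out here because the abstract does not specify it.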
Fault tolerance, system-level self-diagnosis, comparison models, partial syndromes, neural networks.
Mourad Elhadef, "Comparison-Based System-Level Fault Diagnosis: A Neural Network Approach", IEEE Transactions on Parallel & Distributed Systems, vol.23, no. 6, pp. 1047-1059, June 2012, doi:10.1109/TPDS.2011.248