Issue No. 04 - October-December (2007 vol. 4)
In today's world where distributed systems form many of our critical infrastructures, dependability outages are becoming increasingly common. In many situations, it is necessary to not just detect a failure, but also to diagnose the failure, i.e., to identify the source of the failure. Diagnosis is challenging since high throughput applications with frequent interactions between the different components allow fast error propagation. It is desirable to consider applications as black-boxes for the diagnostic process. In this paper, we propose a Monitor architecture for diagnosing failures in large-scale network protocols. The Monitor only observes the message exchanges between the protocol entities (PEs) remotely and does not access internal protocol state. At runtime, it builds a causal graph between the PEs based on their communication and uses this together with a rule base of allowed state transition paths to diagnose the failure. The tests used for the diagnosis are based on the rule base and are assumed to have imperfect coverage. The hierarchical Monitor framework allows distributed diagnosis handling failures at individual Monitors. The framework is implemented and applied to a reliable multicast protocol executing on our campus-wide network. Fault injection experiments are carried out to evaluate the accuracy and latency of the diagnosis.
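The idea of combining a causal graph with a rule base can be illustrated with a minimal sketch. This is not the paper's implementation: the message tuple layout, the function names (`build_causal_graph`, `suspects`, `violates_rules`, `diagnose`), the per-PE state machine, and the example rule base are all assumptions made for illustration. The sketch also ignores message timing when building the graph and treats every rule test as exact, whereas the abstract assumes tests with imperfect coverage.

```python
from collections import defaultdict


def build_causal_graph(messages):
    """Map each receiver to the set of senders that could have influenced it.

    messages: iterable of (time, sender, receiver, msg_type) tuples
    (a hypothetical layout for the observed message exchanges).
    """
    graph = defaultdict(set)
    for _time, sender, receiver, _msg_type in messages:
        graph[receiver].add(sender)
    return graph


def suspects(graph, failed_pe):
    """Return every PE with a causal path to failed_pe, including itself."""
    seen, stack = {failed_pe}, [failed_pe]
    while stack:
        node = stack.pop()
        for parent in graph.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def violates_rules(pe, messages, allowed, start="INIT"):
    """True if the messages sent by pe take a transition absent from the rule base.

    allowed: dict mapping (state, msg_type) -> next state (assumed rule format).
    """
    state = start
    for _time, sender, _receiver, msg_type in sorted(messages):
        if sender != pe:
            continue
        next_state = allowed.get((state, msg_type))
        if next_state is None:
            return True
        state = next_state
    return False


def diagnose(messages, failed_pe, allowed):
    """Test each causally related PE against the rule base; return the violators."""
    graph = build_causal_graph(messages)
    return sorted(
        pe for pe in suspects(graph, failed_pe)
        if violates_rules(pe, messages, allowed)
    )


# Hypothetical trace: B sends DATA twice in a row, which the rule base forbids.
messages = [
    (1, "A", "B", "DATA"),
    (2, "B", "C", "DATA"),
    (3, "B", "C", "DATA"),
    (4, "C", "D", "ACK"),
]
allowed = {
    ("INIT", "DATA"): "SENT",
    ("SENT", "ACK"): "INIT",
    ("INIT", "ACK"): "INIT",
}
print(diagnose(messages, "D", allowed))
```

Here a failure detected at D leads back through C and B to A, and only B's observed behavior breaks a rule, so B is reported as the likely source. Only externally observed messages are used, in keeping with the black-box view described above.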
Distributed system diagnosis, runtime monitoring, hierarchical Monitor system, fault injection based evaluation
Miguel P. Correia, Paulo J. Veríssimo, Gunjan Khanna, Saurabh Bagchi, Mike Yu Cheng, Padma Varadharajan, "Automated Rule-Based Diagnosis through a Distributed Monitor System", IEEE Transactions on Dependable and Secure Computing, vol. 4, no. 4, pp. 266-279, October-December 2007, doi:10.1109/TDSC.2007.70211