Automated Rule-Based Diagnosis through a Distributed Monitor System
October-December 2007 (vol. 4 no. 4)
pp. 266-279
In today's world where distributed systems form many of our critical infrastructures, dependability outages are becoming increasingly common. In many situations, it is necessary to not just detect a failure, but also to diagnose the failure, i.e., to identify the source of the failure. Diagnosis is challenging since high throughput applications with frequent interactions between the different components allow fast error propagation. It is desirable to consider applications as black-boxes for the diagnostic process. In this paper, we propose a Monitor architecture for diagnosing failures in large-scale network protocols. The Monitor only observes the message exchanges between the protocol entities (PEs) remotely and does not access internal protocol state. At runtime, it builds a causal graph between the PEs based on their communication and uses this together with a rule base of allowed state transition paths to diagnose the failure. The tests used for the diagnosis are based on the rule base and are assumed to have imperfect coverage. The hierarchical Monitor framework allows distributed diagnosis handling failures at individual Monitors. The framework is implemented and applied to a reliable multicast protocol executing on our campus-wide network. Fault injection experiments are carried out to evaluate the accuracy and latency of the diagnosis.
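
The diagnosis idea described above (observe messages only, build a causal graph, then test suspects against a rule base) can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the message tuple layout, the `RULES` table, and the function names are all hypothetical, the tests are treated as deterministic rather than having imperfect coverage, and the suspect set is a conservative over-approximation that ignores finer message ordering.

```python
from collections import defaultdict

# Hypothetical rule base: for each message type a PE may send, the set of
# message types it must have received earlier for the send to be allowed.
RULES = {
    "DATA": set(),      # assumed: DATA may be sent at any time
    "ACK": {"DATA"},    # assumed: an ACK is valid only after receiving DATA
}

def build_causal_graph(messages):
    """Map each receiver to the (time, sender, type) messages sent to it."""
    incoming = defaultdict(list)
    for time, sender, receiver, mtype in messages:
        incoming[receiver].append((time, sender, mtype))
    return incoming

def violates_rules(pe, messages, incoming):
    """True if pe sent a message before receiving what the rule requires."""
    for time, sender, _receiver, mtype in messages:
        if sender != pe:
            continue
        seen = {m for t, _s, m in incoming[pe] if t < time}
        if not RULES[mtype] <= seen:
            return True
    return False

def diagnose(failed_pe, detect_time, messages):
    """Return the suspects whose observed behavior breaks a rule."""
    observed = [m for m in messages if m[0] <= detect_time]
    incoming = build_causal_graph(observed)
    # Suspects: the PE where the failure was detected plus every PE that
    # can reach it through observed messages (walked backward).
    suspects, frontier = {failed_pe}, [failed_pe]
    while frontier:
        pe = frontier.pop()
        for _t, sender, _m in incoming[pe]:
            if sender not in suspects:
                suspects.add(sender)
                frontier.append(sender)
    return sorted(pe for pe in suspects
                  if violates_rules(pe, observed, incoming))

# Observed trace as (logical time, sender, receiver, message type).
trace = [
    (1, "S", "R1", "DATA"),
    (2, "R1", "S", "ACK"),
    (3, "R2", "S", "ACK"),  # R2 acknowledges DATA it never received
]
print(diagnose("S", 4, trace))  # ['R2']
```

In this trace the failure is detected at `S`, but the rule check attributes it to `R2`, which shows why detection at one entity does not by itself identify the source.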

Index Terms:
Distributed system diagnosis, runtime monitoring, hierarchical Monitor system, fault injection based evaluation
Gunjan Khanna, Mike Yu Cheng, Padma Varadharajan, Saurabh Bagchi, Miguel P. Correia, Paulo J. Veríssimo, "Automated Rule-Based Diagnosis through a Distributed Monitor System," IEEE Transactions on Dependable and Secure Computing, vol. 4, no. 4, pp. 266-279, Oct.-Dec. 2007, doi:10.1109/TDSC.2007.70211