Automated Online Monitoring of Distributed Applications through External Monitors
April-June 2006 (vol. 3 no. 2)
pp. 115-129
Providing error detection for large-scale distributed systems is challenging when they run legacy code on hosts that may not allow fault-tolerant functions to execute on them. A natural approach is to place detection in an observer system that is kept separate from the observed system of protocol entities, with the observer having access only to the external message exchanges of the observed entities. In this paper, we propose an autonomous self-checking Monitor system that provides fast error detection for underlying network protocols. The Monitor architecture is application-neutral and therefore lends itself to deployment for different protocols; only the rulebase, against which the observed interactions are matched, is specific to a protocol. To make the detection infrastructure scalable and dependable, we extend it to a hierarchical Monitor structure. The Monitor structure is made dynamic and reconfigurable through interactions designed to cope with failures, load changes, and mobility. The latency of the Monitor system is evaluated under fault-free conditions, and its coverage is evaluated under simulated error injections.
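
The idea of matching externally observed messages against a protocol-specific rulebase can be illustrated with a minimal Python sketch. It is not the paper's implementation: the Message fields, the ResponseRule class, the "DATA"/"ACK" message kinds, and the one-second deadline are all assumptions made for illustration. The sketch shows a single temporal rule, "every trigger message must be answered within a deadline", evaluated purely from observed traffic.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Message:
    """One externally observed message (all fields are hypothetical)."""
    time: float      # observation timestamp, in seconds
    sender: str
    receiver: str
    kind: str        # e.g. "DATA" or "ACK"


@dataclass
class ResponseRule:
    """Hypothetical temporal rule: every `trigger` message must be answered
    by a `response` message in the reverse direction within `deadline` s."""
    trigger: str
    response: str
    deadline: float
    pending: list = field(default_factory=list)

    def observe(self, msg):
        # Flag triggers whose deadline has passed by the time of this message.
        late = [p for p in self.pending if msg.time - p.time > self.deadline]
        self.pending = [p for p in self.pending if p not in late]
        if msg.kind == self.trigger:
            self.pending.append(msg)
        elif msg.kind == self.response:
            # A response clears the oldest matching pending trigger.
            for p in self.pending:
                if p.sender == msg.receiver and p.receiver == msg.sender:
                    self.pending.remove(p)
                    break
        return late


class Monitor:
    """Application-neutral core: only the rulebase is protocol-specific."""

    def __init__(self, rulebase):
        self.rulebase = rulebase

    def observe(self, msg):
        violations = []
        for rule in self.rulebase:
            violations.extend(rule.observe(msg))
        return violations


if __name__ == "__main__":
    monitor = Monitor([ResponseRule("DATA", "ACK", deadline=1.0)])
    trace = [
        Message(0.0, "A", "B", "DATA"),
        Message(0.5, "B", "A", "ACK"),   # answered in time
        Message(1.0, "A", "C", "DATA"),  # never answered
        Message(3.0, "A", "B", "DATA"),  # arrives after C's deadline
    ]
    for m in trace:
        for v in monitor.observe(m):
            print(f"t={m.time}: no {v.receiver}->{v.sender} response to message sent at t={v.time}")
```

In this sketch a violation is only noticed when a later message is observed; a real monitor would presumably also need a timer so that a silent system is still flagged.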


Index Terms:
Error detection, blackbox detection, monitor system, temporal and combinatorial rules, reliable multicast.
Citation:
Gunjan Khanna, Padma Varadharajan, Saurabh Bagchi, "Automated Online Monitoring of Distributed Applications through External Monitors," IEEE Transactions on Dependable and Secure Computing, vol. 3, no. 2, pp. 115-129, April-June 2006, doi:10.1109/TDSC.2006.17