Reliable Distributed Systems, IEEE Symposium on (2009)
Niagara Falls, New York
Sept. 27, 2009 to Sept. 30, 2009
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/SRDS.2009.22
As the size of a centrally managed IP network increases, the cost of monitoring network devices and the number of reported events increase super-linearly. This in turn degrades the performance of the event correlation engine that is responsible for suppressing dependent events and escalating root cause events to a network administrator. To solve this scalability problem, we propose a distributed framework that partitions the network into smaller management domains and enables concurrent monitoring and event correlation in those domains. The gain in performance, however, comes with the challenge of correlating cross-domain events, which occur when a failure in one domain induces events in other domains. In this paper, we investigate such situations and show that, in the worst case, it is impossible to determine the root cause. We propose a two-step approach to solve this problem. First, we define a property called route-closure, which, if satisfied by every partition, not only minimizes the number of cross-domain events but also eliminates cases in which root cause analysis may be inconclusive. We also describe a technology-centric partitioning mechanism that constructs partitions satisfying the route-closure property. Next, we propose a distributed architecture to efficiently identify and correlate cross-domain events. We use a commercial network management system to implement our distributed framework and run experiments by injecting synthetic events on large, real network topologies. Our experimental results show that our approach can manage over 200,000 managed entities and handle bursts of 15,000 events in under five minutes without compromising the efficacy of event correlation.
Fault Diagnosis, Event Correlation, Network Management
Dipyaman Banerjee, Venkateswara Madduri, Mudhakar Srivatsa, "A Framework for Distributed Monitoring and Root Cause Analysis for Large IP Networks", Reliable Distributed Systems, IEEE Symposium on, pp. 246-255, 2009, doi:10.1109/SRDS.2009.22