Fault Localization via Risk Modeling
October-December 2010 (vol. 7 no. 4)
pp. 396-409
Ramana Rao Kompella, Purdue University, West Lafayette
Jennifer Yates, AT&T Labs - Research, Florham Park
Albert Greenberg, Microsoft Research, Redmond
Alex C. Snoeren, University of California, San Diego, La Jolla
Internet backbone networks are in constant flux in order to keep up with demand and offer new features. The pace of change in technology often outstrips the pace of introduction of associated fault monitoring capabilities that are built into today's IP protocols and routers. Moreover, some of these new technologies cross networking layers, raising the potential for unanticipated interactions and service disruptions, which the individual layers' built-in monitoring capabilities may not detect. In these instances, operators typically employ higher-layer monitoring techniques such as end-to-end liveness probing to detect lower-layer or cross-layer failures, but lack tools to precisely determine where a detected failure may have occurred. In this paper, we evaluate the effectiveness of using risk modeling to translate high-level failure notifications into lower-layer root causes in two specific scenarios in a tier-1 ISP. We show that a simple greedy heuristic works with accuracy exceeding 80 percent for many failure scenarios in simulation, while delivering extremely high precision (greater than 80 percent). We report our operational experience using risk modeling to isolate optical component and MPLS control plane failures in an ISP backbone.
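
To make the idea of a greedy heuristic over a risk model concrete, the following is a minimal sketch, assuming the problem is framed as a plain greedy set cover: each lower-layer component (a "risk group") maps to the set of higher-layer links that would fail with it, and groups are picked until every observed failure is explained. This is an illustration only, not necessarily the authors' algorithm. The function name `greedy_localize` and the component and link names are hypothetical and do not come from the paper.

```python
def greedy_localize(risk_groups, failed_links):
    """Greedy set cover over a risk model (illustrative sketch).

    risk_groups: dict mapping a component name to the set of links
                 that fail if that component fails (hypothetical model).
    failed_links: iterable of links currently observed as failed.
    Returns (hypothesis, unexplained): the chosen components in pick
    order, and any failed links that no risk group accounts for.
    """
    unexplained = set(failed_links)
    hypothesis = []
    while unexplained:
        # Pick the group that explains the most remaining failures;
        # iterating in sorted order breaks ties by name, deterministically.
        best = max(sorted(risk_groups),
                   key=lambda g: len(risk_groups[g] & unexplained))
        covered = risk_groups[best] & unexplained
        if not covered:
            break  # remaining failures match no known risk group
        hypothesis.append(best)
        unexplained -= covered
    return hypothesis, unexplained


# Hypothetical risk model: one fiber span, one router, one amplifier.
example_model = {
    "fiber_A": {"l1", "l2", "l3"},
    "router_X": {"l3", "l4"},
    "amp_B": {"l4"},
}
example_result = greedy_localize(example_model, ["l1", "l2", "l3", "l4"])
```

In this toy model the sketch first selects the component shared by the most failed links, then continues with whatever explains the remainder. Failures that match no risk group are returned separately rather than forced onto a component.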

Index Terms:
Fault localization, risk modeling, MPLS, OSPF, backbone networks.
Ramana Rao Kompella, Jennifer Yates, Albert Greenberg, Alex C. Snoeren, "Fault Localization via Risk Modeling," IEEE Transactions on Dependable and Secure Computing, vol. 7, no. 4, pp. 396-409, Oct.-Dec. 2010, doi:10.1109/TDSC.2009.37