Toward Optimal Network Fault Correction in Externally Managed Overlay Networks
March 2010 (vol. 21 no. 3)
pp. 354-366
Patrick P.C. Lee, Chinese University of Hong Kong, Hong Kong
Vishal Misra, Columbia University, New York
Dan Rubenstein, Columbia University, New York
We consider an end-to-end approach to inferring probabilistic data-forwarding failures in an externally managed overlay network, where overlay nodes are independently operated by various administrative domains. Our optimization goal is to minimize the expected cost of correcting (i.e., diagnosing and repairing) all faulty overlay nodes that cannot properly deliver data. Instead of first checking the most likely faulty nodes, as in conventional fault localization problems, we prove that an optimal strategy should start by checking one of the candidate nodes, which are identified using a potential function that we develop. We propose several efficient heuristics for inferring the best node to check in large-scale networks. Through extensive simulation, we show that we can infer the best node at least 95 percent of the time, and that first checking the candidate nodes rather than the most likely faulty nodes can decrease the checking cost of correcting all faulty nodes.
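
The gap between "most likely faulty first" and a cost-minimizing order can be illustrated with a small toy model. The sketch below is not the paper's algorithm or its potential function. It assumes a single failed path, independent node failure probabilities `p`, per-node checking costs `c`, a check that always finds and repairs a faulty node, and a free end-to-end retest after each check. All names and numbers are hypothetical.

```python
from itertools import permutations
from math import prod


def expected_cost(order, p, c):
    """Expected checking cost of a fixed order, given that the path has failed.

    Toy assumptions (not taken from the paper): independent failures,
    checking a node repairs it, and the path is retested for free after
    every check, so checking stops once no faulty node remains.
    """
    # Probability that at least one node on the path is faulty.
    p_bad = 1 - prod(1 - p[i] for i in order)
    total = 0.0
    for k, node in enumerate(order):
        # The k-th node is checked only if a fault remains among order[k:].
        still_bad = 1 - prod(1 - p[j] for j in order[k:])
        total += c[node] * still_bad / p_bad
    return total


def most_likely_first(p):
    """Conventional baseline: check nodes in decreasing failure probability."""
    return tuple(sorted(range(len(p)), key=lambda i: -p[i]))


def brute_force_best(p, c):
    """Exhaustively find the order with minimum expected cost (small n only)."""
    return min(permutations(range(len(p))),
               key=lambda order: expected_cost(order, p, c))


# Hypothetical inputs: node 0 is the most likely to be faulty but costly to check.
p = [0.5, 0.4, 0.1]
c = [10.0, 1.0, 1.0]

naive = most_likely_first(p)
best = brute_force_best(p, c)
print("most likely first:", naive, round(expected_cost(naive, p, c), 3))
print("brute-force best: ", best, round(expected_cost(best, p, c), 3))
```

In this toy setting the exhaustive search prefers to check the cheap nodes before the most likely faulty one, which is the kind of effect the abstract describes. Exhaustive search is only feasible for a handful of nodes, which is why heuristics matter at scale.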

Index Terms:
Network management, network diagnosis and correction, fault localization and repair, reliability engineering.
Patrick P.C. Lee, Vishal Misra, Dan Rubenstein, "Toward Optimal Network Fault Correction in Externally Managed Overlay Networks," IEEE Transactions on Parallel and Distributed Systems, vol. 21, no. 3, pp. 354-366, March 2010, doi:10.1109/TPDS.2009.66