Failure Detection in Large-Scale Internet Services by Principal Subspace Mapping
October 2007 (vol. 19 no. 10)
pp. 1308-1320
Fast and accurate failure detection is becoming essential in managing large-scale Internet services. This paper proposes a novel detection approach based on the subspace mapping between system inputs and internal measurements. By exploring these contextual dependencies, our detector can initiate repair actions accurately, increasing the availability of the system. While a classical statistical method, canonical correlation analysis (CCA), is presented in the paper to achieve subspace mapping, we also propose a more advanced technique, principal canonical correlation analysis (PCCA), to improve the performance of the CCA-based detector. PCCA extracts a principal subspace from the internal measurements that is not only highly correlated with the inputs but also a significant representative of the original measurements. Experimental results on a J2EE-based web application demonstrate that this property of PCCA is especially beneficial to failure detection tasks.
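To make the idea of a CCA-based detector concrete, the following is a minimal sketch of classical CCA used as a residual detector. It is an illustration under stated assumptions, not the paper's implementation: the synthetic data, the mixing matrix A, the residual score, the 99.5% threshold, and the helper names (inv_sqrt, fit_cca, score) are all hypothetical choices made for this example, and PCCA itself is not shown because the abstract does not specify it.

```python
import numpy as np

def inv_sqrt(S, eps=1e-8):
    # Inverse square root of a symmetric positive-definite matrix.
    w, V = np.linalg.eigh(S)
    return V @ np.diag(1.0 / np.sqrt(np.maximum(w, eps))) @ V.T

def fit_cca(X, Y, k):
    # Classical CCA: whiten both blocks, then take the SVD of the
    # whitened cross-covariance. Singular values are the canonical correlations.
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mx, Y - my
    n = X.shape[0]
    Sxx, Syy, Sxy = Xc.T @ Xc / (n - 1), Yc.T @ Yc / (n - 1), Xc.T @ Yc / (n - 1)
    Kx, Ky = inv_sqrt(Sxx), inv_sqrt(Syy)
    U, rho, Vt = np.linalg.svd(Kx @ Sxy @ Ky)
    return mx, my, Kx @ U[:, :k], Ky @ Vt.T[:, :k], rho[:k]

def score(model, X, Y):
    # Assumed anomaly score: squared distance between the measurement
    # variates v and their prediction rho * u from the input variates.
    mx, my, Wx, Wy, rho = model
    u, v = (X - mx) @ Wx, (Y - my) @ Wy
    return np.sum((v - rho * u) ** 2, axis=1)

rng = np.random.default_rng(0)
# Hypothetical system: 3 inputs drive 5 internal measurements through A.
A = np.array([[1., 0., 0., 1., 0.],
              [0., 1., 0., 0., 1.],
              [0., 0., 1., 0., 0.]])

def measure(X):
    return X @ A + 0.1 * rng.normal(size=(X.shape[0], A.shape[1]))

X_train = rng.normal(size=(2000, 3))
model = fit_cca(X_train, measure(X_train), k=3)
threshold = np.quantile(score(model, X_train, measure(X_train)), 0.995)

# Healthy data: measurements follow the inputs.
X_ok = rng.normal(size=(500, 3))
false_alarm_rate = np.mean(score(model, X_ok, measure(X_ok)) > threshold)

# Simulated failure: measurements no longer depend on the observed inputs.
X_seen = rng.normal(size=(200, 3))
Y_bad = measure(rng.normal(size=(200, 3)))
detection_rate = np.mean(score(model, X_seen, Y_bad) > threshold)

print("false alarm rate:", false_alarm_rate)
print("detection rate:", detection_rate)
```

The score checks the input-to-measurement relationship rather than the measurements alone, so a measurement vector that looks normal in isolation can still be flagged when it is inconsistent with the current inputs.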

Index Terms:
failure detection, subspace mapping, correlation analysis, autonomic computing, Internet services
Haifeng Chen, Guofei Jiang, Kenji Yoshihira, "Failure Detection in Large-Scale Internet Services by Principal Subspace Mapping," IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 10, pp. 1308-1320, Oct. 2007, doi:10.1109/TKDE.2007.190633