Issue No.03 - March (2013 vol.25)
pp: 589-602
Shu Wu, Chinese Academy of Sciences, Beijing
Shengrui Wang, University of Sherbrooke, Sherbrooke
ABSTRACT
Outlier detection is usually treated as a preprocessing step for locating, in a data set, those objects that do not conform to well-defined notions of expected behavior. It is very important in data mining for discovering novel or rare events, anomalies, malicious actions, and exceptional phenomena. We investigate outlier detection for categorical data sets. This problem is especially challenging because it is difficult to define a meaningful similarity measure for categorical data. In this paper, we propose a formal definition of outliers and an optimization model of outlier detection, via a new concept of holoentropy that takes both entropy and total correlation into consideration. Based on this model, we define a function for the outlier factor of an object which is solely determined by the object itself and can be updated efficiently. We propose two practical one-parameter outlier detection methods, named ITB-SS and ITB-SP, which require no user-defined parameters for deciding whether an object is an outlier. Users need only provide the number of outliers they want to detect. Experimental results show that ITB-SS and ITB-SP are more effective and efficient than mainstream methods and can be used to deal with both large and high-dimensional data sets where existing algorithms fail.
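The idea of scoring objects by holoentropy can be illustrated with a minimal sketch. It assumes the standard (Watanabe) definition of total correlation, under which entropy plus total correlation equals the sum of the per-attribute entropies. It is not the ITB-SS or ITB-SP method: it omits attribute weighting, and instead of the paper's efficiently updatable outlier factor it recomputes holoentropy by brute force for each candidate removal. The function names and the toy data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of categorical values."""
    n = len(values)
    if n == 0:
        return 0.0
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def holoentropy(rows):
    """Entropy plus total correlation, computed as the sum of per-attribute
    entropies (assumption: unweighted attributes)."""
    if not rows:
        return 0.0
    return sum(entropy(col) for col in zip(*rows))


def top_outliers(rows, o):
    """Return indices of the o objects whose individual removal lowers
    holoentropy the most (brute force, illustrative only)."""
    base = holoentropy(rows)
    drops = []
    for i in range(len(rows)):
        rest = rows[:i] + rows[i + 1:]
        drops.append((base - holoentropy(rest), i))
    # Largest decrease first; ties broken by original position.
    drops.sort(key=lambda t: (-t[0], t[1]))
    return [i for _, i in drops[:o]]


# Toy data set: four identical objects and one that differs on both attributes.
data = [
    ("red", "small"),
    ("red", "small"),
    ("red", "small"),
    ("red", "small"),
    ("blue", "large"),
]
print(top_outliers(data, 1))  # → [4]
```

As in the abstract, the only input besides the data is the requested number of outliers `o`. The brute-force loop costs one full holoentropy computation per object, which is the expense an efficiently updatable outlier factor is meant to avoid.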
INDEX TERMS
Information retrieval, search methods, mutual information, greedy algorithms, complexity theory, outlier detection, holoentropy, total correlation, outlier factor, attribute weighting
CITATION
Shu Wu, Shengrui Wang, "Information-Theoretic Outlier Detection for Large-Scale Categorical Data", IEEE Transactions on Knowledge & Data Engineering, vol.25, no. 3, pp. 589-602, March 2013, doi:10.1109/TKDE.2011.261