This Article 
   
 Share 
   
 Bibliographic References 
   
 Add to: 
 
Digg
Furl
Spurl
Blink
Simpy
Google
Del.icio.us
Y!MyWeb
 
 Search 
   
Distance-Based Detection and Prediction of Outliers
February 2006 (vol. 18 no. 2)
pp. 145-160
A distance-based outlier detection method that finds the top outliers in an unlabeled data set and provides a subset of it, called outlier detection solving set, that can be used to predict the outlierness of new unseen objects, is proposed. The solving set includes a sufficient number of points that permits the detection of the top outliers by considering only a subset of all the pairwise distances from the data set. The properties of the solving set are investigated, and algorithms for computing it, with subquadratic time requirements, are proposed. Experiments on synthetic and real data sets to evaluate the effectiveness of the approach are presented. A scaling analysis of the solving set size is performed, and the false positive rate, that is, the fraction of new objects misclassified as outliers using the solving set instead of the overall data set, is shown to be negligible. Finally, to investigate the accuracy in separating outliers from inliers, ROC analysis of the method is accomplished. Results obtained show that using the solving set instead of the data set guarantees a comparable quality of the prediction, but at a lower computational cost.

[1] F. Angiulli and C. Pizzuti, “Outlier Mining in Large High-Dimensional Data Sets,” IEEE Trans. Knowledge and Data Eng., vol. 2, no. 17, pp. 203-215, Feb. 2005.
[2] A. Arning, C. Aggarwal, and P. Raghavan, “A Linear Method for Deviation Detection in Large Databases,” Proc. Int'l Conf. Knowledge Discovery and Data Mining (KDD '96), pp. 164-169, 1996.
[3] V. Barnett and T. Lewis, Outliers in Statistical Data. John Wiley & Sons, 1994.
[4] S.D. Bay and M. Schwabacher, “Mining Distance-Based Outliers in Near Linear Time with Randomization and a Simple Pruning Rule,” Proc. Int'l Conf. Knowledge Discovery and Data Mining (KDD '03), 2003.
[5] M.M. Breunig, H. Kriegel, R.T. Ng, and J. Sander, “LOF: Identifying Density-Based Local Outliers,” Proc. Int'l Conf. Managment of Data (SIGMOD '00), 2000.
[6] Defense Advanced Research Projects Agency DARPA, “Intrusion Detection Evaluation,” http://www.ll.mit.edu/IST/idevalindex. html , 1998.
[7] E. Eskin, A. Arnold, M. Prerau, L. Portnoy, and S. Stolfo, “A Geometric Framework for Unsupervised Anomaly Detection: Detecting Intrusions in Unlabeled Data,” Applications of Data Mining in Computer Security, Kluwer, 2002.
[8] T. Fawcett and F. Provost, “Adaptive Fraud Detection,” Data Mining and Knowledge Discovery, vol. 1, pp. 291-316, 1997.
[9] C. Feng, A. Sutherland, S. King, S. Muggleton, and R. Henery, “Comparison of Machine Learning Classifiers to Statistics and Neural Networks,” Proc. AI & Stats Conf. 93, 1993.
[10] S. Floyd and M. Warmuth, “Sample Compression, Learnability, and the Vapnik-Chervonenkis Dimension,” Machine Learning, vol. 211, no. 3, pp. 269-304, 1995.
[11] V. Gaede and O. Günther, “Multidimensional Access Methods,” ACM Computing Surveys, vol. 30, no. 2, pp. 170-231, 1998.
[12] M.R. Garey and D.S. Johnson, Computer and Intractability. New York: W.H. Freeman and Company, 1979.
[13] J. Han and M. Kamber, Data Mining, Concepts and Technique. San Francisco: Morgan Kaufmann, 2001.
[14] W. Jin, A.K.H. Tung, and J. Han, “Mining Top-n Local Outliers in Large Databases,” Proc. ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '01), 2001.
[15] E. Knorr and R. Ng, “Algorithms for Mining Distance-Based Outliers in Large Datasets,” Proc. Int'l Conf. Very Large Databases (VLDB '98), pp. 392-403, 1998.
[16] E. Knorr and R. Ng, “Finding Intensional Knowledge of Distance-Based Outliers,” Proc. Int'l Conf. Very Large Databases (VLDB '99), pp. 211-222, 1999.
[17] E. Knorr, R. Ng, and V. Tucakov, “Distance-Based Outlier: Algorithms and Applications,” VLDB J., vol. 8, nos. 3-4, pp. 237-253, 2000.
[18] A. Lazarevic, L. Ertoz, V. Kumar, A. Ozgur, and J. Srivastava, “A Comparative Study of Anomaly Detection Schemes in Network Intrusion Detection,” Proc. SIAM Int'l Conf. Data Mining (SIAM-03), 2003.
[19] W. Lee, S.J. Stolfo, and K.W. Mok, “Mining Audit Data to Build Intrusion Detection Models,” Proc. Int'l Conf Knowledge Discovery and Data Mining (KDD-98), pp. 66-72, 1998.
[20] L. Mangasarian and W.H. Wolberg, “Cancer Diagnosis via Linear Programming,” SIAM News, vol. 25, no. 5, pp. 1-18, 1990.
[21] D. Peleg, G. Schechtman, and A. Wool, “Randomized Approximation of Bounded Multicovering Problems,” Algorithmica, vol. 18, no. 1, pp. 44-66, 1997.
[22] F. Provost, T. Fawcett, and R. Kohavi, “The Case Against Accuracy Estimation for Comparing Induction Algorithms,” Proc. Int'l Conf. Machine Learning (ICML '98), 1998.
[23] S. Ramaswamy, R. Rastogi, and K. Shim, “Efficient Algorithms for Mining Outliers from Large Data Sets,” Proc. Int'l Conf. Managment of Data (SIGMOD '00), pp. 427-438, 2000.
[24] B. Schölkopf, C. Burges, and V. Vapnik, “Extracting Support Data for a Given Task,” Proc First Int'l Conf. Knowledge Discovery and Data Mining, pp. 252-257, 1995.
[25] L. Torgo and R. Ribeiro, “Predicting Outliers,” Proc. Int'l Conf. Principles of Data Mining and Knowledge Discovery (PKDD '03), pp. 447-458, 2003.
[26] K. Yamanishi and J. Takeuchi, “Discovering Outlier Filtering Rules from Unlabeled Data,” Proc. ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '01), pp. 389-394, 2001.

Index Terms:
Index Terms- Distance-based outliers, outlier detection, outlier prediction, data mining.
Citation:
Fabrizio Angiulli, Stefano Basta, Clara Pizzuti, "Distance-Based Detection and Prediction of Outliers," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 2, pp. 145-160, Feb. 2006, doi:10.1109/TKDE.2006.29
Usage of this product signifies your acceptance of the Terms of Use.