Outlier Mining in Large High-Dimensional Data Sets
February 2005 (vol. 17 no. 2)
pp. 203-215
This paper proposes a new definition of distance-based outlier and an algorithm, called HilOut, designed to efficiently detect the top n outliers of a large, high-dimensional data set. Given an integer k, the weight of a point is defined as the sum of the distances separating it from its k nearest neighbors. Outliers are the points with the largest weights. HilOut uses a space-filling curve to linearize the data set and consists of two phases. The first phase provides an approximate solution, within a rough factor, after at most d + 1 sorts and scans of the data set, with time cost quadratic in d and linear in N and in k, where d is the number of dimensions and N is the number of points in the data set. During this phase, the algorithm isolates candidate outliers and reduces this candidate set at each iteration. If the size of the set becomes n, the algorithm stops and reports the exact solution. Otherwise, the second phase calculates the exact solution with a final scan that further examines the candidate outliers remaining after the first phase. Experimental results show that the algorithm always stops during the first phase, reporting the exact solution, after far fewer than d + 1 steps. We present both an in-memory and a disk-based implementation of HilOut, along with a thorough scaling analysis on real and synthetic data sets showing that the algorithm scales well in both cases.
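To make the weight definition concrete, the following is a minimal brute-force sketch in Python. It is not HilOut: it compares every pair of points, so its cost is quadratic in N, and it uses no space-filling curve. The function names `knn_weight` and `top_n_outliers`, the sample points, and the choice of Euclidean distance are assumptions made for illustration only; the abstract does not fix a particular metric.

```python
import heapq
import math


def knn_weight(points, i, k):
    """Weight of points[i]: sum of the distances to its k nearest neighbors.

    Assumes Euclidean distance (an illustrative choice, not taken from the abstract).
    """
    dists = [math.dist(points[i], q) for j, q in enumerate(points) if j != i]
    return sum(heapq.nsmallest(k, dists))


def top_n_outliers(points, n, k):
    """Indices of the n points with the largest weight, heaviest first."""
    weights = [knn_weight(points, i, k) for i in range(len(points))]
    return heapq.nlargest(n, range(len(points)), key=weights.__getitem__)


# Hypothetical data: four points on a unit square plus one far-away point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
print(top_n_outliers(points, n=1, k=2))  # prints [4]
```

With k = 2, each corner of the unit square has weight 2.0 (two neighbors at distance 1), while the isolated point at (10, 10) has a much larger weight, so it is reported as the top outlier. An exhaustive computation like this is the kind of baseline that an algorithm such as HilOut aims to avoid on large data sets.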

Index Terms:
Outlier mining, space-filling curves.
Citation:
Fabrizio Angiulli, Clara Pizzuti, "Outlier Mining in Large High-Dimensional Data Sets," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 2, pp. 203-215, Feb. 2005, doi:10.1109/TKDE.2005.31