Issue No.03 - March (2014 vol.26)
pp: 739-751
Nenad Tomasev, Jozef Stefan Institute, Artificial Intelligence Laboratory and Jozef Stefan International Postgraduate School, Ljubljana
Dunja Mladenic, Jozef Stefan Institute, Artificial Intelligence Laboratory and Jozef Stefan International Postgraduate School, Ljubljana
High-dimensional data arise naturally in many domains, and have regularly presented a great challenge for traditional data mining techniques, both in terms of effectiveness and efficiency. Clustering becomes difficult due to the increasing sparsity of such data, as well as the increasing difficulty in distinguishing distances between data points. In this paper, we take a novel perspective on the problem of clustering high-dimensional data. Instead of attempting to avoid the curse of dimensionality by observing a lower dimensional feature subspace, we embrace dimensionality by taking advantage of inherently high-dimensional phenomena. More specifically, we show that hubness, i.e., the tendency of high-dimensional data to contain points (hubs) that frequently occur in $k$-nearest-neighbor lists of other points, can be successfully exploited in clustering. We validate our hypothesis by demonstrating that hubness is a good measure of point centrality within a high-dimensional data cluster, and by proposing several hubness-based clustering algorithms, showing that major hubs can be used effectively as cluster prototypes or as guides during the search for centroid-based cluster configurations. Experimental results demonstrate good performance of our algorithms in multiple settings, particularly in the presence of large quantities of noise. The proposed methods are tailored mostly for detecting approximately hyperspherical clusters and need to be extended to properly handle clusters of arbitrary shapes.
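To make the notion of hubness concrete, the following minimal Python sketch counts, for every point, how many $k$-nearest-neighbor lists of other points it appears in. It is an illustration only, not the authors' implementation: the function name `hubness_scores`, the use of Euclidean distance, and the brute-force pairwise distance computation are assumptions made here for brevity.

```python
import numpy as np

def hubness_scores(X, k):
    """Count how often each point occurs in the k-NN lists of other points.

    Assumptions (not from the paper): Euclidean distance and a brute-force
    O(n^2) distance matrix, which is only practical for small data sets.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # Pairwise Euclidean distances between all points.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # A point must not be counted as its own neighbor.
    np.fill_diagonal(dist, np.inf)
    # Indices of the k closest points for every row.
    knn = np.argsort(dist, axis=1)[:, :k]
    # Occurrence count of each index across all k-NN lists.
    return np.bincount(knn.ravel(), minlength=n)

# Toy one-dimensional example with no distance ties.
points = np.array([[0.0], [1.0], [3.0], [10.0]])
scores = hubness_scores(points, k=1)
top_hub = int(np.argmax(scores))  # index of the point with the highest score
```

In this toy example the point at 1.0 is the nearest neighbor of both 0.0 and 3.0, so it receives the highest score, while the isolated point at 10.0 appears in no neighbor list. Points with high scores are the hubs that the abstract proposes as candidate cluster prototypes.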