Issue No.01 - January (2010 vol.22)

pp: 16-30

Yi-Hong Chu, National Taiwan University, Taipei

Jen-Wei Huang, Yuan Ze University, Chung-Li

Kun-Ta Chuang, National Taiwan University, Taipei

De-Nian Yang, Academia Sinica, Nankang

Ming-Syan Chen, Academia Sinica, Nankang, and National Taiwan University, Taipei

DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TKDE.2008.224

ABSTRACT

Instead of finding clusters in the full feature space, subspace clustering is an emergent task which aims at detecting clusters embedded in subspaces. Most previous works in the literature are density-based approaches, where a cluster is regarded as a high-density region in a subspace. However, the identification of dense regions in previous works fails to consider a critical problem, called "the density divergence problem" in this paper, which refers to the phenomenon that region densities vary across subspace cardinalities. Without considering this problem, previous works utilize a single density threshold to discover the dense regions in all subspaces, which incurs a serious loss of clustering accuracy (in either the recall or the precision of the resulting clusters) in different subspace cardinalities. To tackle the density divergence problem, in this paper, we devise a novel subspace clustering model that discovers clusters based on the relative region densities in the subspaces, where clusters are regarded as regions whose densities are relatively high compared to the other region densities in a subspace. Based on this idea, different density thresholds are adaptively determined to discover the clusters in different subspace cardinalities. Because previous techniques cannot feasibly be applied to this novel clustering model, we also devise an innovative algorithm, referred to as DENCOS (DENsity COnscious Subspace clustering), which adopts a divide-and-conquer scheme to efficiently discover clusters satisfying different density thresholds in different subspace cardinalities. As validated by our extensive experiments on various data sets, DENCOS can discover the clusters in all subspaces with high quality, and DENCOS outperforms previous works in efficiency.
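The density divergence described in the abstract can be illustrated with a small grid-based sketch. This is not the DENCOS algorithm. It assumes, purely for illustration, that each dimension is normalized to [0, 1) and split into `bins` equal-width intervals, and that a region counts as dense when its point count reaches `alpha` times the average region count for that subspace cardinality. The function names and the parameters `bins` and `alpha` are hypothetical and do not come from the paper.

```python
from collections import Counter

def unit_counts(points, dims, bins):
    """Count points per grid unit in the subspace spanned by `dims`.

    Assumes every coordinate lies in [0, 1); each dimension is cut into
    `bins` equal-width intervals (an illustrative choice).
    """
    return Counter(
        tuple(min(int(p[d] * bins), bins - 1) for d in dims) for p in points
    )

def average_unit_density(n_points, cardinality, bins):
    """Average number of points per unit in a subspace with `cardinality` dimensions.

    There are bins ** cardinality units, so the average count shrinks
    geometrically as the cardinality grows.
    """
    return n_points / bins ** cardinality

def dense_units(points, dims, bins, alpha):
    """Return units whose count is at least `alpha` times the subspace average.

    The threshold adapts to the subspace cardinality instead of being one
    fixed value for all subspaces (hypothetical rule, for illustration only).
    """
    threshold = alpha * average_unit_density(len(points), len(dims), bins)
    counts = unit_counts(points, dims, bins)
    return {unit: c for unit, c in counts.items() if c >= threshold}
```

With 1,000 points and 10 intervals per dimension, `average_unit_density` gives 100 points per unit in one-dimensional subspaces but only 10 in two-dimensional ones, so a fixed threshold tuned for one cardinality is either too strict or too loose for the other. Scaling the threshold by the per-cardinality average is one simple way to make the comparison relative.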

INDEX TERMS

Data mining, data clustering, subspace clustering.

CITATION

Yi-Hong Chu, Jen-Wei Huang, Kun-Ta Chuang, De-Nian Yang, Ming-Syan Chen, "Density Conscious Subspace Clustering for High-Dimensional Data",

*IEEE Transactions on Knowledge & Data Engineering*, vol. 22, no. 1, pp. 16-30, January 2010, doi:10.1109/TKDE.2008.224