
Issue No.10 - October (2009 vol.21)

pp: 1432-1446

Ying-Ju Chen, National Taiwan University, Taipei

De-Nian Yang, National Taiwan University, Taipei

Ming-Syan Chen, National Taiwan University, Taipei

DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TKDE.2008.207

ABSTRACT

In this paper, we first study an important but unsolved dilemma in the subspace clustering literature, referred to as the “information overlapping-data coverage” challenge. Current subspace clustering solutions usually invoke a grid-based Apriori-like procedure to identify dense regions and then construct subspace clusters from them. Because of the monotonicity property of Apriori-like procedures, if a region is identified as dense, all of its projected regions are also identified as dense, so overlapping and redundant clustering information is inevitably reported to users when clusters are generated from such highly correlated regions. However, naive methods for filtering redundant clusters run into the other side of the dilemma, called the “data coverage” issue. Two clusters may have highly correlated dense regions while their data members differ substantially. Arbitrarily removing one of them may lose clustering information for part of the data, and is thus likely to yield an incomplete and biased clustering result. We therefore propose an innovative algorithm, called “NOnRedundant Subspace Cluster mining” (NORSC), to efficiently discover a succinct collection of subspace clusters while maintaining the required degree of data coverage. To resolve the information overlapping problem, NORSC avoids generating redundant clusters, that is, clusters most of whose data are covered by higher dimensional clusters; to cope with the data coverage problem, it also limits the information loss. As shown by our experimental results, NORSC is very effective in identifying a concise and small set of subspace clusters, while incurring a time cost orders of magnitude lower than that of previous works.
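The redundancy criterion in the abstract can be illustrated with a small sketch. This is not the NORSC algorithm, whose details the abstract does not give. It only shows the idea of dropping a lower dimensional cluster when most of its data members are already covered by clusters in higher dimensional subspaces. The names `is_redundant`, `filter_redundant`, and `coverage_threshold`, the value 0.8, and the one-cluster-per-subspace representation are all assumptions made for this example.

```python
def is_redundant(members, higher_dim_clusters, coverage_threshold=0.8):
    """True if enough of `members` is covered by the given higher dimensional clusters.

    `coverage_threshold` is a hypothetical parameter, not a value from the paper.
    """
    if not members:
        return True
    # Points of this cluster that also appear in some higher dimensional cluster.
    covered = set().union(*higher_dim_clusters) & members
    return len(covered) / len(members) >= coverage_threshold


def filter_redundant(clusters, coverage_threshold=0.8):
    """Keep only clusters that are not redundant.

    `clusters` maps a subspace (frozenset of dimension indices) to the set of
    point ids in its cluster. Assumption: one cluster per subspace.
    """
    kept = {}
    # Visit higher dimensional subspaces first, so that each cluster is
    # compared against the kept clusters of its strict superset subspaces.
    for dims in sorted(clusters, key=len, reverse=True):
        members = clusters[dims]
        supers = [m for d, m in kept.items() if dims < d]
        if not is_redundant(members, supers, coverage_threshold):
            kept[dims] = members
    return kept


# Toy data: two 1-D clusters are projections of the same 2-D dense region.
clusters = {
    frozenset({0, 1}): {1, 2, 3, 4, 5},
    frozenset({0}): {1, 2, 3, 4, 5, 6},     # 5 of 6 members covered: redundant
    frozenset({1}): {1, 2, 7, 8, 9, 10},    # 2 of 6 members covered: kept
}
kept = filter_redundant(clusters)
```

In this toy input, both one-dimensional clusters come from regions correlated with the two-dimensional one, but only the cluster in subspace `{0}` is dropped. The cluster in subspace `{1}` holds mostly different data members, so removing it would sacrifice data coverage, which is the other side of the dilemma the abstract describes.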

INDEX TERMS

Data mining, subspace clustering, redundancy filtering.

CITATION

Ying-Ju Chen, De-Nian Yang, Ming-Syan Chen, "Reducing Redundancy in Subspace Clustering",

*IEEE Transactions on Knowledge & Data Engineering*, vol. 21, no. 10, pp. 1432-1446, October 2009, doi:10.1109/TKDE.2008.207