Comparing Subspace Clusterings
July 2006 (vol. 18 no. 7)
pp. 902-916
We present the first framework for comparing subspace clusterings. We propose several distance measures for subspace clusterings, including generalizations of well-known distance measures for ordinary clusterings. We describe a set of properties that are important for any measure comparing subspace clusterings and give a systematic comparison of our proposed measures in terms of these properties. We validate the usefulness of our subspace clustering distance measures by comparing clusterings produced by the algorithms FastDOC, HARP, PROCLUS, ORCLUS, and SSPC. We show that our distance measures can also be used to compare partial clusterings, overlapping clusterings, and patterns in binary data matrices.
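The abstract does not spell out the measures themselves, so the following Python sketch only illustrates the general idea under stated assumptions: each subspace cluster is taken to be a pair (set of objects, set of dimensions), a clustering is identified with the set of data-matrix cells its clusters cover, and two clusterings are compared by a Jaccard-style distance on those cell sets, a set-overlap measure in the spirit of [49]. The function names and toy data below are hypothetical and are not taken from the paper.

# A minimal sketch, not the paper's exact definitions.

def covered_cells(clustering):
    # `clustering` is a list of (objects, dimensions) pairs, each given as an
    # iterable of indices. Returns the union of (object, dimension) cells
    # covered by the clustering.
    cells = set()
    for objects, dimensions in clustering:
        cells.update((o, d) for o in objects for d in dimensions)
    return cells

def jaccard_distance(clustering_a, clustering_b):
    # 1 - |A intersect B| / |A union B| on the covered cells of the two clusterings.
    a, b = covered_cells(clustering_a), covered_cells(clustering_b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

if __name__ == "__main__":
    # Two toy subspace clusterings of a 4x4 data matrix.
    c1 = [({0, 1}, {0, 1}), ({2, 3}, {2, 3})]
    c2 = [({0, 1, 2}, {0, 1}), ({3}, {2, 3})]
    print(jaccard_distance(c1, c2))  # 0.4: the clusterings disagree on 4 of the 10 covered cells

Because it operates on sets of covered cells rather than on partitions of the objects, a measure of this kind applies equally to partial and overlapping clusterings, which is the flexibility the abstract highlights.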

[1] C. Aggarwal, C. Procopiuc, J. Wolf, P. Yu, and J. Park, “A Framework for Finding Projected Clusters in High Dimensional Spaces,” Proc. ACM SIGMOD Int'l Conf. Management of Data, 1999.
[2] C.C. Aggarwal and P.S. Yu, “Finding Generalized Projected Clusters in High Dimensional Spaces,” Proc. ACM SIGMOD Int'l Conf. Management of Data, pp. 70-81, 2000, citeseer.nj.nec.com/aggarwal00finding.html.
[3] K.Y. Yip, D.W. Cheung, and M.K. Ng, “HARP: A Practical Projected Clustering Algorithm,” IEEE Trans. Knowledge and Data Eng., vol. 16, no. 11, pp. 1387-1397, Nov. 2004.
[4] K.Y. Yip, D.W. Cheung, and M.K. Ng, “On Discovery of Extremely Low-Dimensional Clusters Using Semi-Supervised Projected Clustering,” Proc. 21st Int'l Conf. Data Eng., 2005.
[5] P.K. Agarwal and N.H. Mustafa, “K-Means Projective Clustering,” Proc. 23rd ACM SIGACT-SIGMOD-SIGART Symp. Principles of Database Systems, pp. 155-165, 2004.
[6] C.M. Procopiuc, M.T. Jones, P.K. Agarwal, and T.M. Murali, “A Monte Carlo Algorithm for Fast Projective Clustering,” Proc. ACM SIGMOD Int'l Conf. Management of Data, 2002.
[7] Y. Kluger, R. Basri, J.T. Chang, and M. Gerstein, “Spectral Biclustering of Microarray Data: Coclustering Genes and Conditions,” Genome Research, vol. 13, no. 4, pp. 703-716, Apr. 2003.
[8] A. Banerjee, I. Dhillon, J. Ghosh, S. Merugu, and D.S. Modha, “A Generalized Maximum Entropy Approach to Bregman Co-Clustering and Matrix Approximation,” Proc. 10th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, 2004.
[9] H. Cho, I.S. Dhillon, Y. Guan, and S. Sra, “Minimum Sum-Squared Residue Co-Clustering of Gene Expression Data,” Proc. Fourth SIAM Int'l Conf. Data Mining, 2004.
[10] I.S. Dhillon, S. Mallela, and D.S. Modha, “Information-Theoretic Co-Clustering,” Proc. Ninth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, 2003.
[11] G. Getz, E. Levine, and E. Domany, “Coupled Two-Way Clustering Analysis of Gene Microarray Data,” Proc. Nat'l Academy of Sciences, vol. 97, pp. 12079-12084, 2000.
[12] K. Pollard and M.J. van der Laan, “Statistical Inference for Simultaneous Clustering of Gene Expression Data,” Math. Biosciences, vol. 176, no. 1, pp. 99-121, 2002.
[13] J.A. Hartigan, “Direct Clustering of a Data Matrix,” J. Am. Statistical Assoc., vol. 67, no. 337, pp. 123-129, 1972.
[14] J.H. Friedman and J.J. Meulman, “Clustering Objects on Subsets of Attributes,” J. Royal Statistical Soc. B, vol. 66, pp. 1-25, 2004.
[15] E. Oja and J. Parkkinen, “On Subspace Clustering,” Proc. Seventh Int'l Conf. Pattern Recognition, 1984.
[16] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan, “Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications,” Proc. ACM SIGMOD Int'l Conf. Management of Data, pp. 94-105, 1998, citeseer.nj.nec.com/agrawal98automatic.html.
[17] C. Böhm, K. Kailing, P. Kröger, and A. Zimek, “Computing Clusters of Correlation Connected Objects,” Proc. 2004 ACM SIGMOD Int'l Conf. Management of Data, pp. 455-466, 2004.
[18] C.H. Cheng, A.W.-C. Fu, and Y. Zhang, “Entropy-Based Subspace Clustering for Mining Numerical Data,” Proc. ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, pp. 84-93, 1999, citeseer.nj.nec.com/article/cheng99entropybased.html.
[19] Y. Cheng and G.M. Church, “Biclustering of Expression Data,” Proc. Eighth Int'l Conf. Intelligent Systems for Molecular Biology, 2000.
[20] I.S. Dhillon, S. Mallela, and D.S. Modha, “Information-Theoretic Co-Clustering,” Proc. Ninth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, 2003.
[21] C. Domeniconi, D. Papadopoulos, D. Gunopulos, and S. Ma, “Subspace Clustering of High Dimensional Data,” Proc. Fourth SIAM Int'l Conf. Data Mining, 2004.
[22] K. Kailing, H.-P. Kriegel, and P. Kröger, “Density-Connected Subspace Clustering for High-Dimensional Data,” Proc. Fourth SIAM Int'l Conf. Data Mining, pp. 246-257, 2004.
[23] J. Liu, W. Wang, and J. Yang, “A Framework for Ontology-Driven Subspace Clustering,” Proc. 10th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, 2004.
[24] A.A. Melkman and E. Shaham, “Sleeved Coclustering,” Proc. 10th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, 2004.
[25] H. Nagesh, S. Goil, and A. Choudhary, “MAFIA: Efficient and Scalable Subspace Clustering for Very Large Data Sets,” Technical Report 9906-010, Northwestern Univ., 1999.
[26] A. Patrikainen and H. Mannila, “Subspace Clustering of High Dimensional Binary Data— A Probabilistic Approach,” Proc. Fourth SIAM Int'l Conf. Data Mining, Workshop Clustering High Dimensional Data and Its Applications, 2004.
[27] A. Tanay, R. Sharan, and R. Shamir, “Discovering Statistically Significant Biclusters in Gene Expression Data,” Bioinformatics, vol. 18, pp. 136-144, 2002.
[28] J. Yang, W. Wang, H. Wang, and P.S. Yu, “Delta-Cluster: Capturing Subspace Correlation in a Large Data Set,” Proc. 18th IEEE Int'l Conf. Data Eng., pp. 517-528, 2002, citeseer.ist.psu.edu/566513.html.
[29] L. Parsons, E. Haque, and H. Liu, “Evaluating Subspace Clustering Algorithms,” Proc. Fourth SIAM Int'l Conf. Data Mining, Workshop Clustering High Dimensional Data and Its Applications, 2004.
[30] L. Parsons, E. Haque, and H. Liu, “Subspace Clustering for High Dimensional Data: A Review,” ACM SIGKDD Explorations Newsletter, vol. 6, no. 1, 2004.
[31] K.Y. Yip, M.K. Ng, and D.W. Cheung, “A Review on Projected Clustering Algorithms,” Int'l J. Applied Math., vol. 13, pp. 35-47, 2003.
[32] S. Theodoridis and K. Koutroumbas, Pattern Recognition. Academic Press, 1999.
[33] M. Meila, “Comparing Clusterings by the Variation of Information,” Proc. 16th Ann. Conf. Computational Learning Theory, pp. 173-187, 2003.
[34] W.M. Rand, “Objective Criteria for the Evaluation of Clustering Methods,” J. Am. Statistical Assoc., vol. 66, pp. 846-850, 1971.
[35] B. Mirkin, Mathematical Classification and Clustering. Kluwer Academic Press, 1996.
[36] D.L. Wallace, “Comment,” J. Am. Statistical Assoc., vol. 78, no. 383, pp. 569-576, 1983.
[37] T. Lange, V. Roth, M.L. Braun, and J.M. Buhmann, “Stability-Based Validation of Clustering Solutions,” Neural Computation, vol. 16, pp. 1299-1323, 2004.
[38] S. Dudoit and J. Fridlyand, “A Prediction-Based Resampling Method for Estimating the Number of Clusters in a Dataset,” Genome Biology, vol. 3, no. 7, 2002.
[39] E. Levine and E. Domany, “Resampling Method for Unsupervised Estimation of Cluster Validity,” Neural Computation, vol. 13, pp. 2573-2593, 2001.
[40] A. Gionis, H. Mannila, and P. Tsaparas, “Clustering Aggregation,” Proc. 21st Int'l Conf. Data Eng., 2005.
[41] A. Strehl and J. Ghosh, “Cluster Ensembles— A Knowledge Reuse Framework for Combining Multiple Partitions,” J. Machine Learning Research, vol. 3, pp. 583-617, 2002.
[42] A. Topchy, A. Jain, and W. Punch, “A Mixture Model for Clustering Ensembles,” Proc. Fourth SIAM Int'l Conf. Data Mining, 2004.
[43] A. Topchy, M.H. Law, A.K. Jain, and A. Fred, “Analysis of Consensus Partition in Cluster Ensemble,” Proc. Fourth IEEE Int'l Conf. Data Mining, 2004.
[44] A. Topchy, B. Minaei-Bidgoli, A. Jain, and W. Punch, “Adaptive Clustering Ensembles,” Proc. 17th Int'l Conf. Pattern Recognition, pp. 272-275, 2004.
[45] P. Artigas, A. Goldenberg, A. Likhodedov, and R. Caruana, “Meta Clustering,” http://www-2.cs.cmu.edu/~artigas/classproj/mlproj.ps, 2000.
[46] A.K. Jain, A. Topchy, M.H. Law, and J. Buhmann, “Landscape of Clustering Algorithms,” Proc. 17th Int'l Conf. Pattern Recognition, pp. 260-263, 2004.
[47] C.H. Papadimitriou and K. Steiglitz, Combinatorial Optimization, Algorithms and Complexity. Prentice-Hall, 1982.
[48] E.B. Fowlkes and C.L. Mallows, “A Method for Comparing Two Hierarchical Clusterings,” J. Am. Statistical Assoc., vol. 78, no. 383, pp. 553-569, 1983.
[49] P. Jaccard, “The Distribution of Flora in the Alpine Zone,” The New Phytologist, vol. 11, no. 2, pp. 37-50, 1912.
[50] M. Meila, “Comparing Clusterings— An Axiomatic View,” Proc. 22nd Int'l Conf. Machine Learning (ICML '05), L.D. Raedt and S. Wrobel, eds., vol. 22, 2005.
[51] C.J. van Rijsbergen, Information Retrieval, second ed. Butterworths, 1979.
[52] A. Patrikainen and M. Meila, “Comparing Subspace Clusterings,” Technical Report UW-CSE-2004-10-1, Univ. Washington, 2004.
[53] M.E. Argentati, “Principal Angles between Subspaces,” http://www-math.cudenver.edu/~aknyazev/teaching/ricotalk_defense.pdf, 2006.
[54] Å. Björck and G.H. Golub, “Numerical Methods for Computing Angles between Linear Subspaces,” Math. Computation, vol. 27, pp. 579-594, 1973.
[55] Z. Drmac, “On Principal Angles between Subspaces of Euclidean Space,” SIAM J. Matrix Analysis and Applications, vol. 22, no. 1, pp. 173-194, 2000.
[56] K.Y. Yip, “HARP: A Practical Projected Clustering Algorithm for Mining Gene Expression Data,” master's thesis, The University of Hong Kong, Pokfulam Road, Hong Kong, http://www.csis.hku.hk/~ylyip/papers/thesis.pdf, 2004.
[57] C. Yang, U. Fayyad, and P.S. Bradley, “Efficient Discovery of Error-Tolerant Frequent Itemsets in High Dimensions,” Proc. ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, 2000.
[58] J.K. Seppänen and H. Mannila, “Dense Itemsets,” Proc. ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, 2004.
[59] A. Gionis, H. Mannila, and J.K. Seppänen, “Geometric and Combinatorial Tiles in 0-1 Data,” Proc. European Conf. Principles and Practice of Knowledge Discovery in Databases, 2004.
[60] J. Besson, C. Robardet, and J.-F. Boulicaut, “Mining Alpha-Beta Concepts as Relevant Bi-Sets from Transactional Data,” Proc. Third Int'l Workshop Knowledge Discovery in Inductive Databases (KDID '04), 2004.
[61] N. Mishra, D. Ron, and R. Swaminathan, “A New Conceptual Clustering Framework,” Machine Learning, vol. 56, nos. 1-3, pp. 115-151, 2004.
[62] A. Kaban, E. Bingham, and T. Hirsimäki, “Learning to Read between the Lines: The Aspect Bernoulli Model,” Proc. Fourth SIAM Int'l Conf. Data Mining, pp. 462-466, 2004.
[63] A. Ben-Hur, A. Elisseeff, and I. Guyon, “A Stability Based Method for Discovering Structure in Clustered Data,” Proc. Pacific Symp. Biocomputing, pp. 6-17, 2002.

Index Terms:
Subspace clustering, projected clustering, distance, feature selection, cluster validation.
Citation:
Anne Patrikainen, Marina Meila, "Comparing Subspace Clusterings," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 7, pp. 902-916, July 2006, doi:10.1109/TKDE.2006.106