This Article 
   
 Share 
   
 Bibliographic References 
   
 Add to: 
 
Digg
Furl
Spurl
Blink
Simpy
Google
Del.icio.us
Y!MyWeb
 
 Search 
   
Clustering Ensembles: Models of Consensus and Weak Partitions
December 2005 (vol. 27 no. 12)
pp. 1866-1881
Clustering ensembles have emerged as a powerful method for improving both the robustness as well as the stability of unsupervised classification solutions. However, finding a consensus clustering from multiple partitions is a difficult problem that can be approached from graph-based, combinatorial, or statistical perspectives. This study extends previous research on clustering ensembles in several respects. First, we introduce a unified representation for multiple clusterings and formulate the corresponding categorical clustering problem. Second, we propose a probabilistic model of consensus using a finite mixture of multinomial distributions in a space of clusterings. A combined partition is found as a solution to the corresponding maximum-likelihood problem using the EM algorithm. Third, we define a new consensus function that is related to the classical intraclass variance criterion using the generalized mutual information definition. Finally, we demonstrate the efficacy of combining partitions generated by weak clustering algorithms that use data projections and random data splits. A simple explanatory model is offered for the behavior of combinations of such weak clustering components. Combination accuracy is analyzed as a function of several parameters that control the power and resolution of component partitions as well as the number of partitions. We also analyze clustering ensembles with incomplete information and the effect of missing cluster labels on the quality of overall consensus. Experimental results demonstrate the effectiveness of the proposed methods on several real-world data sets.

[1] E.E. Abola, F.C. Bernstein, S.H. Bryant, T.F. Koetzle, and J. Weng, “Protein Data Bank,” Crystallographic Databases-Information Content, Software Systems, Scientific Applications, F. Allen et al., eds. pp. 107-132, Bonn/Cambridge/Chester: Data Commission of the Int'l Union of Crystallography, 1987.
[2] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan, “Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications,” Proc. ACM-SIGMOD 1998 Int'l Conf. Management of Data, pp. 94-105, 1998.
[3] J.-P. Barthelemy and B. Leclerc, “The Median Procedure for Partition,” Partitioning Data Sets, AMS DIMACS Series in Discrete Math., I.J. Cox et al., eds., vol. 19, pp. 3-34, 1995.
[4] D.J. Bartholomew and M. Knott, “Latent Variable Models and Factor Analysis,” Kendall's Library of Statistics 7, second ed. London: Ar nold, 1999.
[5] A. Ben-Hur, D. Horn, H.T. Siegelmann, and V. Vapnik, “Support Vector Clustering,” J. Machine Learning Research, vol. 2, pp. 125-137, 2001.
[6] L. Breiman, “Arcing Classifiers,” The Annals of Statistics, vol. 26, no. 3, pp. 801-849 1998
[7] E. Dimitriadou, A. Weingessel, and K. Hornik, “Voting-Merging: An Ensemble Method for Clustering,” Proc. In'l Conf. Artificial Neural Networks, pp. 217-224, 2001.
[8] A.P. Dempster, N.M. Laird, and D.B. Rubin, “Maximum Likelihood from Incomplete Data via the EM Algorithm,” J. Royal Statistical Soc., B, vol. 39, pp. 1-22, 199.
[9] P. Domingos and M. Pazzani, “On the Optimality of the Simple Bayesian Classifier under Zero-One Loss,” Machine Learning, vol. 29, pp. 103-130, 1997.
[10] S. Dudoit and J. Fridlyand, “Bagging to Improve the Accuracy of a Clustering Procedure,” Bioinformatics, vol. 19, no. 9, pp. 1090-1099, 2003.
[11] X.Z. Fern and C.E. Brodley, “Random Projection for High Dimensional Data Clustering: A Cluster Ensemble Approach,” Proc. 20th Int'l Conf. Machine Learning, 2003.
[12] B. Fischer and J.M. Buhmann, “Path-Based Clustering for Grouping of Smooth Curves and Texture Segmentation,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 25, no. 4, pp. 513-518, Apr. 2003.
[13] M. Figueiredo and A.K. Jain, “Unsupervised Learning of Finite Mixture Models,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, pp. 381-396, 2002.
[14] A.L.N. Fred, “Finding Consistent Clusters in Data Partitions,” Proc. Third Int'l Workshop Multiple Classifier Systems, F. Roli and J. Kittler, eds., pp. 309-318, 2001.
[15] A.L.N. Fred and A.K. Jain, “Data Clustering Using Evidence Accumulation,” Proc. 16th Int'l Conf. Pattern Recognition, pp. 276-280, 2002.
[16] A.L.N. Fred and A.K Jain, “Evidence Accumulation Clustering Based on the K-Means Algorithm,” Proc. Structural, Syntactic, and Statistical Pattern Recognition, T. Caelli et al., eds., pp. 442-451, 2002.
[17] A. Fred and A.K. Jain, “Robust Data Clustering,” Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, June 2003.
[18] Y. Freund and R.E. Schapire, “Experiments with a New Boosting Algorithm,” Proc. 13th Int'l Conf. Machine Learning, pp. 148-156, 1996.
[19] W. Gablentz, M. Köppen, and E. Dimitriadou, “Robust Clustering by Evolutionary Computation,” Proc. Fifth Online World Conf. Soft Computing in Industrial Applications (WSC5), 2000.
[20] Z. Ghahramani and M. Jordan, “Supervised Learning from Incomplete Data via an EM Approach,” Proc. Advances in Neural Information Processing Systems, pp. 120-127, 1993.
[21] M.A. Gluck and J.E. Corter, “Information, Uncertainty, and the Utility of Categories,” Proc. Seventh Ann. Conf. Cognitive Science Soc., pp. 283-287, 1985.
[22] L.A. Goodman and W.H. Kruskal, “Measures of Association for Cross Classifications,” J. Am. Statistical Assoc., vol. 49, pp. 732-64, 1954.
[23] J. Havrda and F. Charvát, “Quantification Method of Classification Processes: Concept of Structrual a-Entropy,” Kybernetika, vol. 3, pp. 30-35, 1967.
[24] A. Hinneburg, D.A. Keim, and M. Wawryniuk, “Using Projections to Visually Cluster High-Dimensional Data,” Computing in Science and Eng., vol. 5, no. 2, pp. 12-25, 2003.
[25] T.K. Ho, “The Random Subspace Method for Constructing Decision Forests,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 20, no. 8, pp. 832-844, Aug. 1998.
[26] A.K. Jain and R.C. Dubes, Algorithms for Clustering Data. Prentice-Hall, 1988.
[27] A.K. Jain, M.N. Murty, and P. Flynn, “Data Clustering: A Review,” ACM Computing Surveys, vol. 31, no. 3, pp. 264-323, 1999.
[28] E. Johnson and H. Kargupta, “Collective, Hierarchical Clustering from Distributed, Heterogeneous Data,” Proc. ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, M. Zaki and C. Ho, eds., pp. 221-244, 1999.
[29] G. Karypis, R. Aggarwal, V. Kumar, and S. Shekhar, “Multilevel Hypergraph Partitioning: Applications in VLSI Design,” Proc. ACM/IEEE Design Automation Conf., pp. 526-529, 1997.
[30] G. Karypis and V. Kumar, “A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs,” SIAM J. Scientific Computing, vol. 20, no. 1, pp. 359-392, 1998.
[31] P. Kellam, X. Liu, N.J. Martin, C. Orengo, S. Swift, and A. Tucker, “Comparing, Contrasting and Combining Clusters in Viral Gene Expression Data,” Proc. Sixth Workshop Intelligent Data Analysis in Medicine and Pharmacology, pp. 56-62, 2001.
[32] E.M. Kleinberg, “Stochastic Discrimination,” Annals of Math. and Artificial Intelligence, vol. 1, pp. 207-239, 1990.
[33] J. Kleinberg, “An Impossibility Theorem for Clustering,” Proc. Advances in Neural Information Processing Systems, vol. 15, 2002.
[34] P. Langley, W. Iba, and K. Thompson, “An Analysis of Bayesian Classifiers,” Proc. 10th Nat'l Conf. Artificial Intelligence, pp. 399-406, 1992.
[35] F. Leisch, “Bagged Clustering,” Working Papers SFB Adaptive Information Systems and Modeling in Economics and Management Science, Institut für Information, Abt. Produktionsmanagement, Wien, Wirtschaftsuniv, no. 51, Aug. 1999.
[36] S. Monti, P. Tamayo, J. Mesirov, and T. Golub, “Consensus Clustering: A Resamlping Based Method for Class Discovery and Visualization of Gene Expression Microarray Data,” J. Machine Learning, vol. 52, nos. 1-2, 2003.
[37] R.S. Michalski and R.E Stepp, “Automated Construction of Classifications: Conceptual Clustering versus Numerical Taxonomy,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 5, pp. 396-410, 1983.
[38] B. Minaei, A. Topchy, and W. Punch, “Ensembles of Partitions via Data Resampling,” Proc. Int'l Conf. Information Technology, vol. 2, pp. 188-192, 2004.
[39] B. Mirkin, “Reinterpreting the Category Utility Function,” Machine Learning, vol. 45, no. 2, pp. 219-228, 2001.
[40] S.C. Odewahn, E.B. Stockwell, R.L. Pennington, R.M. Humphreys, and W.A. Zumach, “Automated Star/Galaxy Discrimination with Neural Networks,” Astronomical J., vol. 103, pp. 308-331, 1992.
[41] B.H. Park and H. Kargupta, “Distributed Data Mining,” The Handbook of Data Mining, N. Ye, ed. Lawrence Erlbaum Assoc., 2003.
[42] C. Procopiuc, M. Jones, P. Agarwal, and T.M. Murali, “A Monte Carlo Algorithm for Fast Projective Clustering,” Proc. 2002 SIGMOD Conf., pp. 418-427, 2002.
[43] Y. Qian and C. Suen, “Clustering Combination Method,” Proc. Int'l Conf. Pattern Recognition, vol. 2, 2000.
[44] J.R. Quinlan, “Bagging, Boosting, and C4. 5,” Proc. 13th AAAI Conf Artificial Intelligence, pp. 725-30, 1996.
[45] M.L. Raymer, W.F. Punch, E.D. Goodman, L.A. Kuhn, and A.K. Jain, “Dimensionality Reduction Using Genetic Algorithms,” IEEE Trans. Evolutionary Computation, vol. 4, no. 2, pp. 164-171, 2000.
[46] D.B. Rubin, “Inference with Missing Data,” Biometrika, vol. 63, pp. 581-592, 1976.
[47] A. Strehl and J. Ghosh, “Cluster Ensembles— A Knowledge Reuse Framework for Combining Multiple Partitions,” J. Machine Learning Research, vol. 3, pp. 583-617, 2002.
[48] D.H. Fisher, “Knowledge Acquisition via Incremental Conceptual Clustering,” Machine Learning, vol. 2, pp. 139-172, 1987.
[49] C.H. Papadimitriou and K. Steiglitz, Combinatorial Optimization: Algorithms and Complexity. Prentice-Hall, 1982.

Index Terms:
Index Terms- Clustering, ensembles, multiple classifier systems, consensus function, mutual information.
Citation:
Alexander Topchy, Anil K. Jain, William Punch, "Clustering Ensembles: Models of Consensus and Weak Partitions," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 12, pp. 1866-1881, Dec. 2005, doi:10.1109/TPAMI.2005.237
Usage of this product signifies your acceptance of the Terms of Use.