Issue No.12 - December (2011 vol.33)

pp: 2396-2409

Natthakan Iam-On, Aberystwyth University, Aberystwyth

Tossapon Boongoen, Royal Thai Air Force Academy, Thailand

Simon Garrett, Aberystwyth University, Aberystwyth

Chris Price, Aberystwyth University, Aberystwyth

DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TPAMI.2011.84

ABSTRACT

Cluster ensembles have recently emerged as a powerful alternative to standard cluster analysis, aggregating several input data clusterings to generate a single output clustering with improved robustness and stability. From the early work, these techniques held great promise; however, most of them generate the final solution from incomplete information about the cluster ensemble. The underlying ensemble-information matrix reflects only the relations between clusters and data points, while the relations among clusters are generally overlooked. This paper presents a new link-based approach to improve the conventional matrix. It does so using similarities between clusters, which are estimated from a link network model of the ensemble. In particular, three new link-based algorithms are proposed for the underlying similarity assessment. The final clustering result is generated from the refined matrix using two different consensus functions, based on feature-based and graph-based partitioning. This approach is the first to address and explicitly employ the relationship between input partitions, which has not been emphasized by recent studies of matrix refinement. The effectiveness of the link-based approach is empirically demonstrated over 10 data sets (synthetic and real) and three benchmark evaluation measures. The results suggest that the new approach is able to efficiently extract the information embedded in the input clusterings, and that it regularly achieves higher clustering quality than several state-of-the-art techniques.
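The matrix refinement described in the abstract can be illustrated with a minimal sketch. It is not the paper's method: the abstract does not specify the three link-based algorithms, so the similarity used here (Jaccard overlap as edge weight, then a shared-neighbour score) is an assumed stand-in, and the function names and the `decay` parameter are hypothetical. The sketch builds the conventional binary cluster-association matrix from several base clusterings and then replaces its zero entries with cluster-to-cluster similarities.

```python
import numpy as np

def binary_association(labelings):
    """Stack the one-hot cluster memberships of every base clustering side by side."""
    blocks = []
    for labels in labelings:
        labels = np.asarray(labels)
        ids = np.unique(labels)
        blocks.append((labels[:, None] == ids[None, :]).astype(float))
    return np.hstack(blocks), [b.shape[1] for b in blocks]

def cluster_similarity(B):
    """Assumed stand-in for a link-based measure (not taken from the paper).

    Clusters are nodes, the edge weight is the Jaccard overlap of their members,
    and two clusters score highly when they share strongly linked neighbours.
    """
    inter = B.T @ B
    sizes = B.sum(axis=0)
    union = sizes[:, None] + sizes[None, :] - inter
    W = inter / union                      # every cluster is non-empty, so union > 0
    np.fill_diagonal(W, 0.0)
    # Shared-neighbour score: sum over k of min(W[i, k], W[j, k]).
    S = np.minimum(W[:, None, :], W[None, :, :]).sum(axis=2)
    np.fill_diagonal(S, 0.0)
    peak = S.max()
    if peak > 0:
        S = S / peak                       # scale off-diagonal scores into [0, 1]
    np.fill_diagonal(S, 1.0)
    return S

def refine(B, sizes, S, decay=0.9):
    """Replace each 0 entry with the decayed similarity between that cluster and
    the cluster the data point belongs to in the same base clustering."""
    R = B.copy()
    start = 0
    for k in sizes:
        cols = slice(start, start + k)
        own = B[:, cols].argmax(axis=1) + start   # each point's own cluster index
        soft = decay * S[own][:, cols]
        R[:, cols] = np.where(B[:, cols] == 1, 1.0, soft)
        start += k
    return R

# Three base clusterings of four data points (toy data, made up for illustration).
labelings = [[0, 0, 1, 1], [0, 1, 1, 1], [0, 0, 0, 1]]
B, sizes = binary_association(labelings)
R = refine(B, sizes, cluster_similarity(B))
print(R.round(2))
```

Each row of the refined matrix can then be treated as a feature vector for an ordinary clustering algorithm, or the matrix can be read as a weighted bipartite graph between data points and clusters; these correspond loosely to the feature-based and graph-based consensus functions the abstract mentions.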

INDEX TERMS

Clustering, cluster ensembles, cluster relations, link-based similarity, data mining.

CITATION

Natthakan Iam-On, Tossapon Boongoen, Simon Garrett, Chris Price, "A Link-Based Approach to the Cluster Ensemble Problem",

*IEEE Transactions on Pattern Analysis & Machine Intelligence*, vol. 33, no. 12, pp. 2396-2409, December 2011, doi:10.1109/TPAMI.2011.84