Issue No. 03 - March (2012, vol. 24)

pp: 413-425

Natthakan Iam-On, Mae Fah Luang University, Muang

Tossapon Boongoen, Royal Thai Air Force Academy, Saimai

Simon Garrett, Aispire Consulting Ltd., Aberystwyth

Chris Price, Aberystwyth University, Aberystwyth

DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TKDE.2010.268

ABSTRACT

Although attempts have been made to solve the problem of clustering categorical data via cluster ensembles, with results that are competitive with conventional algorithms, these techniques generate a final data partition based on incomplete information. The underlying ensemble-information matrix presents only cluster-data point relations, with many entries left unknown. The paper presents an analysis suggesting that this problem degrades the quality of the clustering result, and it presents a new link-based approach that improves the conventional matrix by discovering unknown entries through the similarity between clusters in an ensemble. In particular, an efficient link-based algorithm is proposed for the underlying similarity assessment. Afterward, to obtain the final clustering result, a graph partitioning technique is applied to a weighted bipartite graph that is formulated from the refined matrix. Experimental results on multiple real data sets suggest that the proposed link-based method almost always outperforms both conventional clustering algorithms for categorical data and well-known cluster ensemble techniques.
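The refinement idea in the abstract can be illustrated with a minimal Python sketch. It builds the binary cluster-data point matrix for a small ensemble and then fills each zero entry with a similarity between the cluster the point actually belongs to and the cluster in that column. The abstract does not spell out the link-based algorithm, so the shared-neighbour score below is only a stand-in, and the names `refine_matrix`, `jaccard`, and `decay` are assumptions made for illustration.

```python
def jaccard(a, b):
    """Overlap of two member sets, used here as a link weight between clusters."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def refine_matrix(ensemble, decay=0.9):
    """Return (matrix, columns) for a list of label vectors, one per base clustering.

    Entries are 1.0 where a point belongs to a cluster. Other entries are filled
    with an assumed shared-neighbour similarity, scaled by `decay` (hypothetical).
    """
    n = len(ensemble[0])
    # One column per cluster of every base clustering: (clustering index, label).
    columns = [(m, lab) for m, labels in enumerate(ensemble)
               for lab in sorted(set(labels))]
    members = [{i for i in range(n) if ensemble[m][i] == lab}
               for m, lab in columns]
    k = len(columns)
    # Link weights between clusters, from the overlap of their members.
    w = [[jaccard(members[a], members[b]) if a != b else 0.0
          for b in range(k)] for a in range(k)]
    # Two clusters are similar when they link strongly to the same third clusters.
    score = [[sum(min(w[a][c], w[b][c]) for c in range(k))
              for b in range(k)] for a in range(k)]
    top = max((score[a][b] for a in range(k) for b in range(k) if a != b),
              default=0.0)
    matrix = []
    for i in range(n):
        row = []
        for j, (m, lab) in enumerate(columns):
            if ensemble[m][i] == lab:
                row.append(1.0)  # known entry: the point is in this cluster
            else:
                own = columns.index((m, ensemble[m][i]))
                row.append(decay * score[own][j] / top if top > 0 else 0.0)
        matrix.append(row)
    return matrix, columns


# Two base clusterings of four data points.
ensemble = [[0, 0, 1, 1], [0, 0, 0, 1]]
matrix, columns = refine_matrix(ensemble)
```

In the approach the abstract describes, a matrix of this kind would supply the edge weights of the bipartite graph between data points and clusters, which is then partitioned to obtain the final clustering. That partitioning step is omitted here.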

INDEX TERMS

Clustering, categorical data, cluster ensembles, link-based similarity, data mining.

CITATION

Natthakan Iam-On, Tossapon Boongoen, Simon Garrett, Chris Price, "A Link-Based Cluster Ensemble Approach for Categorical Data Clustering",

*IEEE Transactions on Knowledge & Data Engineering*, vol. 24, no. 3, pp. 413-425, March 2012, doi:10.1109/TKDE.2010.268