Issue No.06 - June (2013 vol.35)
pp: 1509-1522
Liang Bai, Shanxi University, Shanxi, and City University of Hong Kong, Hong Kong
Jiye Liang, Shanxi University, Shanxi
Chuangyin Dang, City University of Hong Kong, Hong Kong
Fuyuan Cao, Shanxi University, Shanxi
ABSTRACT
As a leading partitional clustering technique, $k$-modes is one of the most computationally efficient clustering methods for categorical data. In $k$-modes, a cluster is represented by a "mode," which is composed of the attribute value that occurs most frequently in each attribute domain of the cluster. In real applications, however, using only one attribute value per attribute to represent a cluster may not be adequate, and this can in turn degrade the accuracy of data analysis. To remedy this deficiency, several modified clustering algorithms have been developed that assign appropriate weights to several attribute values in each attribute. Although these modified algorithms are quite effective, their convergence proofs are lacking. In this paper, we analyze their convergence properties and prove that they are not guaranteed to converge under their optimization frameworks unless they degenerate to the original $k$-modes type algorithms. Furthermore, we propose two different modified algorithms with weighted cluster prototypes to overcome the shortcomings of these existing algorithms. We rigorously derive updating formulas for the proposed algorithms and prove their convergence. The experimental studies show that the proposed algorithms are effective and efficient for large categorical datasets.
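The contrast between the two prototype styles can be sketched in a few lines of Python. This is an illustrative toy example, not the paper's algorithms: the data, function names, and the simple relative-frequency weighting are assumptions for exposition (the paper derives its own updating formulas and weights).

```python
from collections import Counter

# Toy cluster of categorical objects (rows) with two attributes.
# Purely illustrative data, not from the paper.
cluster = [
    ("red", "circle"),
    ("red", "square"),
    ("blue", "circle"),
]

def hard_mode(objects):
    """Classical k-modes prototype: the single most frequent
    value in each attribute domain of the cluster."""
    n_attrs = len(objects[0])
    return tuple(
        Counter(obj[j] for obj in objects).most_common(1)[0][0]
        for j in range(n_attrs)
    )

def weighted_prototype(objects):
    """Weighted prototype: keep every value in each attribute,
    weighted here by its relative frequency in the cluster
    (one common weighting choice, used only as an assumption)."""
    n = len(objects)
    n_attrs = len(objects[0])
    return [
        {v: c / n for v, c in Counter(obj[j] for obj in objects).items()}
        for j in range(n_attrs)
    ]

print(hard_mode(cluster))          # ('red', 'circle')
print(weighted_prototype(cluster))
```

The hard mode keeps only `red` and `circle` and discards the information that a third of the cluster is `blue` or `square`; the weighted prototype retains that distributional information, which is what the weighted $k$-modes variants discussed in the paper exploit.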
INDEX TERMS
Clustering algorithms, Prototypes, Algorithm design and analysis, Convergence, Optimization, Linear programming, Frequency measurement, Clustering, $k$-modes type clustering algorithms, categorical data, weighted cluster prototype
CITATION
Liang Bai, Jiye Liang, Chuangyin Dang, Fuyuan Cao, "The Impact of Cluster Representatives on the Convergence of the K-Modes Type Clustering," IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 35, no. 6, pp. 1509-1522, June 2013, doi:10.1109/TPAMI.2012.228