Top-Down Parameter-Free Clustering of High-Dimensional Categorical Data
December 2007 (vol. 19 no. 12)
pp. 1607-1624
A parameter-free, fully automatic approach to clustering high-dimensional categorical data is proposed. The technique is based on a two-phase iterative procedure that attempts to improve the overall quality of the whole partition. In the first phase, cluster assignments are held fixed, and a new cluster is added to the partition by choosing and splitting a low-quality cluster. In the second phase, the number of clusters is fixed, and cluster assignments are optimized. By iterating these phases, the algorithm improves the overall quality of the partition and discovers clusters whose number is established naturally by the inherent features of the underlying dataset, rather than being specified in advance. Furthermore, the approach is parametric in the notion of cluster quality: here, a cluster is defined as a set of tuples exhibiting a form of homogeneity. We show how a suitable notion of cluster homogeneity can be defined in the context of high-dimensional categorical data, from which an effective instance of the proposed clustering scheme immediately follows. Experiments on both synthetic and real data show that the devised algorithm scales linearly and achieves nearly optimal results in terms of compactness and separation.
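The two-phase procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's actual algorithm: the homogeneity measure (average per-attribute mode frequency), the crude cluster-count penalty in `partition_quality`, and the k-modes-style split and reassignment steps are simple stand-ins for the quality notion and optimization the paper defines.

```python
from collections import Counter

def homogeneity(cluster):
    """Average within-cluster frequency of each attribute's modal value.
    Ranges in (0, 1]; 1.0 means all tuples in the cluster are identical."""
    n, m = len(cluster), len(cluster[0])
    modes = (Counter(t[j] for t in cluster).most_common(1)[0][1] for j in range(m))
    return sum(c / n for c in modes) / m

def mode_tuple(cluster):
    """Component-wise mode: a categorical 'centroid' for the cluster."""
    m = len(cluster[0])
    return tuple(Counter(t[j] for t in cluster).most_common(1)[0][0] for j in range(m))

def overlap(s, t):
    """Number of attributes on which two tuples agree."""
    return sum(a == b for a, b in zip(s, t))

def partition_quality(clusters):
    """Size-weighted homogeneity minus a crude penalty on the number of
    clusters (a stand-in for the paper's compactness/separation criterion)."""
    n = sum(len(c) for c in clusters)
    fit = sum(len(c) / n * homogeneity(c) for c in clusters)
    return fit - (len(clusters) - 1) / n

def split(cluster):
    """Phase 1 helper: seed two sub-clusters with the most dissimilar pair
    of tuples, then attach every other tuple to the closer seed."""
    n = len(cluster)
    i0, j0 = min(((i, j) for i in range(n) for j in range(i + 1, n)),
                 key=lambda p: overlap(cluster[p[0]], cluster[p[1]]))
    a, b = cluster[i0], cluster[j0]
    left, right = [a], [b]
    for k, t in enumerate(cluster):
        if k not in (i0, j0):
            (left if overlap(t, a) >= overlap(t, b) else right).append(t)
    return left, right

def stabilize(clusters, rounds=5):
    """Phase 2: with the number of clusters fixed, reassign each tuple to
    the cluster whose mode it overlaps most (k-modes style)."""
    for _ in range(rounds):
        centers = [mode_tuple(c) for c in clusters if c]
        new = [[] for _ in centers]
        for t in (t for c in clusters for t in c):
            best = max(range(len(centers)), key=lambda i: overlap(t, centers[i]))
            new[best].append(t)
        clusters = [c for c in new if c]
    return clusters

def top_down_cluster(data):
    """Alternate the two phases; accept a split only while it improves the
    overall partition quality, so the cluster count emerges from the data."""
    clusters = [list(data)]
    while True:
        splittable = [i for i, c in enumerate(clusters) if len(c) > 1]
        if not splittable:
            return clusters
        worst = min(splittable, key=lambda i: homogeneity(clusters[i]))
        trial = clusters[:worst] + clusters[worst + 1:] + list(split(clusters[worst]))
        trial = stabilize(trial)
        if partition_quality(trial) > partition_quality(clusters):
            clusters = trial
        else:
            return clusters

# Two visibly different groups of categorical tuples.
data = [("red", "small", "round"), ("red", "small", "oval"),
        ("red", "medium", "round"), ("blue", "large", "square"),
        ("blue", "large", "flat"), ("green", "large", "square")]
result = top_down_cluster(data)
```

On this toy dataset, the first split (into the "red" and "blue/green" groups) raises the penalized quality and is accepted, while any further split lowers it and is rejected, so the procedure stops at two clusters without the number ever being specified up front.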

[1] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan, “Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications,” Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '98), pp. 94-105, 1998.
[2] P. Andritsos, P. Tsaparas, R. Miller, and K. Sevcik, “LIMBO: Scalable Clustering of Categorical Data,” Proc. Ninth Int'l Conf. Extending Database Technology (EDBT '04), pp. 123-146, 2004.
[3] D. Barbará, J. Couto, and Y. Li, “COOLCAT: An Entropy-Based Algorithm for Categorical Clustering,” Proc. 11th ACM Conf. Information and Knowledge Management (CIKM '02), pp. 582-589, 2002.
[4] J. Basak and R. Krishnapuram, “Interpretable Hierarchical Clustering by Constructing an Unsupervised Decision Tree,” IEEE Trans. Knowledge and Data Eng., vol. 17, no. 1, Jan. 2005.
[5] D.M. Blei, A.Y. Ng, and M.I. Jordan, “Latent Dirichlet Allocation,” J. Machine Learning Research, vol. 3, pp. 993-1022, 2003.
[6] H. Blockeel, L.D. Raedt, and J. Ramon, “Top-Down Induction of Clustering Trees,” Proc. 15th Int'l Conf. Machine Learning (ICML '98), pp. 55-63, 1998.
[7] I. Cadez, P. Smyth, and H. Mannila, “Probabilistic Modeling of Transaction Data with Applications to Profiling, Visualization, and Prediction,” Proc. Seventh ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '01), pp. 37-46, 2001.
[8] M. Carreira-Perpinan and S. Renals, “Practical Identifiability of Finite Mixture of Multivariate Distributions,” Neural Computation, vol. 12, no. 1, pp. 141-152, 2000.
[9] S. Deerwester et al., “Indexing by Latent Semantic Analysis,” J. Am. Soc. Information Science, vol. 41, no. 6, 1990.
[10] M. Ester, H.P. Kriegel, J. Sander, and X. Xu, “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise,” Proc. Second Int'l Conf. Knowledge Discovery and Data Mining (KDD '96), pp. 226-231, 1996.
[11] D. Fisher, “Knowledge Acquisition via Incremental Conceptual Clustering,” Machine Learning, vol. 2, pp. 139-172, 1987.
[12] C. Fraley and A. Raftery, “How Many Clusters? Which Clustering Method? The Answer via Model-Based Cluster Analysis,” The Computer J., vol. 41, no. 8, 1998.
[13] G. Gan and J. Wu, “Subspace Clustering for High Dimensional Categorical Data,” SIGKDD Explorations, vol. 6, no. 2, pp. 87-94, 2004.
[14] V. Ganti, J. Gehrke, and R. Ramakrishnan, “CACTUS: Clustering Categorical Data Using Summaries,” Proc. Fifth ACM Conf. Knowledge Discovery and Data Mining (KDD '99), pp. 73-83, 1999.
[15] A. Gersho and R. Gray, Vector Quantization and Signal Compression. Kluwer Academic Publishers, 1991.
[16] D. Gibson, J. Kleinberg, and P. Raghavan, “Clustering Categorical Data: An Approach Based on Dynamical Systems,” VLDB J., vol. 8, pp. 222-236, 2000.
[17] A. Gordon, Classification. Chapman and Hall/CRC Press, 1999.
[18] C. Gozzi, F. Giannotti, and G. Manco, “Clustering Transactional Data,” Proc. Sixth European Conf. Principles and Practice of Knowledge Discovery in Databases (PKDD '02), pp. 175-187, 2002.
[19] J. Grabmeier and A. Rudolph, “Techniques of Cluster Algorithms in Data Mining,” Data Mining and Knowledge Discovery, vol. 6, no. 4, pp. 303-360, 2002.
[20] S. Guha, R. Rastogi, and K. Shim, “CURE: An Efficient Clustering Algorithm for Large Databases,” Proc. ACM SIGMOD Conf. Management of Data (SIGMOD '98), pp. 73-84, 1998.
[21] S. Guha, R. Rastogi, and K. Shim, “ROCK: A Robust Clustering Algorithm for Categorical Attributes,” Information Systems, vol. 25, no. 5, pp. 345-366, 2001.
[22] M. Halkidi, Y. Batistakis, and M. Vazirgiannis, “Cluster Validity Methods,” SIGMOD Record, vol. 31, nos. 1-2, 2002.
[23] E. Han, G. Karypis, V. Kumar, and B. Mobasher, “Clustering in a High Dimensional Space Using Hypergraph Models,” Proc. ACM SIGMOD Workshops Research Issues on Data Mining and Knowledge Discovery (DMKD '97), 1997.
[24] Z. Huang, “Extensions to the K-Means Algorithm for Clustering Large Data Sets with Categorical Values,” Data Mining and Knowledge Discovery, vol. 2, no. 3, pp. 283-304, 1998.
[25] A. Jain and R. Dubes, Algorithms for Clustering Data. Prentice Hall, 1988.
[26] E. Keogh, S. Lonardi, and C. Ratanamahatana, “Towards Parameter-Free Data Mining,” Proc. 10th ACM Conf. Knowledge Discovery and Data Mining (KDD '04), pp. 206-215, 2004.
[27] B. Liu, Y. Xia, and P. Yu, “Clustering through Decision Tree Construction,” Proc. Ninth Int'l Conf. Information and Knowledge Management (CIKM '00), pp. 20-29, 2000.
[28] A. McCallum, K. Nigam, and L.H. Ungar, “Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching,” Proc. Sixth Int'l Conf. Knowledge Discovery and Data Mining (KDD '00), pp. 169-178, 2000.
[29] G. McLachlan and D. Peel, Finite Mixture Models. John Wiley & Sons, 2000.
[30] M. Meila and D. Heckerman, “An Experimental Comparison of Model-Based Clustering Methods,” Machine Learning, vol. 42, no. 1/2, pp. 9-29, 2001.
[31] R. Ng and J. Han, “CLARANS: A Method for Clustering Objects for Spatial Data Mining,” IEEE Trans. Knowledge and Data Eng., vol. 14, no. 5, pp. 1003-1016, Sept./Oct. 2002.
[32] M. Ozdal and C. Aykanat, “Hypergraph Models and Algorithms for Data-Pattern-Based Clustering,” Data Mining and Knowledge Discovery, vol. 9, pp. 29-57, 2004.
[33] L. Parsons, E. Haque, and H. Liu, “Subspace Clustering for High-Dimensional Data: A Review,” SIGKDD Explorations, vol. 6, no. 1, pp. 90-105, 2004.
[34] D. Pelleg and A. Moore, “X-Means: Extending K-Means with Efficient Estimation of the Number of Clusters,” Proc. 17th Int'l Conf. Machine Learning (ICML '00), pp. 727-734, 2000.
[35] S. Zhong and J. Ghosh, “Generative Model-Based Document Clustering: A Comparative Study,” Knowledge and Information Systems, vol. 8, no. 3, pp. 374-384, 2005.
[36] P. Smyth, “Model Selection for Probabilistic Clustering Using Cross-Validated Likelihood,” Statistics and Computing, vol. 10, no. 1, pp. 63-72, 2000.
[37] M. Sultan et al., “Binary Tree-Structured Vector Quantization Approach to Clustering and Visualizing Microarray Data,” Bioinformatics, vol. 18, 2002.
[38] T. Li, S. Ma, and M. Ogihara, “Entropy-Based Criterion in Categorical Clustering,” Proc. 21st Int'l Conf. Machine Learning (ICML '04), pp. 68-75, 2004.
[39] K. Wang, C. Xu, and B. Liu, “Clustering Transactions Using Large Items,” Proc. Eighth Int'l Conf. Information and Knowledge Management (CIKM '99), pp. 483-490, 1999.
[40] Y. Yang, X. Guan, and J. You, “CLOPE: A Fast and Effective Clustering Algorithm for Transactional Data,” Proc. Eighth ACM Conf. Knowledge Discovery and Data Mining (KDD '02), pp. 682-687, 2002.
[41] M. Zaki and M. Peters, “CLICK: Mining Subspace Clusters in Categorical Data via k-Partite Maximal Cliques,” Proc. 21st Int'l Conf. Data Eng. (ICDE '05), 2005.
[42] T. Zhang, R. Ramakrishnan, and M. Livny, “BIRCH: An Efficient Data Clustering Method for Very Large Databases,” Proc. ACM SIGMOD Conf. Management of Data (SIGMOD '96), pp. 103-114, 1996.

Index Terms:
Clustering, Database Applications - Clustering, Information Search and Retrieval - Clustering
Citation:
Eugenio Cesario, Giuseppe Manco, Riccardo Ortale, "Top-Down Parameter-Free Clustering of High-Dimensional Categorical Data," IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 12, pp. 1607-1624, Dec. 2007, doi:10.1109/TKDE.2007.190649