Issue No.09 - September (2009 vol.21)
pp: 1249-1262
Ping Luo, Chinese Academy of Sciences and Hewlett-Packard Labs China, Beijing
Hui Xiong, Rutgers University, Newark
Guoxing Zhan, Wayne State University, Detroit
Junjie Wu, Beihang University, Beijing
Zhongzhi Shi, Chinese Academy of Sciences, Beijing
ABSTRACT
This paper studies the generalization and normalization of information-theoretic distance measures for clustering validation. We first introduce a uniform representation of these distance measures, called the quasi-distance, which is induced from a general form of conditional entropy. The quasi-distance has three properties: symmetry, the triangle law, and the minimum reachable. These properties make the quasi-distance a natural external measure for clustering validation. We also observe that the ranges of the distance measures differ when they are applied to clustering validation on different data sets. Therefore, comparing the performance of clustering algorithms across data sets requires distance normalization to equalize the ranges of the measures. A critical challenge for distance normalization is obtaining the range of a distance measure for a given data set. To that end, we theoretically analyze how to compute the maximum value of a distance measure for a data set. Finally, we compare the performance of the partitional clustering algorithm K-means on various real-world data sets. The experiments show that the normalized distance measures perform better than the original distance measures when comparing clusterings of different data sets, and that the normalized Shannon distance performs best among the four distance measures under study.
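As a rough illustration of the idea, the following Python sketch builds a distance between two clusterings from Shannon conditional entropy, d(A, B) = H(A|B) + H(B|A), and then rescales it. It is a minimal sketch under stated assumptions, not the authors' formulation: the function names are hypothetical, only the Shannon case is covered rather than the general quasi-distance, and the divisor log n is a generic upper bound for n objects that stands in for the data-set-specific maximum the paper analyzes, which the abstract does not state.

```python
import math
from collections import Counter


def conditional_entropy(a, b):
    """Shannon conditional entropy H(A|B), in nats, of two label sequences."""
    n = len(a)
    joint = Counter(zip(a, b))   # counts of co-occurring (label in a, label in b)
    b_counts = Counter(b)        # cluster sizes in b
    h = 0.0
    for (_, b_label), n_ij in joint.items():
        # p(i, j) * log p(i | j), accumulated with a minus sign
        h -= (n_ij / n) * math.log(n_ij / b_counts[b_label])
    return h


def shannon_distance(a, b):
    """Distance induced by conditional entropy: H(A|B) + H(B|A)."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("clusterings must label the same non-empty set of objects")
    return conditional_entropy(a, b) + conditional_entropy(b, a)


def normalized_shannon_distance(a, b):
    """Rescale by log(n). ASSUMPTION: log(n) is a generic upper bound used
    here for illustration; it is not the maximum derived in the paper."""
    n = len(a)
    if n == 1:
        return 0.0
    return shannon_distance(a, b) / math.log(n)


if __name__ == "__main__":
    truth = [0, 0, 1, 1]
    found = [0, 1, 0, 1]
    print(shannon_distance(truth, found))
    print(normalized_shannon_distance(truth, found))
```

In an external validation setting, one argument would be the ground-truth class labels and the other the cluster assignments produced by an algorithm such as K-means. A distance of zero means the two partitions agree up to a relabeling of clusters.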
INDEX TERMS
Clustering validation, entropy, information-theoretic distance measures, K-means clustering.
CITATION
Ping Luo, Hui Xiong, Guoxing Zhan, Junjie Wu, Zhongzhi Shi, "Information-Theoretic Distance Measures for Clustering Validation: Generalization and Normalization," IEEE Transactions on Knowledge & Data Engineering, vol. 21, no. 9, pp. 1249-1262, September 2009, doi:10.1109/TKDE.2008.200