
Issue No. 9, September 2009 (vol. 21)

pp: 1249-1262

Hui Xiong , Rutgers University, Newark

Guoxing Zhan , Wayne State University, Detroit

Junjie Wu , Beihang University, Beijing

Zhongzhi Shi , Chinese Academy of Sciences, Beijing

DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TKDE.2008.200

ABSTRACT

This paper studies the generalization and normalization of information-theoretic distance measures for clustering validation. Along this line, we first introduce a uniform representation of distance measures, called the quasi-distance, which is induced from a general form of conditional entropy. The quasi-distance possesses three properties: symmetry, the triangle law, and the minimum reachable. These properties ensure that the quasi-distance naturally lends itself to use as an external measure for clustering validation. In addition, we observe that the ranges of the distance measures differ when they are applied for clustering validation on different data sets. Therefore, when comparing the performance of clustering algorithms across data sets, distance normalization is required to equalize the ranges of the distance measures. A critical challenge for distance normalization is obtaining the range of a distance measure for a given data set. To that end, we theoretically analyze the computation of the maximum value of a distance measure for a data set. Finally, we compare the performance of the K-means partitional clustering algorithm on various real-world data sets. The experiments show that the normalized distance measures outperform the original distance measures when comparing clusterings of different data sets. Also, the normalized Shannon distance has the best performance among the four distance measures under study.
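To make the idea concrete, the sketch below computes a well-known entropy-based distance between two clusterings, the variation of information VI(A, B) = H(A) + H(B) - 2I(A; B), and normalizes it by an upper bound so that values are comparable across data sets of different sizes. This is a generic illustration of the class of measures the abstract discusses, not the paper's quasi-distance; the choice of log(n) as the normalizing bound is an assumption made here for the example.

```python
from collections import Counter
from math import log

def variation_of_information(labels_a, labels_b):
    """Entropy-based distance between two clusterings of the same n points:
    VI(A, B) = H(A) + H(B) - 2 I(A; B). Returns (raw VI, VI normalized
    by log(n), an upper bound on VI for n points)."""
    n = len(labels_a)
    pa = Counter(labels_a)                # cluster sizes under clustering A
    pb = Counter(labels_b)                # cluster sizes under clustering B
    joint = Counter(zip(labels_a, labels_b))  # contingency-table cell counts

    def entropy(counts):
        # Shannon entropy (natural log) of the empirical distribution.
        return -sum((c / n) * log(c / n) for c in counts.values())

    h_a, h_b = entropy(pa), entropy(pb)
    h_ab = entropy(joint)                 # joint entropy H(A, B)
    mi = h_a + h_b - h_ab                 # mutual information I(A; B)
    vi = h_a + h_b - 2 * mi               # equivalently 2*H(A,B) - H(A) - H(B)
    # Normalization by an upper bound equalizes ranges across data sets,
    # which is the role normalization plays in the abstract above.
    return vi, (vi / log(n) if n > 1 else 0.0)

# Two maximally disagreeing 2-cluster labelings of 4 points:
vi, nvi = variation_of_information([0, 0, 1, 1], [0, 1, 0, 1])
```

For identical clusterings the distance is 0; for the independent labelings above it attains the normalized maximum of 1, which is what makes such normalized scores comparable across data sets.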

INDEX TERMS

Clustering validation, entropy, information-theoretic distance measures, K-means clustering.

CITATION

Hui Xiong, Guoxing Zhan, Junjie Wu, Zhongzhi Shi, "Information-Theoretic Distance Measures for Clustering Validation: Generalization and Normalization",

*IEEE Transactions on Knowledge & Data Engineering*, vol. 21, no. 9, pp. 1249-1262, September 2009, doi:10.1109/TKDE.2008.200

REFERENCES

- [1] J. Aczél and Z. Daróczy, *On Measures of Information and Their Characterizations*. Academic Press, 1975.
- [2] D. Barbará and P. Chen, "Using Self-Similarity to Cluster Large Data Sets," *Data Mining and Knowledge Discovery*, vol. 7, no. 2, pp. 123-152, 2003.
- [3] D. Barbará, Y. Li, and J. Couto, "Coolcat: An Entropy-Based Algorithm for Categorical Clustering," *Proc. 11th ACM Int'l Conf. Information and Knowledge Management (CIKM '02)*, pp. 582-589, 2002.
- [4] L. Breiman, J. Friedman, C.J. Stone, and R.A. Olshen, *Classification and Regression Trees*. Wadsworth Int'l Group, 1984.
- [5] Y. Chen, Y. Zhang, and X. Ji, "Size Regularized Cut for Data Clustering," *Proc. 18th Conf. Neural Information Processing Systems (NIPS)*, 2005.
- [6] M.H. DeGroot and M.J. Schervish, *Probability and Statistics*, third ed. Addison-Wesley, 2001.
- [7] M. Halkidi, Y. Batistakis, and M. Vazirgiannis, "On Clustering Validation Techniques," *J. Intelligent Information Systems*, vol. 17, nos. 2/3, pp. 107-145, 2001.
- [9] A.K. Jain and R.C. Dubes, *Algorithms for Clustering Data*. Prentice Hall, 1998.
- [12] I. Jonyer, L.B. Holder, and D.J. Cook, "Graph-Based Hierarchical Conceptual Clustering in Structural Databases," *Proc. 17th Nat'l Conf. Artificial Intelligence and 12th Conf. Innovative Applications of Artificial Intelligence (AAAI '00)*, p. 1078, 2000.
- [13] G. Karypis, http://glaros.dtc.umn.edu/gkhome/viewscluto, 2008.
- [17] R. Lopez De Mantaras, "A Distance-Based Attribute Selection Measure for Decision Tree Induction," *Machine Learning*, vol. 6, no. 1, pp. 81-92, 1991.
- [18] A.W. Marshall and I. Olkin, *Inequalities: Theory of Majorization and Its Applications*. Academic Press, 1979.
- [19] M. Meila, "Comparing Clusterings: An Axiomatic View," *Proc. 22nd Int'l Conf. Machine Learning (ICML '05)*, pp. 577-584, 2005.
- [21] M.H. Protter and C.B. Morrey Jr., *A First Course in Real Analysis*, second ed. Springer, 1991.
- [22] C.E. Shannon, "A Mathematical Theory of Communication," *Bell System Technical J.*, vol. 27, pp. 379-423, 623-656, 1948.
- [24] D.A. Simovici and S. Jaroszewicz, "A Metric Approach to Building Decision Trees Based on Goodman-Kruskal Association Index," *Proc. Eighth Pacific-Asia Conf. Knowledge Discovery and Data Mining (PAKDD '04)*, pp. 181-190, 2004.
- [25] P.-N. Tan, M. Steinbach, and V. Kumar, *Introduction to Data Mining*. Addison-Wesley, 2005.
- [26] H. Xiong, J. Wu, and J. Chen, "K-Means Clustering Versus Validation Measures: A Data Distribution Perspective," *Proc. 12th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '06)*, pp. 779-784, 2006.