Information Theoretic Clustering
February 2002 (vol. 24 no. 2)
pp. 158-171

Abstract—Clustering is one of the important topics in pattern recognition. Since only the structure of the data dictates the grouping (unsupervised learning), information theory is a natural criterion to establish the clustering rule. This paper describes a novel valley-seeking clustering algorithm that uses an information theoretic measure to estimate the cost of partitioning the data set. The information theoretic criterion developed here evolved from a recently proposed Renyi entropy estimator that has been successfully applied to other machine learning applications. An improved version of the k-change algorithm is used for optimization because of the stepwise nature of the cost function and the existence of local minima. The new algorithm performs well even on nonlinearly separable data and is able to find nonlinear boundaries between clusters. The algorithm is also applied to the segmentation of magnetic resonance imaging (MRI) data, with very promising results.
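
To make the abstract's ingredients concrete, here is a minimal Python sketch (not the authors' code) of the Parzen-window estimator of Renyi's quadratic entropy, a cross-information-potential style cost for a two-cluster partition, and a simple one-change label sweep standing in for the paper's improved k-change optimizer. The kernel size sigma, the two-cluster restriction, and all function names are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(d2, sigma):
    """Gaussian kernel on squared distances; the convolution of two
    width-sigma Gaussians (normalization constant omitted, which only
    rescales the potential and shifts -log V by a constant)."""
    return np.exp(-d2 / (4.0 * sigma**2))

def pairwise_sq_dists(X, Y):
    """Squared Euclidean distances between rows of X and rows of Y."""
    return ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)

def information_potential(X, sigma):
    """Parzen estimate of the integral of p(x)^2: V = (1/N^2) sum_ij G_ij."""
    N = len(X)
    return gaussian_kernel(pairwise_sq_dists(X, X), sigma).sum() / N**2

def renyi_quadratic_entropy(X, sigma):
    """Renyi's quadratic entropy H2(X) = -log V(X)."""
    return -np.log(information_potential(X, sigma))

def cross_information_potential(X1, X2, sigma):
    """Interaction between two clusters; small when they are well separated."""
    return gaussian_kernel(pairwise_sq_dists(X1, X2), sigma).mean()

def partition_cost(X, labels, sigma):
    """Cost of a two-cluster partition: the cross-information potential
    between the clusters (a divergence-like quantity to be minimized)."""
    X1, X2 = X[labels == 0], X[labels == 1]
    if len(X1) == 0 or len(X2) == 0:
        return np.inf
    return cross_information_potential(X1, X2, sigma)

def one_change_pass(X, labels, sigma):
    """Greedy 1-change pass: flip one label at a time, keep flips that
    lower the cost. The paper's improved k-change optimizer is more
    elaborate; this only illustrates the stepwise nature of the search."""
    for i in range(len(X)):
        trial = labels.copy()
        trial[i] = 1 - trial[i]
        if partition_cost(X, trial, sigma) < partition_cost(X, labels, sigma):
            labels = trial
    return labels
```

Because the cost is built from pairwise kernel evaluations rather than cluster means, minimizing it can carve out the nonlinear valleys between clusters that the abstract mentions, at the price of a stepwise, local-minimum-prone search that motivates the k-change style optimizer.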

Index Terms:
Information theory, clustering, MRI segmentation, entropy, optimization.
Citation:
Erhan Gokcay, Jose C. Principe, "Information Theoretic Clustering," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 2, pp. 158-171, Feb. 2002, doi:10.1109/34.982897