Issue No.04 - April (2008 vol.20)
pp: 462-474
ABSTRACT
In this paper, we examine the problem of clustering count data, which we analyze using finite mixtures of distributions. The multinomial and the multinomial Dirichlet distributions are widely used to model count data. We show that these two distributions are not the best choice in every application, and we propose another model, the multinomial generalized Dirichlet distribution (MGDD), which is the composition of the generalized Dirichlet distribution and the multinomial, in the same way that the multinomial Dirichlet distribution (MDD) is the composition of the Dirichlet and the multinomial. Parameter estimation and the selection of the number of components in our model are based on the deterministic annealing expectation-maximization (DAEM) approach and the minimum description length (MDL) criterion, respectively. We compare our method to standard approaches such as multinomial and multinomial Dirichlet mixtures to show its merits. The comparison involves different applications such as spatial color image database indexing, handwritten digit recognition, and text document clustering.
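To make the "composition" in the abstract concrete, the following is a minimal Python sketch of the two compound log-probabilities, assuming the standard Connor-Mosimann parameterization of the generalized Dirichlet with parameters a_d, b_d. The function names (mdd_log_pmf, mgdd_log_pmf, total_mass) are illustrative and do not come from the paper, and the sketch covers only a single component, not the mixture, DAEM estimation, or MDL selection.

```python
import math
from itertools import product


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_multinomial_coeff(counts):
    """log( n! / (x_1! ... x_K!) ) for a vector of counts."""
    n = sum(counts)
    return math.lgamma(n + 1) - sum(math.lgamma(x + 1) for x in counts)


def mdd_log_pmf(counts, alpha):
    """Log-probability of counts under a multinomial Dirichlet (MDD)."""
    n = sum(counts)
    total = sum(alpha)
    out = log_multinomial_coeff(counts)
    out += math.lgamma(total) - math.lgamma(n + total)
    out += sum(math.lgamma(x + a) - math.lgamma(a)
               for x, a in zip(counts, alpha))
    return out


def mgdd_log_pmf(counts, a, b):
    """Log-probability of counts under a multinomial generalized Dirichlet.

    Assumption: counts has len(a) + 1 entries and (a, b) follow the
    Connor-Mosimann parameterization, so the compound factorizes into
    one Beta-function ratio per category except the last.
    """
    out = log_multinomial_coeff(counts)
    remaining = sum(counts)
    for x, a_d, b_d in zip(counts, a, b):
        remaining -= x  # counts falling in the categories after this one
        out += log_beta(a_d + x, b_d + remaining) - log_beta(a_d, b_d)
    return out


def total_mass(n, a, b):
    """Sum the MGDD probabilities over all count vectors with total n."""
    k = len(a) + 1
    return sum(math.exp(mgdd_log_pmf(c, a, b))
               for c in product(range(n + 1), repeat=k) if sum(c) == n)


# The MDD is recovered when each b_d equals the sum of the later alphas.
alpha = (1.5, 2.0, 0.7)
a, b = alpha[:2], (2.0 + 0.7, 0.7)
print(mdd_log_pmf((3, 0, 2), alpha), mgdd_log_pmf((3, 0, 2), a, b))
```

The extra b parameters are what give the generalized Dirichlet a more flexible covariance structure than the Dirichlet; fixing them as in the last lines collapses the model back to the MDD, which is a convenient sanity check for any implementation.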
INDEX TERMS
Clustering, feature extraction, image databases
CITATION
Nizar Bouguila, "Clustering of Count Data Using Generalized Dirichlet Multinomial Distributions", IEEE Transactions on Knowledge & Data Engineering, vol.20, no. 4, pp. 462-474, April 2008, doi:10.1109/TKDE.2007.190726