Unsupervised Learning of Finite Mixture Models
March 2002 (vol. 24 no. 3)
pp. 381-396

This paper proposes an unsupervised algorithm for learning a finite mixture model from multivariate data. The adjective “unsupervised” is justified by two properties of the algorithm: 1) it is capable of selecting the number of components and 2) unlike the standard expectation-maximization (EM) algorithm, it does not require careful initialization. The proposed method also avoids another drawback of EM for mixture fitting: the possibility of convergence toward a singular estimate at the boundary of the parameter space. The novelty of our approach is that we do not use a model selection criterion to choose one among a set of preestimated candidate models; instead, we seamlessly integrate estimation and model selection in a single algorithm. Our technique can be applied to any type of parametric mixture model for which it is possible to write an EM algorithm; in this paper, we illustrate it with experiments involving Gaussian mixtures. These experiments testify to the good performance of our approach.
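The idea sketched in the abstract, starting EM with more components than needed and eliminating components whose weights collapse, can be illustrated with a minimal one-dimensional Gaussian-mixture example. This is only an illustration, not the authors' algorithm: the paper uses a minimum message length criterion with component-wise EM, whereas the simple `min_weight` annihilation floor and the synthetic two-cluster data below are assumptions made purely for demonstration.

```python
import numpy as np

def em_gmm_1d(x, k=5, n_iter=200, min_weight=0.02, seed=0):
    """EM for a 1-D Gaussian mixture, over-specifying the number of
    components and annihilating those whose weight collapses."""
    rng = np.random.default_rng(seed)
    n = x.size
    mu = rng.choice(x, size=k, replace=False).astype(float)  # means at random data points
    var = np.full(k, x.var())
    w = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibilities r[i, j] = P(component j | x_i),
        # computed in log space for numerical stability.
        diff = x[:, None] - mu[None, :]
        logp = -0.5 * (diff ** 2 / var + np.log(2 * np.pi * var)) + np.log(w)
        logp -= logp.max(axis=1, keepdims=True)
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances.
        nk = r.sum(axis=0) + 1e-10
        w = nk / n
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
        # Annihilation step (a crude stand-in for the paper's MML-driven
        # elimination): drop components whose weight falls below a floor.
        keep = w > min_weight
        if keep.any() and not keep.all():
            mu, var, w = mu[keep], var[keep], w[keep]
            w /= w.sum()
    return w, mu, var

# Two well-separated clusters; the fit starts with k=5 components.
rng = np.random.default_rng(42)
x = np.concatenate([rng.normal(-5.0, 1.0, 300), rng.normal(5.0, 1.0, 300)])
w, mu, var = em_gmm_1d(x)
print(len(w), np.round(np.sort(mu), 1))
```

The variance floor (`+ 1e-6`) is a standard guard against the singular boundary estimates mentioned in the abstract; the paper itself avoids them differently, through its message-length formulation.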

[1] J. Banfield and A. Raftery, “Model-Based Gaussian and Non-Gaussian Clustering,” Biometrics, vol. 49, pp. 803-821, 1993.
[2] H. Bensmail, G. Celeux, A. Raftery, and C. Robert, “Inference in Model-Based Cluster Analysis,” Statistics and Computing, vol. 7, pp. 1-10, 1997.
[3] J. Bernardo and A. Smith, Bayesian Theory. Chichester, UK: John Wiley & Sons, 1994.
[4] D. Bertsekas, Nonlinear Programming. Belmont, Mass.: Athena Scientific, 1999.
[5] C. Biernacki, G. Celeux, and G. Govaert, “Assessing a Mixture Model for Clustering with the Integrated Completed Likelihood,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, pp. 719-725, 2000.
[6] C. Biernacki, G. Celeux, and G. Govaert, “An Improvement of the NEC Criterion for Assessing the Number of Clusters in a Mixture Model,” Pattern Recognition Letters, vol. 20, pp. 267-272, 1999.
[7] C. Biernacki and G. Govaert, “Using the Classification Likelihood to Choose the Number of Clusters,” Computing Science and Statistics, vol. 29, pp. 451-457, 1997.
[8] H. Bozdogan, “Choosing the Number of Component Clusters in the Mixture Model Using a New Informational Complexity Criterion of the Inverse-Fisher Information Matrix,” Information and Classification, O. Opitz, B. Lausen, and R. Klar, eds., pp. 40-54, Springer Verlag, 1993.
[9] M. Brand, “Structure Discovery in Conditional Probability Models via an Entropic Prior and Parameter Extinction,” Neural Computation, vol. 11, no. 5, pp. 1155-1182, 1999.
[10] J. Campbell, C. Fraley, F. Murtagh, and A. Raftery, “Linear Flaw Detection in Woven Textiles Using Model-Based Clustering,” Pattern Recognition Letters, vol. 18, pp. 1539-1548, 1997.
[11] G. Celeux, S. Chrétien, F. Forbes, and A. Mkhadri, “A Component-Wise EM Algorithm for Mixtures,” Technical Report 3746, INRIA Rhône-Alpes, France, 1999.
[12] G. Celeux and G. Soromenho, “An Entropy Criterion for Assessing the Number of Clusters in a Mixture Model,” Classification J., vol. 13, pp. 195-212, 1996.
[13] S. Chrétien and A. Hero III, “Kullback Proximal Algorithms for Maximum Likelihood Estimation,” IEEE Trans. Information Theory, vol. 46, pp. 1800-1810, 2000.
[14] J. Conway and N. Sloane, Sphere Packings, Lattices, and Groups. New York: Springer Verlag, 1993.
[15] T.M. Cover and J.A. Thomas, Elements of Information Theory. John Wiley & Sons, 1991.
[16] S. Dalal and W. Hall, “Approximating Priors by Mixtures of Natural Conjugate Priors,” J. Royal Statistical Soc. (B), vol. 45, 1983.
[17] A. Dasgupta and A. Raftery, “Detecting Features in Spatial Point Patterns with Clutter Via Model-Based Clustering,” J. Am. Statistical Assoc., vol. 93, pp. 294-302, 1998.
[18] A. Dempster, N. Laird, and D. Rubin, “Maximum Likelihood Estimation from Incomplete Data Via the EM Algorithm,” J. Royal Statistical Soc. B, vol. 39, pp. 1-38, 1977.
[19] R. Duda and P. Hart, Pattern Classification and Scene Analysis. New York: John Wiley & Sons, 1973.
[20] M. Figueiredo and A.K. Jain, “Unsupervised Selection and Estimation of Finite Mixture Models,” Proc. Int'l Conf. Pattern Recognition—ICPR-2000, pp. 87-90, 2000.
[21] M. Figueiredo, J. Leitão, and A.K. Jain, “On Fitting Mixture Models,” Energy Minimization Methods in Computer Vision and Pattern Recognition, E. Hancock and M. Pelillo, eds., pp. 54-69, Springer-Verlag, 1999.
[22] C. Fraley and A. Raftery, “How Many Clusters? Which Clustering Method? Answers Via Model-Based Cluster Analysis,” Technical Report 329, Dept. Statistics, Univ. Washington, Seattle, WA, 1998.
[23] Z. Ghahramani and M. Beal, “Variational Inference for Bayesian Mixtures of Factor Analyzers,” Advances in Neural Information Processing Systems 12, S. Solla, T. Leen, and K.-R. Müller, eds., pp. 449-455, MIT Press, 2000.
[24] Z. Ghahramani and G. Hinton, “The EM Algorithm for Mixtures of Factor Analyzers,” Technical Report CRG-TR-96-1, Univ. of Toronto, Canada, 1997.
[25] T. Hastie and R. Tibshirani, “Discriminant Analysis by Gaussian Mixtures,” J. Royal Statistical Soc. (B), vol. 58, pp. 155-176, 1996.
[26] G.E. Hinton, P. Dayan, and M. Revow, “Modeling the Manifolds of Images of Handwritten Digits,” IEEE Trans. Neural Networks, vol. 8, no. 1, pp. 65-74, Jan. 1997.
[27] T. Hofmann and M. Buhmann, “Pairwise Data Clustering by Deterministic Annealing,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 18, no. 1, pp. 1-14, Jan. 1997.
[28] A.K. Jain and R.C. Dubes, Algorithms for Clustering Data. Englewood Cliffs, N.J.: Prentice Hall, 1988.
[29] A.K. Jain, R.P.W. Duin, and J. Mao, “Statistical Pattern Recognition: A Review,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 1, pp. 4-37, Jan. 2000.
[30] A.K. Jain and F. Farrokhnia, “Unsupervised Texture Segmentation Using Gabor Filters,” Pattern Recognition, vol. 24, no. 12, pp. 1167-1186, 1991.
[31] M. Kloppenburg and P. Tavan, “Deterministic Annealing for Density Estimation by Multivariate Normal Mixtures,” Physical Rev. E, vol. 55, pp. R2089-R2092, 1997.
[32] A. Lanterman, “Schwarz, Wallace, and Rissanen: Intertwining Themes in Theories of Model Order Estimation,” Int'l Statistical Rev., vol. 69, pp. 185-212, Aug. 2001.
[33] G. McLachlan, “On Bootstrapping the Likelihood Ratio Test Statistic for the Number of Components in a Normal Mixture,” J. Royal Statistical Soc. Series (C), vol. 36, pp. 318-324, 1987.
[34] G. McLachlan, Discriminant Analysis and Statistical Pattern Recognition. New York: John Wiley & Sons, 1992.
[35] G. McLachlan and K. Basford, Mixture Models: Inference and Application to Clustering. New York: Marcel Dekker, 1988.
[36] G. McLachlan and T. Krishnan, The EM Algorithm and Extensions. New York: John Wiley & Sons, 1997.
[37] G. McLachlan and D. Peel, Finite Mixture Models. New York: John Wiley & Sons, 2000.
[38] P. Meinicke and H. Ritter, “Resolution-Based Complexity Control for Gaussian Mixture Models,” Neural Computation, vol. 13, no. 2, pp. 453-475, 2001.
[39] K. Mengersen and C. Robert, “Testing for Mixtures: A Bayesian Entropic Approach,” Proc. Fifth Valencia Int'l Meeting Bayesian Statistics 5, J. Bernardo, J. Berger, A. Dawid, and F. Smith, eds., pp. 255-276, 1996.
[40] R. Neal, “Bayesian Mixture Modeling,” Proc. 11th Int'l Workshop Maximum Entropy and Bayesian Methods of Statistical Analysis, pp. 197-211, 1992.
[41] R. Neal and G. Hinton, “A View of the EM Algorithm that Justifies Incremental, Sparse, and Other Variants,” Learning in Graphical Models, M.I. Jordan, ed., pp. 355-368, Kluwer Academic Publishers, 1998.
[42] J. Oliver, R. Baxter, and C. Wallace, “Unsupervised Learning Using MML,” Proc. 13th Int'l Conf. Machine Learning, pp. 364-372, 1996.
[43] P. Pudil, J. Novovicova, and J. Kittler, “Feature Selection Based on the Approximation of Class Densities by Finite Mixtures of the Special Type,” Pattern Recognition, vol. 28, no. 9, pp. 1389-1398, 1995.
[44] A. Rangarajan, “Self Annealing: Unifying Deterministic Annealing and Relaxation Labeling,” Energy Minimization Methods in Computer Vision and Pattern Recognition, M. Pelillo and E. Hancock, eds., pp. 229-244, Springer Verlag, 1997.
[45] C. Rasmussen, “The Infinite Gaussian Mixture Model,” Advances in Neural Information Processing Systems 12, S. Solla, T. Leen, and K.-R. Müller, eds., pp. 554-560, MIT Press, 2000.
[46] S.J. Raudys and A.K. Jain, “Small Sample Size Effects in Statistical Pattern Recognition: Recommendations for Practitioners,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 13, pp. 252-264, 1991.
[47] S. Raudys and V. Pikelis, “On Dimensionality, Sample Size, Classification Error, and Complexity of Classification Algorithms in Pattern Recognition,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 2, pp. 243-252, 1980.
[48] S. Richardson and P. Green, “On Bayesian Analysis of Mixtures with Unknown Number of Components,” J. Royal Statistical Soc. B, vol. 59, pp. 731-792, 1997.
[49] J. Rissanen, Stochastic Complexity in Statistical Inquiry. World Scientific Series in Computer Science, vol. 15, 1989.
[50] S. Roberts, D. Husmeier, I. Rezek, and W. Penny, “Bayesian Approaches to Gaussian Mixture Modeling,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 20, no. 11, Nov. 1998.
[51] K. Roeder and L. Wasserman, “Practical Bayesian Density Estimation Using Mixtures of Normals,” J. Am. Statistical Assoc., vol. 92, pp. 894-902, 1997.
[52] K. Rose, “Deterministic Annealing for Clustering, Compression, Classification, Regression and Related Optimization Problems,” Proc. IEEE, vol. 86, pp. 2210-2239, 1998.
[53] G. Schwarz, “Estimating the Dimension of a Model,” Annals of Statistics, vol. 6, pp. 461-464, 1978.
[54] P. Smyth, “Model Selection for Probabilistic Clustering Using Cross-Validated Likelihood,” Statistics and Computing, vol. 10, pp. 63-72, 2000.
[55] R. Streit and T. Luginbuhl, “Maximum Likelihood Training of Probabilistic Neural Networks,” IEEE Trans. Neural Networks, vol. 5, no. 5, pp. 764-783, 1994.
[56] M.E. Tipping and C.M. Bishop, “Mixtures of Probabilistic Principal Component Analysers,” Neural Computation, vol. 11, no. 2, pp. 443-482, 1999.
[57] D. Titterington, A. Smith, and U. Makov, Statistical Analysis of Finite Mixture Distributions. Chichester, U.K.: John Wiley & Sons, 1985.
[58] N. Ueda and R. Nakano, “Deterministic Annealing EM Algorithm,” Neural Networks, vol. 11, pp. 271-282, 1998.
[59] N. Ueda, R. Nakano, Z. Ghahramani, and G. Hinton, “SMEM Algorithm for Mixture Models,” Neural Computation, vol. 12, pp. 2109-2128, 2000.
[60] C. Wallace and D. Dowe, “Minimum Message Length and Kolmogorov Complexity,” The Computer J., vol. 42, no. 4, pp. 270-283, 1999.
[61] C. Wallace and P. Freeman, “Estimation and Inference Via Compact Coding,” J. Royal Statistical Soc. (B), vol. 49, no. 3, pp. 241-252, 1987.
[62] M. Windham and A. Cutler, “Information Ratios for Validating Mixture Analysis,” J. Am. Statistical Assoc., vol. 87, pp. 1188-1192, 1992.
[63] L. Xu and M.I. Jordan, “On Convergence Properties of the EM Algorithm for Gaussian Mixtures,” Neural Computation, vol. 8, pp. 129-151, 1996.
[64] A. Zellner, “Maximal Data Information Prior Distributions,” New Developments in the Applications of Bayesian Methods, A. Aykac and C. Brumat, eds., pp. 211-232, Amsterdam: North Holland, 1977.

Index Terms:
finite mixtures, unsupervised learning, model selection, minimum message length criterion, Bayesian methods, expectation-maximization algorithm, clustering
M.A.T. Figueiredo, A.K. Jain, "Unsupervised Learning of Finite Mixture Models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 3, pp. 381-396, March 2002, doi:10.1109/34.990138