Mutual Information Theory for Adaptive Mixture Models
April 2001 (vol. 23 no. 4)
pp. 396-403

Abstract—Many pattern recognition systems need to estimate an underlying probability density function (pdf). Mixture models are commonly used for this purpose: the underlying pdf is estimated by a finite mixture of distributions. The basic computational element of a density mixture model is a component with a nonlinear mapping function, which takes part in the mixing. Selecting an optimal set of components is important to ensure an efficient and accurate estimate of the underlying pdf. Previous work has commonly estimated the underlying pdf based on the information contained in patterns. In this paper, mutual information theory is employed to measure whether two components are statistically dependent. If a component has small mutual information, it is statistically independent of the other components; hence, it makes a significant contribution to the system pdf and should not be removed. If, however, a component has large mutual information, it is unlikely to be statistically independent of the other components and may be removed without significant damage to the estimated pdf. Repeatedly removing components with large, positive mutual information yields a density mixture model with an optimal structure that is very close to the true pdf.
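
The pruning idea in the abstract can be illustrated with a short sketch. This is not the authors' algorithm: the abstract does not specify the mutual information estimator, the component form, or the stopping rule. The sketch assumes 1-D Gaussian kernels with a shared width `sigma`, a histogram (plug-in) estimate of mutual information between the activation vectors of two components, and a user-chosen `threshold`. All function and parameter names are hypothetical.

```python
import math
from collections import Counter


def activations(samples, mean, sigma):
    """Response of one 1-D Gaussian kernel to every sample (assumed component form)."""
    return [math.exp(-0.5 * ((x - mean) / sigma) ** 2) for x in samples]


def _bin(values, bins):
    """Map each value to an equal-width bin index over the values' own range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # guard against a constant vector
    return [min(int((v - lo) / width), bins - 1) for v in values]


def mutual_information(a, b, bins=8):
    """Plug-in (histogram) estimate of I(A;B) in nats; an assumed estimator."""
    ia, ib = _bin(a, bins), _bin(b, bins)
    n = len(a)
    joint = Counter(zip(ia, ib))
    pa, pb = Counter(ia), Counter(ib)
    # sum over occupied cells of p(i,j) * log( p(i,j) / (p(i) * p(j)) )
    return sum((c / n) * math.log(c * n / (pa[i] * pb[j]))
               for (i, j), c in joint.items())


def total_mi(samples, means, sigma, bins=8):
    """For each component, the summed mutual information with all other components."""
    acts = [activations(samples, m, sigma) for m in means]
    scores = [0.0] * len(means)
    for i in range(len(means)):
        for j in range(i + 1, len(means)):
            mi = mutual_information(acts[i], acts[j], bins)
            scores[i] += mi
            scores[j] += mi
    return scores


def prune(samples, means, sigma, threshold, bins=8):
    """Repeatedly drop the component with the largest score while it exceeds threshold."""
    means = list(means)
    while len(means) > 1:
        scores = total_mi(samples, means, sigma, bins)
        worst = max(range(len(means)), key=scores.__getitem__)
        if scores[worst] <= threshold:
            break
        del means[worst]
    return means
```

Calling `prune(samples, means, sigma=1.0, threshold=t)` returns the centers that survive. In this sketch `threshold` and `bins` are tuning parameters, and re-estimating the mixing weights after each removal is left out.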

Index Terms:
Adaptive mixtures, entropy, mutual information, pattern recognition, statistical dependence, uncertainty.
Zheng Rong Yang, Mark Zwolinski, "Mutual Information Theory for Adaptive Mixture Models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 4, pp. 396-403, April 2001, doi:10.1109/34.917574