On the Probabilistic Interpretation of Neural Network Classifiers and Discriminative Training Criteria
Hermann Ney
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 17, no. 2, pp. 107-119, February 1995

Abstract—A probabilistic interpretation is presented for two important issues in neural network based classification: the interpretation of discriminative training criteria and of the neural network outputs, and the interpretation of the structure of the neural network. The problem of finding a suitable structure for the neural network can be linked to a number of well-established techniques in statistical pattern recognition, such as the method of potential functions, kernel densities, and continuous mixture densities. Discriminative training of neural network outputs amounts to approximating the class posterior probabilities of the classical statistical approach. This paper extends these links by introducing and analyzing novel criteria, such as maximizing the class probability and minimizing the smoothed error rate, defined in the framework of class-conditional probability density functions. We show that these criteria can be interpreted in terms of weighted maximum likelihood estimation, where the weights depend in a complicated nonlinear fashion on the model parameters to be trained. In particular, this approach covers widely used techniques such as corrective training, learning vector quantization, and linear discriminant analysis.
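The criteria named in the abstract are defined in the paper for class-conditional densities, but this record page does not reproduce the paper's equations. As a hedged sketch only (the exact definitions, smoothing functions, and weights used by Ney are in the paper body), the standard forms such criteria take, for class-conditional models p_lambda(x|c) with priors p(c), are:

\[
  p_\lambda(c \mid x) = \frac{p(c)\, p_\lambda(x \mid c)}{\sum_{c'} p(c')\, p_\lambda(x \mid c')}
  \qquad \text{(Bayes posterior)},
\]
\[
  F_{\mathrm{MCP}}(\lambda) = \sum_{n=1}^{N} \log p_\lambda(c_n \mid x_n)
  \qquad \text{(maximum class probability)},
\]
\[
  F_{\mathrm{SER}}(\lambda) = \sum_{n=1}^{N} \bigl( 1 - p_\lambda(c_n \mid x_n) \bigr)
  \qquad \text{(one common smoothed error rate)}.
\]

Differentiating either criterion yields update terms of the form $w_n(\lambda)\,\partial_\lambda \log p_\lambda(x_n \mid c_n)$, i.e. weighted maximum likelihood estimation in which the weights $w_n(\lambda)$ depend nonlinearly on the current parameters; this is the interpretation the abstract refers to.

The abstract's claim that discriminatively trained network outputs approximate class posterior probabilities can also be checked numerically. The following minimal sketch is not code from the paper; the data distribution, the one-parameter logistic "network," and the step size are illustrative assumptions. On two equiprobable unit-variance Gaussian classes the Bayes posterior is sigmoid(2x) in closed form, and cross-entropy training recovers it:

import numpy as np

rng = np.random.default_rng(0)

# Two equiprobable classes; class y has a unit-variance Gaussian at mean 2y - 1.
N = 20000
y = rng.integers(0, 2, N)
x = rng.normal(loc=2.0 * y - 1.0, scale=1.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

true_post = sigmoid(2.0 * x)  # closed-form Bayes posterior P(y=1 | x)

# One logistic output trained with the cross-entropy criterion by plain
# gradient descent -- a stand-in for the discriminative criteria above.
w, b, lr = 0.0, 0.0, 0.1
for _ in range(2000):
    p = sigmoid(w * x + b)
    w -= lr * np.mean((p - y) * x)  # d(mean cross-entropy)/dw
    b -= lr * np.mean(p - y)        # d(mean cross-entropy)/db

model_post = sigmoid(w * x + b)
print(f"learned w = {w:.3f} (true 2.0), b = {b:.3f} (true 0.0)")
print(f"mean |model - true| posterior gap: {np.abs(model_post - true_post).mean():.4f}")

The learned parameters should land near the true posterior's w = 2, b = 0, so the trained output tracks the Bayes posterior itself rather than merely the decision boundary.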

Index Terms:
Statistical pattern recognition, neural networks, discriminant functions, training criteria, speech recognition.
Citation:
Hermann Ney, "On the Probabilistic Interpretation of Neural Network Classifiers and Discriminative Training Criteria," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 17, no. 2, pp. 107-119, Feb. 1995, doi:10.1109/34.368176