On the Estimation of 'Small' Probabilities by Leaving-One-Out
December 1995 (vol. 17 no. 12)
pp. 1202-1212

Abstract: In this paper, we apply the leaving-one-out concept to the estimation of 'small' probabilities, i.e., the case where the number of training samples is much smaller than the number of possible classes. After deriving the Turing-Good formula in this framework, we introduce several specific models in order to avoid the problems of the original Turing-Good formula: the constrained model, the absolute discounting model, and the linear discounting model. These models are then applied to the problem of bigram-based stochastic language modeling. Experimental results are presented for a German and an English corpus.
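To make the terms in the abstract concrete, the following is a minimal Python sketch of a Turing-Good adjusted count and of an absolute discounting bigram model. It is an illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, the toy corpus is invented, the discount uses the commonly cited closed-form approximation b = n_1 / (n_1 + 2 n_2) (n_r being the number of distinct bigrams seen exactly r times), and the freed probability mass is redistributed by interpolating with a unigram distribution, which is only one of several possible choices.

```python
from collections import Counter

def count_of_counts(counts):
    """Map r -> n_r, the number of distinct events seen exactly r times."""
    return Counter(counts.values())

def turing_good_adjusted_count(r, n_r):
    """Turing-Good adjusted count r* = (r + 1) * n_{r+1} / n_r."""
    if n_r.get(r, 0) == 0:
        raise ValueError(f"n_{r} is zero; adjusted count is undefined")
    return (r + 1) * n_r.get(r + 1, 0) / n_r[r]

def estimate_discount(n_r):
    """Assumed closed-form approximation b = n_1 / (n_1 + 2 * n_2)."""
    n1, n2 = n_r.get(1, 0), n_r.get(2, 0)
    if n1 + 2 * n2 == 0:
        raise ValueError("need at least one event seen once or twice")
    return n1 / (n1 + 2 * n2)

def make_absolute_discounting(bigram_counts, unigram_probs, b):
    """Return p(w | h) using interpolated absolute discounting.

    Every seen bigram count is reduced by b; the mass removed for a
    history h is spread over all words in proportion to unigram_probs
    (an assumption made for this sketch).
    """
    history_totals = Counter()  # N(h): total count of bigrams starting with h
    history_types = Counter()   # number of distinct successors of h
    for (h, _w), c in bigram_counts.items():
        history_totals[h] += c
        history_types[h] += 1

    def prob(w, h):
        total = history_totals[h]
        if total == 0:
            # Unseen history: fall back to the unigram distribution.
            return unigram_probs[w]
        c = bigram_counts.get((h, w), 0)
        freed_mass = b * history_types[h] / total
        return max(c - b, 0.0) / total + freed_mass * unigram_probs[w]

    return prob

# Toy corpus (invented for illustration only).
tokens = "the cat sat on the mat the cat ate".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))
unigram_counts = Counter(tokens)
unigram_probs = {w: c / len(tokens) for w, c in unigram_counts.items()}

n_r = count_of_counts(bigram_counts)
b = estimate_discount(n_r)
p = make_absolute_discounting(bigram_counts, unigram_probs, b)

# The conditional distribution for a seen history sums to one.
total_mass = sum(p(w, "the") for w in unigram_probs)
print(f"b = {b}, sum of p(. | 'the') = {total_mass:.6f}")
```

In this sketch an unseen bigram such as ("the", "sat") receives a nonzero probability from the redistributed mass, which is the zero-frequency behaviour the discounting models are meant to provide. Unlike the raw Turing-Good adjusted count, the discounted estimate stays well defined when some n_r is zero.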

Index Terms:
Stochastic language modeling, leaving-one-out, zero-frequency problem, maximum likelihood estimation, generalization capability.
Citation:
Hermann Ney, Ute Essen, Reinhard Kneser, "On the Estimation of 'Small' Probabilities by Leaving-One-Out," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 17, no. 12, pp. 1202-1212, Dec. 1995, doi:10.1109/34.476512