Incorporating Knowledge Sources Into a Statistical Acoustic Model for Spoken Language Communication Systems
September 2007 (vol. 56 no. 9)
pp. 1199-1211
This paper introduces a general framework for incorporating additional sources of knowledge into an HMM-based statistical acoustic model. Since the knowledge sources are often derived from different domains, it may be difficult to formulate a probabilistic function of the model without learning the causal dependencies between the sources. We utilized a Bayesian network framework to solve this problem. The advantages of this graphical model framework are that (1) it allows the probabilistic relationships between information sources to be learned and (2) it facilitates the decomposition of the joint probability density function (PDF) into a linked set of local conditional PDFs. In this way, a simplified form of the model can be constructed and reliably estimated using a limited amount of training data. We applied this framework to the problem of incorporating wide-phonetic knowledge information, which often suffers from data sparsity and memory constraints. We evaluated how well the proposed method performed on an LVCSR task using English speech data that contained two different types of accents. The experimental results revealed that it improved word accuracy with respect to a standard HMM, with or without additional sources of knowledge.
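The decomposition the abstract describes can be illustrated with a minimal sketch (not the paper's actual model): a toy discrete Bayesian network in which a hidden state Q generates both an acoustic observation X and an additional knowledge source K. The variable names, states, and probability values below are invented for illustration only.

```python
# Toy Bayesian network Q -> X, Q -> K. The joint PDF factorizes into
# local conditional PDFs:
#     P(X, K, Q) = P(Q) * P(X | Q) * P(K | Q)
# so each factor can be estimated separately from limited training data.

p_q = {0: 0.6, 1: 0.4}                            # prior over hidden state Q
p_x_given_q = {0: {"a": 0.7, "b": 0.3},           # observation model P(X | Q)
               1: {"a": 0.2, "b": 0.8}}
p_k_given_q = {0: {"wide": 0.9, "narrow": 0.1},   # knowledge source model P(K | Q)
               1: {"wide": 0.3, "narrow": 0.7}}

def joint(x, k, q):
    """Joint probability computed from the linked local conditional PDFs."""
    return p_q[q] * p_x_given_q[q][x] * p_k_given_q[q][k]

def marginal(x, k):
    """Marginalize out the hidden state: P(X, K) = sum_q P(X, K, Q=q)."""
    return sum(joint(x, k, q) for q in p_q)

print(round(marginal("a", "wide"), 4))  # -> 0.402
```

The point of the factorization is the one stated in the abstract: each local table involves far fewer parameters than a full joint table over (X, K, Q), which is what makes reliable estimation from sparse data feasible.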


Index Terms:
Acoustic modeling, knowledge incorporation, Bayesian network, junction tree, wide-context dependency
Citation:
Sakriani Sakti, Konstantin Markov, Satoshi Nakamura, "Incorporating Knowledge Sources Into a Statistical Acoustic Model for Spoken Language Communication Systems," IEEE Transactions on Computers, vol. 56, no. 9, pp. 1199-1211, Sept. 2007, doi:10.1109/TC.2007.1069