Issue No. 9, September 2009 (vol. 31), pp. 1700-1707
Kate Saenko , MIT, Cambridge
Karen Livescu , Toyota Technological Institute, Chicago
James Glass , MIT, Cambridge
Trevor Darrell , MIT, Cambridge
We study the problem of automatic visual speech recognition (VSR) using dynamic Bayesian network (DBN)-based models consisting of multiple sequences of hidden states, each corresponding to an articulatory feature (AF) such as lip opening (LO) or lip rounding (LR). A bank of discriminative articulatory feature classifiers provides input to the DBN, in the form of either virtual evidence (VE) (scaled likelihoods) or raw classifier margin outputs. We present experiments on two tasks, a medium-vocabulary word-ranking task and a small-vocabulary phrase recognition task. We show that articulatory feature-based models outperform baseline models, and we study several aspects of the models, such as the effects of allowing articulatory asynchrony, of using dictionary-based versus whole-word models, and of incorporating classifier outputs via virtual evidence versus alternative observation models.
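The classifier-to-DBN interface described above can be sketched in a few lines: a raw SVM margin is mapped to a posterior with Platt's sigmoid, and the posterior is divided by the class prior to give a scaled likelihood that serves as virtual (soft) evidence on an articulatory-feature state variable. The sigmoid parameters and the uniform prior below are illustrative assumptions for a two-class feature such as lip rounding, not values from the paper.

```python
import math

def platt_posterior(margin, A=-1.7, B=0.0):
    """Map a raw SVM margin to a class posterior via Platt's sigmoid,
    p(state=1 | obs) = 1 / (1 + exp(A * margin + B)).
    A and B are placeholder values; in practice they are fit on held-out data."""
    return 1.0 / (1.0 + math.exp(A * margin + B))

def virtual_evidence(margin, prior=0.5, A=-1.7, B=0.0):
    """Scaled likelihoods p(state | obs) / p(state) for a binary
    articulatory-feature variable, usable as soft (virtual) evidence
    attached to that variable in the DBN."""
    post = platt_posterior(margin, A, B)
    return {1: post / prior, 0: (1.0 - post) / (1.0 - prior)}

# Example: a frame where the feature classifier fires positively.
ve = virtual_evidence(margin=1.2)
```

A positive margin yields evidence greater than 1 for the "on" state and less than 1 for the "off" state, biasing (but not forcing) the DBN's inference toward the classifier's decision.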
Keywords: Visual speech recognition, articulatory features, dynamic Bayesian networks, support vector machines.
Kate Saenko, Karen Livescu, James Glass, Trevor Darrell, "Multistream Articulatory Feature-Based Models for Visual Speech Recognition", IEEE Transactions on Pattern Analysis & Machine Intelligence, vol.31, no. 9, pp. 1700-1707, September 2009, doi:10.1109/TPAMI.2008.303