Issue No.01 - Jan. (2014 vol.36)
pp: 1
Ziheng Zhou, University of Oulu, Oulu
Xiaopeng Hong, University of Oulu, Oulu
Guoying Zhao, University of Oulu, Oulu
Matti Pietikainen, University of Oulu, Oulu
ABSTRACT
Visual speech recognition involves decoding the video dynamics of a talking mouth in a high-dimensional visual space. In this paper, we propose a generative latent variable model that provides a compact representation of visual speech data. The model uses separate latent variables to represent the inter-speaker variations of visual appearance and the variations caused by uttering. It also incorporates the structural information of the observed visual data within an utterance by modelling that structure with a path graph and placing the variables' priors along the graph's embedded curve.
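The path-graph idea in the abstract can be illustrated with a small sketch. It is not the authors' model: it assumes that the "embedded curve" is obtained from the low-order eigenvectors of the path graph's Laplacian (a Laplacian-eigenmaps style embedding), and the function names `path_graph_laplacian`, `embed_path`, and `analytic_eigenvector` are hypothetical. For a path graph with n nodes, one node per video frame, the k-th Laplacian eigenvector is known in closed form as cos(πk(i − 1/2)/n), so the frames of an utterance fall on a smooth curve built from cosines.

```python
import numpy as np

def path_graph_laplacian(n):
    """Unnormalized Laplacian L = D - A of a path graph with n nodes."""
    adjacency = np.zeros((n, n))
    idx = np.arange(n - 1)
    adjacency[idx, idx + 1] = 1.0   # edge between frame i and frame i + 1
    adjacency[idx + 1, idx] = 1.0   # the graph is undirected
    degree = np.diag(adjacency.sum(axis=1))
    return degree - adjacency

def embed_path(n, dim):
    """Return the `dim` smallest non-trivial eigenvalues and their eigenvectors.

    Row i of the returned matrix is the embedded position of frame i.
    """
    eigvals, eigvecs = np.linalg.eigh(path_graph_laplacian(n))  # ascending order
    # Skip the constant eigenvector that belongs to eigenvalue 0.
    return eigvals[1:dim + 1], eigvecs[:, 1:dim + 1]

def analytic_eigenvector(n, k):
    """Closed-form k-th eigenvector of the path-graph Laplacian, unit norm."""
    i = np.arange(1, n + 1)
    v = np.cos(np.pi * k * (i - 0.5) / n)
    return v / np.linalg.norm(v)

if __name__ == "__main__":
    n_frames = 20  # assumed utterance length, for illustration only
    vals, curve = embed_path(n_frames, dim=3)
    for k in range(1, 4):
        # Eigenvectors are defined up to sign, so compare absolute overlap.
        overlap = abs(curve[:, k - 1] @ analytic_eigenvector(n_frames, k))
        print(k, round(float(vals[k - 1]), 4), round(float(overlap), 4))
```

Because the curve depends only on the number of frames, a prior placed along it can encode the position of a frame within an utterance independently of the speaker, which is one plausible reading of how the structural information enters the model.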
INDEX TERMS
Visualization, Hidden Markov models, Image sequences, Mouth, Speech, Speech recognition, Data models, Pattern analysis, Computer vision, Representations, data structures, and transforms
CITATION
Ziheng Zhou, Xiaopeng Hong, Guoying Zhao, Matti Pietikainen, "A Compact Representation of Visual Speech Data Using Latent Variables", IEEE Transactions on Pattern Analysis & Machine Intelligence, vol.36, no. 1, pp. 1, Jan. 2014, doi:10.1109/TPAMI.2013.173