Extraction of Visual Features for Lipreading
February 2002 (vol. 24 no. 2)
pp. 198-213

Abstract—The multimodal nature of speech is often ignored in human-computer interaction, yet lip deformations and other body motions, such as those of the head, convey additional information. Integrating speech cues from many sources improves intelligibility, especially when the acoustic signal is degraded. This paper shows how this additional, often complementary, visual speech information can be used for speech recognition. Three methods for parameterizing lip image sequences for recognition using hidden Markov models are compared. Two of these are top-down approaches that fit a model of the inner and outer lip contours and derive lipreading features from a principal component analysis of shape, or of shape and appearance, respectively. The third, bottom-up, method uses a nonlinear scale-space analysis to form features directly from the pixel intensities. All methods are compared on a multitalker visual speech recognition task of isolated letters.
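The top-down shape parameterization described above can be sketched as follows. This is a minimal, hypothetical illustration of the general idea (PCA over tracked lip contour landmarks yielding low-dimensional per-frame features), not the authors' implementation; the landmark count, function name, and data are invented for the example.

```python
import numpy as np

def pca_shape_features(shapes, n_components=5):
    """Derive per-frame lipreading features by PCA of lip shape.

    shapes: (n_frames, 2 * n_landmarks) array, each row the concatenated
    (x, y) coordinates of tracked inner/outer lip contour landmarks.
    Returns projections of each frame onto the top principal modes
    of shape variation.
    """
    mean = shapes.mean(axis=0)
    centered = shapes - mean
    # Right singular vectors of the centered data are the eigenvectors
    # of the covariance matrix, i.e. the principal modes of variation.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    modes = vt[:n_components]           # (n_components, 2 * n_landmarks)
    return centered @ modes.T           # (n_frames, n_components)

# Toy usage: 100 frames, 20 landmarks (40 coordinates per frame).
rng = np.random.default_rng(0)
shapes = rng.normal(size=(100, 40))
features = pca_shape_features(shapes)
print(features.shape)  # (100, 5)
```

The resulting low-dimensional feature vectors, one per video frame, are the kind of observation sequence that a hidden Markov model recognizer consumes.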

[1] A. Adjoudani and C. Benoît, “On the Integration of Auditory and Visual Parameters in an HMM-Based ASR,” Speechreading by Humans and Machines: Models, Systems, and Applications, vol. 150, pp. 461–471, 1996.
[2] B. Atal and L. Hanauer, “Speech Analysis and Synthesis by Linear Prediction of the Speech Wave,” J. Acoustical Soc. of America, vol. 50, pp. 637-655, 1971.
[3] J.A. Bangham, P. Chardaire, C.J. Pye, and P.D. Ling, “Multiscale Nonlinear Decomposition: The Sieve Decomposition Theorem,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 18, no. 5, pp. 529-539, May 1996.
[4] J.A. Bangham, R. Harvey, P. Ling, and R.V. Aldridge, “Morphological Scale-Space Preserving Transforms in Many Dimensions,” J. Electronic Imaging, vol. 5, no. 3, pp. 283-299, July 1996.
[5] J.A. Bangham, R. Harvey, P. Ling, and R.V. Aldridge, “Nonlinear Scale-Space from n-Dimensional Sieves,” Proc. European Conf. Computer Vision, vol. 1, pp. 189-198, 1996.
[6] J.A. Bangham, S.J. Impey, and F.W.D. Woodhams, “A Fast 1D Sieve Transform for Multiscale Signal Decomposition,” Proc. European Signal Processing Conf., pp. 1621-1624, 1994.
[7] J.A. Bangham, P. Ling, and R. Harvey, “Scale-Space from Nonlinear Filters,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 18, no. 5, pp. 520-528, May 1996.
[8] J.A. Bangham, P. Ling, and R. Young, “Multiscale Recursive Medians, Scale-Space, and Transforms with Applications to Image Processing,” IEEE Trans. Image Processing, vol. 5, no. 6, pp. 1043-1048, 1996.
[9] S. Basu, N. Oliver, and A. Pentland, “3D Modeling and Tracking of Human Lip Motions,” Proc. Int'l Conf. Computer Vision, 1998.
[10] Proc. ESCA Workshop Audio-Visual Speech Processing, C. Benoît and R. Campbell, eds., Rhodes, Sept. 1997.
[11] S. Boll, “Suppression of Acoustic Noise in Speech Using Spectral Subtraction,” IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 27, pp. 113-120, 1979.
[12] A. Bosson and R. Harvey, “Using Occlusion Models to Evaluate Scale Space Processors,” Proc. IEEE Int'l Conf. Image Processing, 1998.
[13] C. Bregler, H. Hild, S. Manke, and A. Waibel, “Improving Connected Letter Recognition by Lipreading,” Proc. Int'l Conf. Acoustics, Speech and Signal Processing, vol. 1, pp. 557-560, 1993.
[14] C. Bregler and Y. Konig, “‘Eigenlips’ for Robust Speech Recognition,” Proc. Int'l Conf. Acoustics, Speech, and Signal Processing, pp. 669-672, 1994.
[15] C. Bregler and S.M. Omohundro, “Learning Visual Models for Lipreading,” Computational Imaging and Vision, chapter 13, vol. 9, pp. 301-320, 1997.
[16] C. Bregler, S.M. Omohundro, and J. Shi, “Towards a Robust Speechreading Dialog System,” NATO ASI Series F: Computer and Systems Sciences, pp. 409-423, Sept. 1996.
[17] N.M. Brooke and S.D. Scott, “PCA Image Coding Schemes and Visual Speech Intelligibility,” Proc. Inst. of Acoustics, vol. 16, no. 5, pp. 123-129, 1994.
[18] N.M. Brooke, M.J. Tomlinson, and R.K. Moore, “Automatic Speech Recognition that Includes Visual Speech Cues,” Proc. Inst. of Acoustics, vol. 16, no. 5, pp. 15-22, 1994.
[19] J. Bulwer, Philocopus, or the Deaf and Dumbe Mans Friend. Humphrey and Moseley, 1648.
[20] Hearing by Eye II: Advances in the Psychology of Speechreading and Auditory-Visual Speech, R. Campbell, B. Dodd, and D. Burnham, eds., Psychology Press, 1998.
[21] C. Chatfield and A. J. Collins, Introduction to Multivariate Analysis. Chapman and Hall, 1991.
[22] T. Chen and R.R. Rao, “Audio-Visual Integration in Multimodal Communication,” Proc. IEEE, vol. 86, no. 5, pp. 837-852, May 1998.
[23] C.C. Chibelushi, S. Gandon, J.S.D. Mason, F. Deravi, and R.D. Johnston, “Design Issues for a Digital Audio-Visual Integrated Database,” IEE Colloquium on Integrated Audio-Visual Processing, number 1996/213, pp. 7/1-7/7, Nov. 1996.
[24] T. Coianiz, L. Torresani, and B. Caprile, “2D Deformable Models for Visual Speech Analysis,” IEEE Trans. Speech and Audio Processing, pp. 391-398, Sept. 1996.
[25] T. Cootes, G.J. Edwards, and C. Taylor, “Comparing Active Shape Models with Active Appearance Models,” Proc. British Machine Vision Conf., vol. 1, pp. 173-183, 1999.
[26] T.F. Cootes, G.J. Edwards, and C.J. Taylor, “Active Appearance Models,” Proc. European Conf. Computer Vision, pp. 484-498, June 1998.
[27] T.F. Cootes, A. Hill, C.J. Taylor, and J. Haslam, “The Use of Active Shape Models for Locating Structures in Medical Images,” Image and Vision Computing, vol. 12, no. 6, pp. 355-366, 1994.
[28] T.F. Cootes, C.J. Taylor, D.H. Cooper, and J. Graham, “Active Shape Models—Their Training and Application,” Computer Vision and Image Understanding, vol. 61, no. 1, pp. 38-59, Jan. 1995.
[29] T.F. Cootes, C.J. Taylor, and A. Lanitis, “Active Shape Models: Evaluation of a Multiresolution Method for Improving Image Search,” Proc. British Machine Vision Conf., E. Hancock, ed., pp. 327-336, 1994.
[30] S. Cox, I. Matthews, and A. Bangham, “Combining Noise Compensation with Visual Information in Speech Recognition,” Proc. ESCA Workshop Audio-Visual Speech Processing, pp. 53-56, 1997.
[31] B. Dautrich, L. Rabiner, and T. Martin, “On the Effects of Varying Filter Bank Parameters on Isolated Word Recognition,” IEEE Trans. Acoustics, Speech and Signal Processing, vol. 31, pp. 793-807, Aug. 1983.
[32] S. Davis and P. Mermelstein, “Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences,” IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 28, pp. 357-366, 1980.
[33] P. Duchnowski, M. Hunke, D. Büsching, U. Meier, and A. Waibel, “Toward Movement-Invariant Automatic Lip-Reading and Speech Recognition,” Proc. Int'l Conf. Spoken Language Processing, pp. 109-112, 1995.
[34] G.J. Edwards, T.F. Cootes, and C.J. Taylor, “Face Recognition Using Active Appearance Models,” Proc. European Conf. Computer Vision, pp. 582-595, June 1998.
[35] G.J. Edwards, C. Taylor, and T.F. Cootes, “Interpreting Face Images Using Active Appearance Models,” Proc. Third Int'l Conf. Automatic Face and Gesture Recognition, pp. 300-305, 1998.
[36] N.P. Erber, “Auditory-Visual Perception of Speech,” J. Speech and Hearing Disorders, vol. 40, pp. 481-492, 1975.
[37] S. Furui, “Speaker Independent Isolated Word Recognition Using Dynamic Features of the Speech Spectrum,” IEEE Trans. Acoustics, Speech, and Signal Processing, 1984.
[38] A.J. Goldschen, “Continuous Automatic Speech Recognition by Lipreading,” PhD thesis, George Washington Univ., 1993.
[39] A.J. Goldschen, O.S. Garcia, and E.D. Petajan, “Continuous Automatic Speech Recognition by Lipreading,” Computational Imaging and Vision, chapter 14, pp. 321-343, 1997.
[40] K.P. Green, “The Use of Auditory and Visual Information During Phonetic Processing: Implications for Theories of Speech Perception,” Hearing by Eye II: Advances in the Psychology of Speechreading and Auditory-Visual Speech, pp. 3-25, 1998.
[41] R. Harvey, A. Bosson, and J.A. Bangham, “Robustness of Some Scale-Spaces,” Proc. British Machine Vision Conf., vol. 1, pp. 11-20, 1997.
[42] R. Harvey, I. Matthews, J.A. Bangham, and S. Cox, “Lip Reading from Scale-Space Measurements,” Proc. Conf. Computer Vision and Pattern Recognition, pp. 582-587, June 1997.
[43] J. Haslam, C.J. Taylor, and T.F. Cootes, “A Probabilistic Fitness Measure for Deformable Template Models,” Proc. British Machine Vision Conf., pp. 33-42, 1994.
[44] M.E. Hennecke, “Audio-Visual Speech Recognition: Preprocessing, Learning and Sensory Integration,” PhD thesis, Stanford Univ., Sept. 1997.
[45] M.E. Hennecke, D.G. Stork, and K.V. Prasad, “Visionary Speech: Looking Ahead to Practical Speechreading Systems,” NATO ASI Series F: Computer and Systems Science, pp. 331-349, 1996.
[46] A. Hill and C.J. Taylor, “Automatic Landmark Generation for Point Distribution Models,” Proc. British Machine Vision Conf., pp. 429-438, 1994.
[47] A. Holmes and C. Taylor, “Developing a Measure of Similarity between Pixel Signatures,” Proc. British Machine Vision Conf., vol. 2, pp. 614-623, 1999.
[48] R. Kaucic and A. Blake, “Accurate, Real-Time, Unadorned Lip Tracking,” Proc. Sixth Int'l Conf. Computer Vision, 1998.
[49] R. Kaucic, B. Dalton, and A. Blake, “Real-Time Lip Tracking for Audio-Visual Speech Recognition Applications,” Proc. European Conf. Computer Vision, B. Buxton and R. Cipolla, eds., pp. 376-387, Apr. 1996.
[50] J.J. Koenderink, “The Structure of Images,” Biological Cybernetics, vol. 50, pp. 363-370, 1984.
[51] S.E. Levinson, L.R. Rabiner, and M.M. Sondhi, “An Introduction to the Application of the Theory of Probabilistic Functions of a Markov Process to Automatic Speech Recognition,” The Bell System Technical J., vol. 62, no. 4, pp. 1035-1074, Apr. 1983.
[52] N. Li, S. Dettmer, and M. Shah, “Visually Recognizing Speech Using Eigensequences,” Computational Imaging and Vision, chapter 15, pp. 345-371, 1997.
[53] T. Lindeberg, Scale-Space Theory in Computer Vision. Kluwer Academic, 1994.
[54] J. Luettin, “Visual Speech and Speaker Recognition,” PhD thesis, Univ. of Sheffield, May 1997.
[55] J. Luettin and N.A. Thacker, “Speechreading Using Probabilistic Models,” Computer Vision and Image Understanding, vol. 65, no. 2, pp. 163-178, Feb. 1997.
[56] J. Luettin, N.A. Thacker, and S.W. Beet, “Speechreading Using Shape and Intensity Information,” Proc. Fourth Int'l Conf. Spoken Language Processing (ICSLP '96), vol. 1, pp. 58-61, 1996.
[57] J. MacDonald and H. McGurk, “Visual Influences on Speech Perception Processes,” Perception and Psychophysics, vol. 24, pp. 253-257, 1978.
[58] K. Mase and A. Pentland, “Automatic Lipreading by Optical-Flow Analysis,” Systems and Computers in Japan, vol. 22, no. 6, pp. 67-75, 1991.
[59] G. Matheron, Random Sets and Integral Geometry. Wiley, 1975.
[60] I. Matthews, “Features for Audio-Visual Speech Recognition,” PhD thesis, School of Information Systems, Univ. East Anglia, Oct. 1998.
[61] I. Matthews, J.A. Bangham, R. Harvey, and S. Cox, “A Comparison of Active Shape Model and Scale Decomposition Based Features for Visual Speech Recognition,” Proc. European Conf. Computer Vision, pp. 514-528, June 1998.
[62] H. McGurk and J. MacDonald, “Hearing Lips and Seeing Voices,” Nature, vol. 264, pp. 746-748, Dec. 1976.
[63] U. Meier, R. Stiefelhagen, and J. Yang, “Preprocessing of Visual Speech Under Real World Conditions,” Proc. ESCA Workshop Audio-Visual Speech Processing, pp. 113-116, Sept. 1997.
[64] K. Morovec, R.W. Harvey, and J.A. Bangham, “Scale-Space Trees and Applications as Filters, for Stereo Vision and Image Retrieval,” Proc. British Machine Vision Conf., vol. 1, pp. 113-122, 1999.
[65] J.R. Movellan and G. Chadderdon, “Channel Separability in the Audio Visual Integration of Speech: A Bayesian Approach,” NATO ASI Series F: Computer and Systems Science, pp. 473-487, 1996.
[66] K.K. Neely, “Effect of Visual Factors on the Intelligibility of Speech,” J. Acoustical Soc. of America, vol. 28, no. 6, pp. 1275-1277, Nov. 1956.
[67] J.A. Nelder and R. Mead, “A Simplex Method for Function Minimization,” Computing J., vol. 7, no. 4, pp. 308-313, 1965.
[68] J.J. O'Neill, “Contributions of the Visual Components of Oral Symbols to Speech Comprehension,” J. Speech and Hearing Disorders, vol. 19, pp. 429-439, 1954.
[69] E. Petajan and H.P. Graf, “Robust Face Feature Analysis for Automatic Speechreading and Character Animation,” NATO ASI Series F: Computer and Systems Science, pp. 425-436, 1996.
[70] E.D. Petajan, “Automatic Lipreading to Enhance Speech Recognition,” PhD thesis, Univ. of Illinois, Urbana-Champaign, 1984.
[71] E.D. Petajan, B.J. Bischoff, D.A. Bodoff, and N.M. Brooke, “An Improved Automatic Lipreading System to Enhance Speech Recognition,” Technical Report TM 11251-871012-11, AT&T Bell Labs, Oct. 1987.
[72] I. Pitas and A.N. Venetsanopoulos, “Morphological Shape Decomposition,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 12, no. 1, pp. 38-45, Jan. 1990.
[73] G. Potamianos, F. Cosatto, H.P. Graf, and D.B. Roe, “Speaker Independent Audio-Visual Database for Bimodal ASR,” Proc. ESCA Workshop Audio-Visual Speech Processing, pp. 65-68, Sept. 1997.
[74] C.A. Poynton, A Technical Introduction to Digital Video. John Wiley & Sons, 1996.
[75] M.U.R. Sánchez, J. Matas, and J. Kittler, “Statistical Chromaticity-Based Lip Tracking with B-Splines,” Proc. Int'l Conf. Acoustics, Speech, and Signal Processing, Apr. 1997.
[76] J.-L. Schwartz, J. Robert-Ribes, and P. Escudier, “Ten Years after Summerfield: A Taxonomy of Models for Audio-Visual Fusion in Speech Perception,” Hearing by Eye II: Advances in the Psychology of Speechreading and Auditory-Visual Speech, pp. 85-108, 1998.
[77] “Motion-Based Recognition,” Computational Imaging and Vision, M. Shah and R. Jain, eds., vol. 9, Kluwer Academic, 1997.
[78] P.L. Silsbee, “Motion in Deformable Templates,” Proc. IEEE Int'l Conf. Image Processing, vol. 1, pp. 323-327, 1994.
[79] P.L. Silsbee, “Computer Lipreading for Improved Accuracy in Automatic Speech Recognition,” IEEE Trans. Speech and Audio Processing, vol. 4, no. 5, pp. 337-351, Sept. 1996.
[80] “Speechreading by Humans and Machines: Models, Systems, and Applications,” NATO ASI Series F: Computer and Systems Sciences, D.G. Stork and M.E. Hennecke, eds., vol. 150, 1996.
[81] W.H. Sumby and I. Pollack, “Visual Contribution to Speech Intelligibility in Noise,” J. Acoustical Soc. of Am., vol. 26, no. 2, pp. 212-215, Mar. 1954.
[82] Q. Summerfield, “Some Preliminaries to a Comprehensive Account of Audio-Visual Speech Perception,” Hearing by Eye: The Psychology of Lip-Reading, B. Dodd and R. Campbell, eds., pp. 3-51, 1987.
[83] M.J. Tomlinson, M.J. Russell, and N.M. Brooke, “Integrating Audio and Visual Information to Provide Highly Robust Speech Recognition,” Proc. Int'l Conf. Acoustics, Speech, and Signal Processing, vol. 2, pp. 821-824, May 1996.
[84] M. Vogt, “Interpreted Multi-State Lip Models for Audio-Visual Speech Recognition,” Proc. ESCA Workshop Audio-Visual Speech Processing, pp. 125-128, Sept. 1997.
[85] B.E. Walden, R.A. Prosek, A.A. Montgomery, C.K. Scherr, and C.J. Jones, “Effects of Training on the Visual Recognition of Consonants,” J. Speech and Hearing Research, vol. 20, pp. 130-145, 1977.
[86] A.P. Witkin, “Scale-Space Filtering,” Proc. Eighth Int'l Joint Conf. Artificial Intelligence, vol. 2, pp. 1019-1022, 1983.
[87] J. Yang, R. Stiefelhagen, U. Meier, and A. Waibel, “Real-Time Face and Facial Feature Tracking and Applications,” Proc. Workshop Auditory-Visual Speech Processing, D. Burnham, J. Robert-Ribes, and E. Vatikiotis-Bateson, eds., pp. 79-84, Dec. 1998.
[88] S. Young, J. Jansen, J. Odell, D. Ollason, and P. Woodland, The HTK Book. Cambridge Univ., 1996.
[89] B.P. Yuhas, M.H. Goldstein Jr., and T.J. Sejnowski, “Integration of Acoustic and Visual Speech Signals Using Neural Networks,” IEEE Comm. Magazine, vol. 27, pp. 65-71, 1989.
[90] A.L. Yuille, P.W. Hallinan, and D.S. Cohen, “Feature Extraction from Faces Using Deformable Templates,” Int'l J. Computer Vision, vol. 8, no. 2, pp. 133-144, 1992.

Index Terms:
Audio-visual speech recognition, statistical methods, active appearance model, sieve, connected-set morphology.
Citation:
Iain Matthews, Timothy F. Cootes, J. Andrew Bangham, Stephen Cox, Richard Harvey, "Extraction of Visual Features for Lipreading," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 2, pp. 198-213, Feb. 2002, doi:10.1109/34.982900