Synergy of Lip-Motion and Acoustic Features in Biometric Speech and Speaker Recognition
September 2007 (vol. 56, no. 9), pp. 1169-1175

Abstract—This paper presents the scheme and evaluation of a robust audio-visual digit-and-speaker-recognition system using lip motion and speech biometrics. In addition, a liveness-verification barrier based on a person's lip movement is added to the system to guard against advanced spoofing attempts such as replayed videos. The acoustic and visual features are integrated at the feature level and evaluated first by a Support Vector Machine for digit and speaker identification and then by a Gaussian Mixture Model for speaker verification. Based on approximately 300 different personal identities, this paper represents, to our knowledge, the first extensive study investigating the added value of lip-motion features for speaker- and speech-recognition applications. Digit-recognition and person-identification and verification experiments are conducted on the publicly available XM2VTS database, showing favorable results (speaker verification is 98 percent, speaker identification is 100 percent, and digit identification is 83 percent to 100 percent).
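The pipeline summarized in the abstract — feature-level fusion of acoustic and lip-motion features, then scoring against a speaker's Gaussian Mixture Model — can be illustrated with a minimal sketch. This is not the authors' implementation: the toy GMM parameters, feature dimensions, and threshold below are invented for illustration, and a real system would extract MFCCs and optical-flow-based lip features and train the GMM on enrollment data.

```python
import math

def fuse(acoustic, visual):
    """Feature-level fusion: concatenate the per-frame acoustic and
    visual (lip-motion) feature vectors into one fused vector."""
    return acoustic + visual

def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of one fused vector under a diagonal-covariance GMM,
    computed with log-sum-exp for numerical stability."""
    log_terms = []
    for w, mu, var in zip(weights, means, variances):
        ll = math.log(w)
        for xi, mi, vi in zip(x, mu, var):
            ll += -0.5 * (math.log(2 * math.pi * vi) + (xi - mi) ** 2 / vi)
        log_terms.append(ll)
    m = max(log_terms)
    return m + math.log(sum(math.exp(t - m) for t in log_terms))

# Toy 2-component GMM over 3-D fused vectors
# (2 "acoustic" dimensions + 1 "visual" dimension) -- illustrative values only.
weights = [0.6, 0.4]
means = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
variances = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]

x = fuse([0.1, -0.2], [0.05])
score = gmm_log_likelihood(x, weights, means, variances)
# Verification decision: accept the claimed identity if the score
# exceeds a threshold tuned on development data (the -5.0 here is arbitrary).
accept = score > -5.0
```

In practice the GMM score would be normalized against a background (impostor) model, as in Reynolds et al. [29], before thresholding.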

References:
[1] E. Bigun, J. Bigun, B. Duc, and S. Fischer, “Expert Conciliation for Multi Modal Person Authentication Systems by Bayesian Statistics,” Proc. First Int'l Conf. Audio- and Video-Based Person Authentication (AVBPA '97), J. Bigun, G. Chollet, and G. Borgefors, eds., pp. 291-300, 1997.
[2] J. Bigun, G. Granlund, and J. Wiklund, “Multidimensional Orientation Estimation with Applications to Texture Analysis of Optical Flow,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 13, no. 8, pp. 775-790, Aug. 1991.
[3] R. Brunelli and D. Falavigna, “Person Identification Using Multiple Cues,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 17, no. 10, pp. 955-966, Oct. 1995.
[4] C.J. Burges, “A Tutorial on Support Vector Machines for Pattern Recognition,” Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 121-167, 1998.
[5] C.-C. Chang and C.-J. Lin, “LIBSVM—A Library for Support Vector Machines,” http://www.csie.ntu.edu.tw/~cjlin/libsvm/, 2001.
[6] T. Chen, “Audiovisual Speech Processing,” IEEE Signal Processing Magazine, vol. 18, no. 1, pp. 9-21, 2001.
[7] C. Chibelushi, F. Deravi, and J. Mason, “A Review of Speech-Based Bimodal Recognition,” IEEE Trans. Multimedia, vol. 4, no. 1, pp. 23-37, 2002.
[8] P. Clarkson and P. Moreno, “On the Use of Support Vector Machines for Phonetic Classification,” Proc. IEEE Int'l Conf. Acoustics, Speech, and Signal Processing (ICASSP '99), vol. 2, pp. 585-588, 1999.
[9] S. Davis and P. Mermelstein, “Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences,” IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 28, no. 4, pp. 357-366, 1980.
[10] D. DeCarlo and D. Metaxas, “Optical Flow Constraints on Deformable Models with Applications to Face Tracking,” Int'l J. Computer Vision, vol. 38, no. 2, pp. 99-127, 2000.
[11] U. Dieckmann, P. Plankensteiner, and T. Wagner, “Acoustic-Labial Speaker Verification,” Proc. First Int'l Conf. Audio- and Video-Based Biometric Person Authentication (AVBPA '97), pp. 301-310, 1997.
[12] B. Duc, S. Fischer, and J. Bigun, “Face Authentication with Sparse Grid Gabor Information,” Proc. IEEE Int'l Conf. Acoustics, Speech, and Signal Processing (ICASSP '97), vol. 4, no. 21, pp. 3053-3056, 1997.
[13] M.I. Faraj and J. Bigun, “Person Verification by Lip-Motion,” Proc. Conf. Computer Vision and Pattern Recognition Workshop (CVPRW '06), pp. 37-45, 2006.
[14] K. Farrell, R. Mammone, and K. Assaleh, “Speaker Recognition Using Neural Networks and Conventional Classifiers,” IEEE Trans. Speech and Audio Processing, vol. 2, no. 1, pp. 194-205, 1994.
[15] I. Gavat, G. Costache, and C. Iancu, “Robust Speech Recognizer Using Multiclass SVM,” Proc. Seventh Seminar Neural Network Applications in Electrical Eng. (NEUREL '04), pp. 63-66, 2004.
[16] G.H. Granlund, “In Search of a General Picture Processing Operator,” Computer Graphics and Image Processing, vol. 8, no. 2, pp. 155-173, 1978.
[17] T.J. Hazen, “Visual Model Structures and Synchrony Constraints for Audio-Visual Speech Recognition,” IEEE Trans. Audio, Speech, and Language Processing, vol. 14, no. 3, pp. 1082-1089, 2006.
[18] B. Horn and B. Schunck, “Determining Optical Flow,” J. Artificial Intelligence, vol. 17, no. 1, pp. 185-203, 1981.
[19] P. Jourlin, J. Luettin, D. Genoud, and H. Wassner, “Acoustic-Labial Speaker Verification,” Proc. First Int'l Conf. Audio- and Video-Based Biometric Person Authentication (AVBPA '97), pp. 319-326, 1997.
[20] K. Kollreider, H. Fronthaler, and J. Bigun, “Evaluating Liveness by Face Images and the Structure Tensor,” Proc. Fourth IEEE Workshop Automatic Identification Advanced Technologies (AutoID '05), pp. 75-80, 2005.
[21] L. Liang, X. Liu, Y. Zhao, X. Pi, and A. Nefian, “Speaker Independent Audio-Visual Continuous Speech Recognition,” Proc. IEEE Int'l Conf. Multimedia and Expo (ICME '02), vol. 2, pp. 26-29, 2002.
[22] B.D. Lucas and T. Kanade, “An Iterative Image Registration Technique with an Application to Stereo Vision,” Proc. Int'l Joint Conf. Artificial Intelligence, pp. 674-679, 1981.
[23] S. Lucey, T. Chen, S. Sridharan, and V. Chandran, “Integration Strategies for Audio-Visual Speech Processing: Applied to Text-Dependent Speaker Recognition,” IEEE Trans. Multimedia, vol. 7, no. 3, pp. 495-506, 2005.
[24] J. Luettin and G. Maitre, “Evaluation Protocol for the Extended M2VTS Database (XM2VTSDB),” IDIAP Communication 98-054, Technical Report IDIAP-RR 21, IDIAP, 1998.
[25] J. Luettin and N. Thacker, “Speechreading Using Probabilistic Models,” Computer Vision and Image Understanding, vol. 65, no. 2, pp. 163-178, 1997.
[26] K. Mase and A. Pentland, “Automatic Lip-Reading by Optical-Flow Analysis,” Systems and Computers in Japan, vol. 22, no. 6, pp. 67-76, 1991.
[27] K. Messer, J. Matas, J. Kittler, and J. Luettin, “XM2VTSDB: The Extended M2VTS Database,” Proc. Second Int'l Conf. Audio- and Video-Based Biometric Person Authentication (AVBPA '99), pp. 72-77, 1999.
[28] E. Petajan, B. Bischoff, D. Bodoff, and N.M. Brooke, “An Improved Automatic Lipreading System to Enhance Speech Recognition,” Proc. SIGCHI Conf. Human Factors in Computing Systems (CHI '88), pp. 19-25, 1988.
[29] D. Reynolds, T. Quatieri, and R.B. Dunn, “Speaker Verification Using Adapted Gaussian Mixture Models,” Digital Signal Processing, vol. 10, nos. 1-3, pp. 19-41, 2000.
[30] D. Reynolds and R. Rose, “Robust Text-Independent Speaker Identification Using Gaussian Mixture Models,” IEEE Trans. Speech and Audio Processing, vol. 3, no. 1, pp. 72-83, 1995.
[31] M. Schmidt and H. Gish, “Speaker Identification via Support Vector Classifiers,” Proc. IEEE Int'l Conf. Acoustics, Speech, and Signal Processing (ICASSP '96), pp. 105-108, 1996.
[32] X. Tang and X. Li, “Fusion of Audio-Visual Information Integrated Speech Processing,” Proc. Third Int'l Conf. Audio- and Video-Based Biometric Person Authentication (AVBPA '01), pp. 127-143, 2001.
[33] X. Tang and X. Li, “Video Based Face Recognition Using Multiple Classifiers,” Proc. Sixth IEEE Int'l Conf. Automatic Face and Gesture Recognition (FGR '04), pp. 345-349, 2004.
[34] V.N. Vapnik, The Nature of Statistical Learning Theory. Springer, 1995.
[35] P. Varshney, “Multisensor Data Fusion,” Electronics and Comm. Eng. J., vol. 9, no. 6, pp. 245-253, 1997.
[36] V. Wan and W. Campbell, “Support Vector Machines for Speaker Verification and Identification,” Proc. IEEE Signal Processing Soc. Workshop Neural Networks for Signal Processing X, vol. 2, pp. 775-784, 2000.
[37] T. Wark, S. Sridharan, and V. Chandran, “The Use of Speech and Lip Modalities for Robust Speaker Verification under Adverse Conditions,” Proc. IEEE Int'l Conf. Multimedia Computing and Systems (ICMCS '99), vol. 1, 1999.
[38] L. Williams, “Performance-Driven Facial Animation,” Proc. SIGGRAPH '90, pp. 235-242, 1990.
[39] E. Yamamoto, S. Nakamura, and K. Shikano, “Lip Movement Synthesis from Speech Based on Hidden Markov Models,” Speech Comm., vol. 26, no. 1, pp. 105-115, 1998.
[40] S. Young, D. Kershaw, J. Odell, D. Ollason, V. Valtchev, and P. Woodland, The HTK Book (for HTK Version 3.0), http://htk.eng.cam.ac.uk/docs/docs.shtml, 2000.

Index Terms:
Speech recognition, speaker recognition, motion estimation, normal image flow, normal image velocity, lip reading, lip motion, GMM, SVM, biometrics
Citation:
Maycel-Isaac Faraj, Josef Bigun, "Synergy of Lip-Motion and Acoustic Features in Biometric Speech and Speaker Recognition," IEEE Transactions on Computers, vol. 56, no. 9, pp. 1169-1175, Sept. 2007, doi:10.1109/TC.2007.1074