Issue No.08 - August (2008 vol.30)
pp: 1330-1345
We propose a new two-stage framework for the joint analysis of a speaker's head gesture and speech prosody patterns, aimed at automatic, realistic synthesis of head gestures from speech prosody. In the first stage, we perform Hidden Markov Model (HMM) based unsupervised temporal segmentation of the head gesture and speech prosody features separately, to determine the elementary head gesture patterns and the elementary prosody patterns of a particular speaker. In the second stage, we jointly analyze the correlations between these elementary gesture and prosody patterns using Multi-Stream HMMs to determine an audio-visual mapping model. The resulting mapping model is then employed to synthesize natural head gestures from arbitrary input test speech, given a head model for the speaker. In the synthesis stage, the audio-visual mapping model is used to predict a sequence of gesture patterns from the prosody pattern sequence computed for the input test speech. The Euler angles associated with each gesture pattern are then applied to animate the speaker's head model. Objective and subjective evaluations indicate that the proposed synthesis-by-analysis scheme provides natural-looking head gestures for the speaker with any input test speech, as well as in "prosody transplant" and "gesture transplant" scenarios.
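The final animation step, in which Euler angles drive the head model, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `euler_to_matrix` and `animate_head`, the axis assignment (rotations about x, y, and z), and the composition order R = Rz · Ry · Rx are all assumptions chosen for illustration, and the head model is reduced to a plain list of 3D vertices.

```python
import math

def euler_to_matrix(rx, ry, rz):
    """Build a 3x3 rotation matrix from Euler angles given in radians.

    Assumed convention (not taken from the paper): R = Rz(rz) * Ry(ry) * Rx(rx).
    """
    ca, sa = math.cos(rx), math.sin(rx)
    cb, sb = math.cos(ry), math.sin(ry)
    cc, sc = math.cos(rz), math.sin(rz)
    rot_x = [[1, 0, 0], [0, ca, -sa], [0, sa, ca]]
    rot_y = [[cb, 0, sb], [0, 1, 0], [-sb, 0, cb]]
    rot_z = [[cc, -sc, 0], [sc, cc, 0], [0, 0, 1]]

    def matmul(a, b):
        # Plain 3x3 matrix product.
        return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
                for i in range(3)]

    return matmul(rot_z, matmul(rot_y, rot_x))

def rotate(matrix, vertex):
    """Apply a 3x3 rotation matrix to one (x, y, z) vertex."""
    return tuple(sum(matrix[i][k] * vertex[k] for k in range(3))
                 for i in range(3))

def animate_head(vertices, angle_frames):
    """Return one rotated copy of the vertex list per frame.

    `vertices` is a hypothetical stand-in for the speaker's head model;
    `angle_frames` is a sequence of (rx, ry, rz) tuples, e.g. the Euler
    angles associated with a predicted gesture pattern.
    """
    frames = []
    for rx, ry, rz in angle_frames:
        matrix = euler_to_matrix(rx, ry, rz)
        frames.append([rotate(matrix, v) for v in vertices])
    return frames
```

Each frame is computed from the rest pose rather than from the previous frame, so the angles are treated as absolute head orientations; a system that stores incremental rotations would compose the matrices instead.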
Audio input/output, Face and gesture recognition, Pattern analysis
Mehmet E. Sargin, Yucel Yemez, Engin Erzin, Ahmet M. Tekalp, "Analysis of Head Gesture and Prosody Patterns for Prosody-Driven Head-Gesture Animation", IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 30, no. 8, pp. 1330-1345, August 2008, doi:10.1109/TPAMI.2007.70797