Expressive Facial Animation Synthesis by Learning Speech Coarticulation and Expression Spaces
November/December 2006 (vol. 12 no. 6)
pp. 1523-1534
Zhigang Deng, IEEE Computer Society
Ulrich Neumann, IEEE Computer Society
J.P. Lewis, IEEE Computer Society
Tae-Yong Kim
Murtaza Bulut
Shrikanth Narayanan

Abstract—Synthesizing expressive facial animation is a very challenging topic within the graphics community. In this paper, we present an expressive facial animation synthesis system enabled by automated learning from facial motion capture data. Accurate 3D motions of the markers on the face of a human subject are captured while he/she recites a predesigned corpus with specific spoken and visual expressions. We present a novel motion capture mining technique that "learns" speech coarticulation models for diphones and triphones from the recorded data. A Phoneme-Independent Expression Eigenspace (PIEES) that encloses the dynamic expression signals is constructed by motion signal processing (phoneme-based time-warping and subtraction) and Principal Component Analysis (PCA) reduction. New expressive facial animations are synthesized as follows: first, the learned coarticulation models are concatenated to synthesize neutral visual speech according to novel speech input; then, a texture-synthesis-based approach generates a novel dynamic expression signal from the PIEES model; finally, the synthesized expression signal is blended with the synthesized neutral visual speech to create the final expressive facial animation. Our experiments demonstrate that the system can effectively synthesize realistic expressive facial animation.
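
The PIEES construction described in the abstract lends itself to a short illustration. The Python below is a minimal sketch, not the authors' implementation: it assumes phoneme-aligned expressive/neutral clip pairs are already available, substitutes uniform linear resampling for the paper's phoneme-based time warping, and uses plain SVD-based PCA; all function and variable names are illustrative.

```python
import numpy as np

def resample(motion: np.ndarray, n_frames: int) -> np.ndarray:
    """Linearly resample a (frames x dims) motion clip to n_frames.
    A simplifying stand-in for the paper's phoneme-based time warping."""
    src = np.linspace(0.0, 1.0, len(motion))
    dst = np.linspace(0.0, 1.0, n_frames)
    return np.stack([np.interp(dst, src, motion[:, d])
                     for d in range(motion.shape[1])], axis=1)

def expression_residuals(expressive_clips, neutral_clips):
    """Warp each expressive clip onto its neutral counterpart and subtract,
    isolating the phoneme-independent dynamic expression signal."""
    residuals = [resample(exp, len(neu)) - neu
                 for exp, neu in zip(expressive_clips, neutral_clips)]
    return np.vstack(residuals)  # stack residual frames from all clips

def build_expression_eigenspace(residuals: np.ndarray, n_components: int = 5):
    """PCA reduction: mean frame plus the top principal directions."""
    mean = residuals.mean(axis=0)
    _, _, vt = np.linalg.svd(residuals - mean, full_matrices=False)
    return mean, vt[:n_components]

def project(frame: np.ndarray, mean: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Coordinates of one motion frame in the low-dimensional expression space."""
    return basis @ (frame - mean)

if __name__ == "__main__":
    # Toy data: 3 clip pairs, 2 markers x 3D = 6 motion dimensions per frame.
    rng = np.random.default_rng(0)
    neutral = [rng.standard_normal((40, 6)) for _ in range(3)]
    expressive = [rng.standard_normal((55, 6)) for _ in range(3)]
    mean, basis = build_expression_eigenspace(
        expression_residuals(expressive, neutral), n_components=3)
    print(project(expressive[0][0], mean, basis))
```

In the pipeline the abstract describes, novel dynamic expression signals are then generated from this eigenspace by a texture-synthesis-style sampling step and blended onto the concatenatively synthesized neutral visual speech.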


Index Terms:
Facial animation, expressive speech, animation synthesis, speech coarticulation, texture synthesis, motion capture, data-driven.
Citation:
Zhigang Deng, Ulrich Neumann, J.P. Lewis, Tae-Yong Kim, Murtaza Bulut, Shrikanth Narayanan, "Expressive Facial Animation Synthesis by Learning Speech Coarticulation and Expression Spaces," IEEE Transactions on Visualization and Computer Graphics, vol. 12, no. 6, pp. 1523-1534, Nov.-Dec. 2006, doi:10.1109/TVCG.2006.90