IEEE Transactions on Visualization and Computer Graphics, vol. 18, no. 11, pp. 1902-1914, Nov. 2012
B. H. Le, Xiaohan Ma, and Zhigang Deng, Dept. of Computer Science, University of Houston, Houston, TX, USA
This paper describes a fully automated framework that simultaneously generates realistic head motion, eye gaze, and eyelid motion from live (or recorded) speech input. Its central idea is to learn separate yet interrelated statistical models for each component (head motion, gaze, or eyelid motion) from a prerecorded facial motion data set: 1) Gaussian mixture models and a gradient-descent optimization algorithm are employed to generate head motion from speech features; 2) a nonlinear dynamic canonical correlation analysis model is used to synthesize eye gaze from head motion and speech features; and 3) nonnegative linear regression is used to model voluntary eyelid motion, while a log-normal distribution describes involuntary eye blinks. Several user studies, based on the well-established paired-comparison methodology, were conducted to evaluate the effectiveness of the proposed speech-driven head and eye motion generator. The evaluation results clearly show that this approach significantly outperforms state-of-the-art head and eye motion generation algorithms. In addition, a novel mocap+video hybrid data acquisition technique is introduced to simultaneously record high-fidelity head movement, eye gaze, and eyelid motion.
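To make the involuntary-blink component concrete, the sketch below draws blink onset times whose inter-blink intervals follow a log-normal distribution, as the abstract describes. The parameters `mu` and `sigma` here are illustrative placeholders, not the values the authors fit to their recorded data set:

```python
import random

def generate_blink_times(duration_s, mu=1.0, sigma=0.5, seed=42):
    """Sample involuntary eye-blink onset times over duration_s seconds.

    Inter-blink intervals are drawn from a log-normal distribution
    (hypothetical mu/sigma; the paper fits these to motion-capture data).
    """
    rng = random.Random(seed)
    times, t = [], 0.0
    while True:
        t += rng.lognormvariate(mu, sigma)  # next inter-blink interval
        if t >= duration_s:
            break
        times.append(t)
    return times

blinks = generate_blink_times(60.0)
```

With `mu = 1.0` the mean interval is roughly `exp(mu + sigma**2 / 2) ≈ 3.1` seconds, so a 60-second clip yields on the order of 15-20 blinks; in the paper these blinks are then blended with the voluntary eyelid motion predicted by the regression model.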
video signal processing, computer animation, data acquisition, eye, face recognition, Gaussian processes, gradient methods, image motion analysis, log-normal distribution, optimisation, realistic images, statistical analysis, facial animation, fully automated framework, realistic head motion generation, eye gaze generation, eyelid motion generation, live speech input, live speech driven head-and-eye motion generators, statistical models, facial motion data set, Gaussian mixture models, gradient descent optimization algorithm, speech features, nonlinear dynamic canonical correlation analysis model, eye gaze synthesis, nonnegative linear regression, voluntary eyelid motion model, mocap+video hybrid data acquisition technique, high-fidelity head movement recording, eye gaze recording, eyelid motion recording, speech, magnetic heads, hidden Markov models, humans, synchronization, live speech driven, head and eye motion coupling, head motion synthesis, gaze synthesis, blinking model
B. H. Le, Xiaohan Ma, and Zhigang Deng, "Live Speech Driven Head-and-Eye Motion Generators," IEEE Transactions on Visualization and Computer Graphics, vol. 18, no. 11, pp. 1902-1914, Nov. 2012, doi:10.1109/TVCG.2012.74