Accurate Visible Speech Synthesis Based on Concatenating Variable Length Motion Capture Data
March/April 2006 (vol. 12 no. 2)
pp. 266-276

Abstract: We present a novel approach to synthesizing accurate visible speech based on searching and concatenating optimal variable-length units in a large corpus of motion capture data. Based on a set of visual prototypes selected on a source face and a corresponding set designated for a target face, we propose a machine learning technique to automatically map the facial motions observed on the source face to the target face. In order to model the long distance coarticulation effects in visible speech, a large-scale corpus that covers the most common syllables in English was collected, annotated and analyzed. For any input text, a search algorithm to locate the optimal sequences of concatenated units for synthesis is described. A new algorithm to adapt lip motions from a generic 3D face model to a specific 3D face model is also proposed. A complete, end-to-end visible speech animation system is implemented based on the approach. This system is currently used in more than 60 kindergarten through third grade classrooms to teach students to read using a lifelike conversational animated agent. To evaluate the quality of the visible speech produced by the animation system, both subjective evaluation and objective evaluation are conducted. The evaluation results show that the proposed approach is accurate and powerful for visible speech synthesis.
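
The abstract does not spell out the search algorithm, so the following is only an illustrative sketch of a generic Viterbi-style unit-selection search, not the authors' method. The function name `select_units`, the cost callbacks, and the `(utterance, index)` unit representation are all assumptions made for this example. The join cost here is zero when two units are consecutive in the same recorded utterance, which is one simple way a per-unit search can end up favoring longer contiguous (variable-length) stretches of captured motion.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

U = TypeVar("U")  # a corpus unit; hypothetical representation chosen by the caller


def select_units(
    candidates: Sequence[Sequence[U]],
    target_cost: Callable[[int, U], float],
    join_cost: Callable[[U, U], float],
) -> Tuple[float, List[U]]:
    """Pick one candidate per target position, minimizing summed target and join costs."""
    if not candidates or any(len(c) == 0 for c in candidates):
        raise ValueError("every target position needs at least one candidate")

    # best[j]: cheapest cost of any path ending at candidate j of the current position.
    best = [target_cost(0, u) for u in candidates[0]]
    # back[i - 1][j]: index of the best predecessor (at position i - 1) of candidate j at i.
    back: List[List[int]] = []

    for i in range(1, len(candidates)):
        new_best, pointers = [], []
        for u in candidates[i]:
            costs = [best[k] + join_cost(p, u) for k, p in enumerate(candidates[i - 1])]
            k_min = min(range(len(costs)), key=costs.__getitem__)
            new_best.append(costs[k_min] + target_cost(i, u))
            pointers.append(k_min)
        best = new_best
        back.append(pointers)

    # Trace the cheapest path back from the last position to the first.
    j = min(range(len(best)), key=best.__getitem__)
    total = best[j]
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return total, [candidates[i][j] for i, j in enumerate(path)]


def contiguity_join_cost(a: Tuple[str, int], b: Tuple[str, int]) -> float:
    """Assumed join cost: free if b directly follows a in the same utterance, else 1."""
    return 0.0 if a[0] == b[0] and b[1] == a[1] + 1 else 1.0


# Toy corpus: each unit is (utterance id, position within that utterance).
toy_candidates = [
    [("utt1", 3), ("utt2", 0)],
    [("utt1", 4), ("utt3", 7)],
    [("utt3", 8), ("utt1", 5)],
]
toy_cost, toy_units = select_units(
    toy_candidates, lambda i, u: 0.0, contiguity_join_cost
)
```

In this toy case the search selects the three consecutive units from `utt1`, since that path needs no joins. A real system would replace the constant target cost and the binary join cost with measures derived from phonetic context and motion continuity.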

Index Terms:
Face animation, character animation, visual speech, visible speech, coarticulation effect, virtual human.
Jiyong Ma, Ron Cole, Bryan Pellom, Wayne Ward, Barbara Wise, "Accurate Visible Speech Synthesis Based on Concatenating Variable Length Motion Capture Data," IEEE Transactions on Visualization and Computer Graphics, vol. 12, no. 2, pp. 266-276, March-April 2006, doi:10.1109/TVCG.2006.18