Issue No. 11 - Nov. 2012 (vol. 18)
pp. 1915-1927
Xiaohan Ma, Dept. of Comput. Sci., Univ. of Houston, Houston, TX, USA
Zhigang Deng, Dept. of Comput. Sci., Univ. of Houston, Houston, TX, USA
ABSTRACT
In recent years, data-driven speech animation approaches have achieved significant success in terms of animation quality. However, automatically evaluating the realism of newly synthesized speech animations remains an important yet unsolved research problem. In this paper, we propose a novel statistical model (called SAQP) to automatically predict the quality of speech animations synthesized on the fly by various data-driven techniques. Its essential idea is to construct a phoneme-based Speech Animation Trajectory Fitting (SATF) metric to describe speech animation synthesis errors, and then to build a statistical regression model that learns the association between the obtained SATF metric and the objective speech animation synthesis quality. Through carefully designed user studies, we evaluate the effectiveness and robustness of the proposed SAQP model. To the best of our knowledge, this work is the first quantitative quality model for data-driven speech animation. We believe it is an important first step toward removing a critical technical barrier to applying data-driven speech animation techniques in numerous online or interactive talking avatar applications.
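The pipeline the abstract describes (compute a trajectory-fitting error for a synthesized clip, then regress perceived quality on that error) can be sketched compactly. Below is a minimal illustration in Python with NumPy and scikit-learn, not the authors' implementation: the `satf_error` stand-in for the SATF metric, the synthetic training data, and the choice of Gaussian process regression are all assumptions made for this example.

```python
# Hypothetical sketch of the SAQP pipeline described in the abstract:
# compute a trajectory-fitting error for a synthesized clip (a stand-in
# for the paper's SATF metric), then regress user-rated quality on it.
# All names, data, and the Gaussian-process choice are illustrative.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def satf_error(synthesized: np.ndarray, reference: np.ndarray) -> float:
    """RMS deviation between a synthesized speech-animation trajectory
    and a reference trajectory; both arrays are (frames, dofs)."""
    return float(np.sqrt(np.mean((synthesized - reference) ** 2)))

rng = np.random.default_rng(0)

# Training data: one fitting error per clip, paired with a 1-5 quality
# rating of the kind a user study would produce (synthetic numbers here).
errors = rng.uniform(0.0, 1.0, size=(50, 1))
ratings = 5.0 - 4.0 * errors.ravel() + rng.normal(0.0, 0.2, size=50)

# Fit the regression model mapping fitting error -> perceived quality.
model = GaussianProcessRegressor(kernel=RBF() + WhiteKernel())
model.fit(errors, ratings)

# Score a newly synthesized animation from its trajectories.
reference = rng.normal(size=(120, 30))                      # reference motion
synth = reference + rng.normal(0.0, 0.1, size=(120, 30))    # synthesized motion
quality, std = model.predict(
    np.array([[satf_error(synth, reference)]]), return_std=True)
print(f"predicted quality: {quality[0]:.2f} +/- {std[0]:.2f}")
```

Note that this sketch computes the error against a known reference trajectory, whereas the paper's SATF metric is phoneme-based and tied to the synthesis process itself; the example only mirrors the model's overall shape: fitting error in, predicted quality (with uncertainty) out.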
INDEX TERMS
speech synthesis, computer animation, regression analysis, speech processing, interactive talking avatar applications, statistical quality model, data-driven speech animation approach, animation quality, SAQP, novel statistical model, on-the-fly synthesized speech animations, data-driven techniques, speech animation trajectory fitting metric, SATF, statistical regression model, Animation, Speech, Trajectory, Measurement, Principal component analysis, Predictive models, Face, statistical models, Facial animation, data-driven, visual speech animation, lip-sync, quality prediction
CITATION
Xiaohan Ma and Zhigang Deng, "A Statistical Quality Model for Data-Driven Speech Animation," IEEE Transactions on Visualization & Computer Graphics, vol. 18, no. 11, pp. 1915-1927, Nov. 2012, doi:10.1109/TVCG.2012.67