Issue No. 4 - October-December 2011 (vol. 2)
pp. 206-218
Johannes Wagner , University of Augsburg, Augsburg
Florian Lingenfelser , University of Augsburg, Augsburg
Elisabeth André , University of Augsburg, Augsburg
Jonghwa Kim , University of Augsburg, Augsburg
Thurid Vogt , University of Augsburg, Augsburg
ABSTRACT
This study aims at the development of a multimodal, ensemble-based system for emotion recognition. Special attention is given to a problem that is often neglected: missing data in one or more modalities. In offline evaluation, the issue can easily be solved by excluding those parts of the corpus in which one or more channels are corrupted or otherwise unsuitable for evaluation. In real applications, however, missing data cannot be ignored and must be handled adequately. We therefore do not assume in our experiments that the examined data are completely available at all times. The presented system addresses the problem at the multimodal fusion stage: various ensemble techniques, covering established methods as well as rather novel emotion-specific approaches, are explained and enriched with strategies for compensating for temporarily unavailable modalities. We compare and discuss the advantages and drawbacks of the fusion categories and carry out an extensive evaluation of the mentioned techniques on the CALLAS Expressivity Corpus, which features facial, vocal, and gestural modalities.
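To illustrate the general idea of decision-level fusion with temporarily unavailable modalities, the following minimal sketch averages the class-probability vectors of whichever channels are present and skips missing ones. It is an illustrative assumption, not the paper's method: the class labels, modality names, and the unweighted mean rule are hypothetical placeholders.

```python
import numpy as np

# Hypothetical illustration (not from the paper): decision-level fusion by
# averaging per-modality class probabilities, ignoring missing modalities.
CLASSES = ["positive", "negative"]  # assumed label set

def fuse_decisions(modality_probs):
    """Fuse per-modality class probabilities, skipping unavailable channels.

    modality_probs: dict mapping a modality name (e.g. 'face', 'voice',
    'gesture') to a probability vector over CLASSES, or None if the
    channel is temporarily unavailable.
    """
    available = [np.asarray(p, dtype=float)
                 for p in modality_probs.values() if p is not None]
    if not available:
        raise ValueError("no modality delivered a decision")
    fused = np.mean(available, axis=0)  # simple mean rule over available channels
    return CLASSES[int(np.argmax(fused))], fused

# Example: the gesture channel drops out, fusion proceeds on face and voice.
label, scores = fuse_decisions({
    "face":    [0.7, 0.3],
    "voice":   [0.4, 0.6],
    "gesture": None,
})
print(label, scores)
```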
INDEX TERMS
Ensemble-based systems, decision-level fusion, multimodal emotion recognition, missing data.
CITATION
Johannes Wagner, Florian Lingenfelser, Elisabeth André, Jonghwa Kim, Thurid Vogt, "Exploring Fusion Methods for Multimodal Emotion Recognition with Missing Data", IEEE Transactions on Affective Computing, vol. 2, no. 4, pp. 206-218, October-December 2011, doi:10.1109/T-AFFC.2011.12