Issue No.02 - April-June (2012 vol.3)
pp: 184-198
F. Eyben, Inst. for Human-Machine Commun., Tech. Univ. München, München, Germany
A. Katsamanis, Dept. of Electr. Eng., Univ. of Southern California, Los Angeles, CA, USA
M. Wöllmer, Inst. for Human-Machine Commun., Tech. Univ. München, München, Germany
A. Metallinou, Dept. of Electr. Eng., Univ. of Southern California, Los Angeles, CA, USA
B. Schuller, Inst. for Human-Machine Commun., Tech. Univ. München, München, Germany
S. Narayanan, Dept. of Electr. Eng., Univ. of Southern California, Los Angeles, CA, USA
Human emotional expression tends to evolve in a structured manner, in the sense that certain emotional evolution patterns, e.g., anger to anger, are more probable than others, e.g., anger to happiness. Furthermore, the perception of an emotional display can be affected by recent emotional displays. Therefore, the emotional content of past and future observations could offer relevant temporal context when classifying the emotional content of an observation. In this work, we focus on audio-visual recognition of the emotional content of improvised emotional interactions at the utterance level. We examine context-sensitive schemes for emotion recognition within a multimodal, hierarchical approach: bidirectional Long Short-Term Memory (BLSTM) neural networks, hierarchical Hidden Markov Model (HMM) classifiers, and hybrid HMM/BLSTM classifiers are considered for modeling emotion evolution within an utterance and between utterances over the course of a dialog. Overall, our experimental results indicate that incorporating long-term temporal context is beneficial for emotion recognition systems that encounter a variety of emotional manifestations. Context-sensitive approaches outperform those without context for classification tasks such as discrimination between valence levels or between clusters in the valence-activation space. The analysis of emotional transitions in our database sheds light on the flow of affective expressions, revealing potentially useful patterns.
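The notion that some emotional transitions are more probable than others can be illustrated with a first-order transition estimate over utterance-level labels. The following is a minimal sketch only: the function name `transition_probabilities`, the label names, and the toy dialogs are assumptions made for illustration, and this is not the analysis procedure or the data used in the paper.

```python
from collections import Counter, defaultdict


def transition_probabilities(dialogs):
    """Estimate P(next label | current label) from per-dialog label sequences."""
    counts = defaultdict(Counter)
    for labels in dialogs:
        # Count each adjacent pair of utterance labels within one dialog.
        for current, following in zip(labels, labels[1:]):
            counts[current][following] += 1
    probs = {}
    for current, following_counts in counts.items():
        total = sum(following_counts.values())
        # Normalize the counts for each starting label into relative frequencies.
        probs[current] = {label: n / total for label, n in following_counts.items()}
    return probs


# Toy label sequences invented for illustration (hypothetical, not from the paper's database).
toy_dialogs = [
    ["anger", "anger", "anger", "neutral"],
    ["happiness", "happiness", "neutral", "anger", "anger"],
]
probs = transition_probabilities(toy_dialogs)
print(probs["anger"])
```

In this toy data, self-transitions such as anger to anger dominate, while anger to happiness never occurs. A context-sensitive classifier could exploit this kind of regularity across neighboring utterances.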
neural nets, emotion recognition, hidden Markov models, learning (artificial intelligence), valence-activation space, context-sensitive learning, audiovisual emotion classification, human emotional expression, emotional display, improvised emotional interactions, bidirectional long short-term memory neural networks, hierarchical hidden Markov model classifiers, hybrid HMM/BLSTM classifiers, context, Viterbi algorithm, context modeling, logic gates, recurrent neural networks, emotional grammars, audio-visual emotion recognition, temporal context, bidirectional long short-term memory
F. Eyben, A. Katsamanis, M. Wöllmer, A. Metallinou, B. Schuller, S. Narayanan, "Context-Sensitive Learning for Enhanced Audiovisual Emotion Classification", IEEE Transactions on Affective Computing, vol. 3, no. 2, pp. 184-198, April-June 2012, doi:10.1109/T-AFFC.2011.40