Issue No. 4, October-December 2011 (vol. 2)
pp. 192-205
Björn Schuller, Technische Universität München, München
ABSTRACT
Most research efforts dealing with the recognition of emotion-related states from the human speech signal concentrate on acoustic analysis. However, the last decade's results show that the task cannot be solved to complete satisfaction by acoustics alone, especially for real-life speech data and in particular for the assessment of speakers' valence. This paper therefore investigates novel approaches to the additional exploitation of linguistic information. To ensure good applicability to the real world, spontaneous speech and non-acted, non-prototypical emotions are examined in the recently popular dimensional model spanning a 3D continuous space. As linguistic analysis approaches and experiments for this model are lacking, various methods are proposed. Best results are obtained with the described bag-of-n-grams and bag-of-character-n-grams approaches, introduced for the first time for this task, which allow for an advanced vector space representation of the spoken content. Furthermore, string kernels are considered. By early fusion and combined space optimization of the proposed linguistic features with acoustic ones, the regression of continuous emotion primitives outperforms reported benchmark results on the VAM corpus of highly emotional face-to-face communication.
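To make the core idea concrete, the following is a minimal, illustrative sketch, not the paper's exact pipeline or feature set: a bag-of-character-n-grams vector space is built over utterance transcripts, fused early (by feature concatenation) with acoustic features, and used for support vector regression of one continuous emotion primitive such as valence. scikit-learn and the toy transcripts, acoustic values, and labels below are assumptions for illustration only; the paper itself works on the VAM corpus with a large, jointly optimized feature space.

# Sketch only: bag-of-character-n-grams + early fusion + SVR (assumed scikit-learn toolkit).
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVR

# Hypothetical training data: one transcript, one acoustic feature vector
# (e.g., energy and pitch statistics), and one continuous valence label per utterance.
transcripts = ["well that is just great", "leave me alone", "thank you so much"]
acoustic = np.array([[0.21, 5.3], [0.47, 2.1], [0.18, 6.0]])
valence = np.array([0.4, -0.6, 0.7])

# Vector space model over character 2- to 4-grams within word boundaries.
vectorizer = CountVectorizer(analyzer="char_wb", ngram_range=(2, 4))
X_text = vectorizer.fit_transform(transcripts)

# Early fusion: concatenate linguistic and acoustic features into one space
# (in a realistic setup, the acoustic features would also be normalized).
X_fused = hstack([X_text, csr_matrix(acoustic)])

# Support vector regression of the continuous emotion primitive.
model = SVR(kernel="linear", C=1.0)
model.fit(X_fused, valence)

# Predict valence for a new utterance.
X_new = hstack([vectorizer.transform(["that was wonderful"]),
                csr_matrix(np.array([[0.25, 5.8]]))])
print(model.predict(X_new))

In the same spirit, a string (subsequence) kernel as discussed in the paper could replace the explicit n-gram expansion, for instance by passing a precomputed kernel matrix to the regressor.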
INDEX TERMS
Affective computing, speech emotion recognition, sentiment analysis, support vector regression, string kernels.
CITATION
Björn Schuller, "Recognizing Affect from Linguistic Information in 3D Continuous Space", IEEE Transactions on Affective Computing, vol.2, no. 4, pp. 192-205, October-December 2011, doi:10.1109/T-AFFC.2011.17
REFERENCES
[1] R. Picard, Affective Computing. MIT Press, 1997.
[2] R. Cowie, E. Douglas-Cowie, N. Tsapatsoulis, G. Votsis, S. Kollias, W. Fellenz, and J. Taylor, “Emotion Recognition in Human-Computer Interaction,” IEEE Signal Processing Magazine, vol. 18, no. 1, pp. 32-80, Jan. 2001.
[3] E. Shriberg, “Spontaneous Speech: How People Really Talk and Why Engineers Should Care,” Proc. Conf. Int'l Speech Comm. Assoc., pp. 1781-1784, 2005.
[4] B. Schuller, “Automatic Recognition of Emotion from Speech and Manual Interaction,” PhD dissertation, Technische Universität München, 2006.
[5] B. Schuller, C. Hage, D. Schuller, and G. Rigoll, “‘Mister D.J., Cheer Me Up!’: Musical and Textual Features for Automatic Mood Classification,” J. New Music Research, vol. 39, no. 1, pp. 13-34, 2010.
[6] M. Schröder, R. Cowie, D. Heylen, M. Pantic, C. Pelachaud, and B. Schuller, “Towards Responsive Sensitive Artificial Listeners,” Proc. Fourth Int'l Workshop Human-Computer Conversation, 2008.
[7] H. Gunes, B. Schuller, M. Pantic, and R. Cowie, “Emotion Representation, Analysis and Synthesis in Continuous Space: A Survey,” Proc. Int'l Workshop Emotion Synthesis, rePresentation, and Analysis in Continuous spacE, pp. 827-834, 2011.
[8] S. Arunachalam, D. Gould, E. Anderson, D. Byrd, and S. Narayanan, “Politeness and Frustration Language in Child-Machine Interactions,” Proc. Eurospeech, pp. 2675-2678, 2001.
[9] Z.-J. Chuang and C.-H. Wu, “Emotion Recognition Using Acoustic Features and Textual Content,” Proc. IEEE Int'l Conf. Multimedia and Expo, pp. 53-56, 2004.
[10] K. Dupuis and K. Pichora-Fuller, “Use of Lexical and Affective Prosodic Cues to Emotion by Younger and Older Adults,” Proc. Eighth Ann. Conf. Int'l Speech Comm. Assoc., pp. 2237-2240, 2007.
[11] T. Danisman and A. Alpkocak, “Feeler: Emotion Classification of Text Using Vector Space Model,” Proc. Artificial Intelligence and the Simulation of Behaviour Convention Comm., 2008.
[12] T. Polzehl, S. Sundaram, H. Ketabdar, M. Wagner, and F. Metze, “Emotion Classification in Children's Speech Using Fusion of Acoustic and Linguistic Features,” Proc. 10th Ann. Conf. Int'l Speech Comm. Assoc., pp. 340-343, 2009.
[13] B. Schuller, J. Schenk, G. Rigoll, and T. Knaup, “The ‘Godfather’ vs. ‘Chaos’: Comparing Linguistic Analysis Based on Online Knowledge Sources and Bags-of-N-Grams for Movie Review Valence Estimation,” Proc. 10th Int'l Conf. Document Analysis and Recognition, pp. 858-862, 2009.
[14] C. Elliott, “The Affective Reasoner: A Process Model of Emotions in a Multi-Agent System,” PhD dissertation, Northwestern Univ., 1992.
[15] R. Cowie, E. Douglas-Cowie, B. Apolloni, J. Taylor, A. Romano, and W. Fellenz, “What a Neural Net Needs to Know about Emotion Words,” J. Computational Intelligence and Applications, pp. 109-114, 1999.
[16] F. de Rosis, A. Batliner, N. Novielli, and S. Steidl, “‘You Are Sooo Cool, Valentina!’ Recognizing Social Attitude in Speech-Based Dialogues with an ECA,” Affective Computing and Intelligent Interaction, A. Paiva, R. Prada, and R.W. Picard, eds., pp. 179-190, Springer, 2007.
[17] D. Litman and K. Forbes, “Recognizing Emotions from Student Speech in Tutoring Dialogues,” Proc. IEEE Workshop Automatic Speech Recognition and Understanding, pp. 25-30, 2003.
[18] X. Zhe and A. Boucouvalas, “Text-to-Emotion Engine for Real Time Internet Communication,” Proc. Int'l Symp. Comm. Systems, Networks, and Digital Signal Processing, pp. 164-168, 2002.
[19] B. Goertzel, K. Silverman, C. Hartley, S. Bugaj, and M. Ross, “The Baby Webmind Project,” Proc. Ann. Conf. Soc. for the Study of Artificial Intelligence and the Simulation of Behaviour, 2000.
[20] T. Wu, F. Khan, T. Fisher, L. Shuler, and W. Pottenger, “Posting Act Tagging Using Transformation-Based Learning,” Foundations of Data Mining and Knowledge Discovery, T.Y. Lin, S. Ohsuga, C.-J. Liau, X. Hu, and S. Tsumoto, eds., pp. 319-331, Springer, 2005.
[21] H. Liu, H. Lieberman, and T. Selker, “A Model of Textual Affect Sensing Using Real-World Knowledge,” Proc. Seventh Int'l Conf. Intelligent User Interfaces, pp. 125-132, 2003.
[22] B. Schuller, G. Rigoll, and M. Lang, “Speech Emotion Recognition Combining Acoustic Features and Linguistic Information in a Hybrid Support Vector Machine-Belief Network Architecture,” Proc. IEEE Int'l Conf. Acoustics, Speech, and Signal Processing, pp. 577-580, 2004.
[23] J. Breese and G. Ball, “Modeling Emotional State and Personality for Conversational Agents,” Technical Report MS-TR-98-41, Microsoft, 1998.
[24] G. Rigoll, R. Müller, and B. Schuller, “Speech Emotion Recognition Exploiting Acoustic and Linguistic Information Sources,” Proc. 10th Int'l Conf. Speech and Computer, vol. 1, pp. 61-67, 2005.
[25] S. Steidl, C. Ruff, A. Batliner, E. Nöth, and J. Haas, “Looking at the Last Two Turns, I'd Say This Dialogue Is Doomed—Measuring Dialogue Success,” Proc. Seventh Int'l Conf. Text, Speech, and Dialogue, pp. 629-636, 2004.
[26] A. Batliner, K. Fischer, R. Huber, J. Spilker, and E. Nöth, “How to Find Trouble in Communication,” Speech Comm., vol. 40, pp. 117-143, 2003.
[27] H. Ai, D. Litman, K. Forbes-Riley, M. Rotaru, J. Tetreault, and A. Purandare, “Using System and User Performance Features to Improve Emotion Detection in Spoken Tutoring Dialogs,” Proc. Ann. Conf. Int'l Speech Comm. Assoc., pp. 797-800, 2006.
[28] T.S. Polzin and A. Waibel, “Emotion-Sensitive Human-Computer Interfaces,” Proc. ISCA Workshop Speech and Emotion, pp. 201-206, 2000.
[29] J. Ang, R. Dhillon, E. Shriberg, and A. Stolcke, “Prosody-Based Automatic Detection of Annoyance and Frustration in Human-Computer Dialog,” Proc. Ann. Conf. Int'l Speech Comm. Assoc., pp. 2037-2040, 2002.
[30] C.M. Lee, S.S. Narayanan, and R. Pieraccini, “Combining Acoustic and Language Information for Emotion Recognition,” Proc. Conf. Int'l Speech Comm. Assoc., pp. 873-876, 2002.
[31] L. Devillers, I. Vasilescu, and L. Lamel, “Emotion Detection in Task-Oriented Spoken Dialogs,” Proc. Int'l Conf. Multimedia and Expo, pp. 549-552, 2003.
[32] B. Schuller, R. Müller, M. Lang, and G. Rigoll, “Speaker Independent Emotion Recognition by Early Fusion of Acoustic and Linguistic Features within Ensembles,” Proc. Conf. Int'l Speech Comm. Assoc., pp. 805-808, 2005.
[33] A. Batliner, S. Steidl, B. Schuller, D. Seppi, K. Laskowski, T. Vogt, L. Devillers, L. Vidrascu, N. Amir, L. Kessous, and V. Aharonson, “Combining Efforts for Improving Automatic Classification of Emotional User States,” Proc. First Int'l Language Technologies Conf., pp. 240-245, 2006.
[34] T. Joachims, “Text Categorization with Support Vector Machines: Learning with Many Relevant Features,” Proc. 10th European Conf. Machine Learning, pp. 137-142, 1998.
[35] B. Pang, L. Lee, and S. Vaithyanathan, “Thumbs Up? Sentiment Classification Using Machine Learning Techniques,” Proc. Conf. Empirical Methods in Natural Language Processing, pp. 79-86, 2002.
[36] B. Schuller, N. Köhler, R. Müller, and G. Rigoll, “Recognition of Interest in Human Conversational Speech,” Proc. Conf. Int'l Speech Comm. Assoc., pp. 793-796, 2006.
[37] B. Schuller, A. Batliner, S. Steidl, and D. Seppi, “Emotion Recognition from Speech: Putting ASR in the Loop,” Proc. IEEE Int'l Conf. Acoustics, Speech, and Signal Processing, pp. 4585-4588, 2009.
[38] M. Grimm, K. Kroschel, and S. Narayanan, “The Vera am Mittag German Audio-Visual Emotional Speech Database,” Proc. IEEE Int'l Conf. Multimedia and Expo, pp. 865-868, 2008.
[39] B. Schuller, S. Steidl, A. Batliner, F. Burkhardt, L. Devillers, C. Müller, and S. Narayanan, “The INTERSPEECH 2010 Paralinguistic Challenge,” Proc. 11th Ann. Conf. Int'l Speech Comm. Assoc., pp. 2794-2797, 2010.
[40] J. Jeon, R. Xia, and Y. Liu, “Level of Interest Sensing in Spoken Dialog Using Multi-Level Fusion of Acoustic and Lexical Evidence,” Proc. 11th Ann. Conf. Int'l Speech Comm. Assoc., pp. 2802-2805, 2010.
[41] E. Douglas-Cowie, N. Campbell, R. Cowie, and P. Roach, “Emotional Speech: Towards a New Generation of Databases,” Speech Comm., vol. 40, nos. 1/2, pp. 33-60, 2003.
[42] D. Ververidis and C. Kotropoulos, “A State of the Art Review on Emotional Speech Databases,” Proc. First Richmedia Conf., pp. 109-119, 2003.
[43] B. Schuller, B. Vlasenko, F. Eyben, G. Rigoll, and A. Wendemuth, “Acoustic Emotion Recognition: A Benchmark Comparison of Performances,” Proc. IEEE Workshop Automatic Speech Recognition and Understanding, pp. 552-557, 2009.
[44] A. Batliner, B. Schuller, S. Schaeffler, and S. Steidl, “Mothers, Adults, Children, Pets—Towards the Acoustics of Intimacy,” Proc. IEEE Int'l Conf. Acoustics, Speech, and Signal Processing, pp. 4497-4500, 2008.
[45] M. Wöllmer, F. Eyben, S. Reiter, B. Schuller, C. Cox, E. Douglas-Cowie, and R. Cowie, “Abandoning Emotion Classes—Towards Continuous Emotion Recognition with Modelling of Long-Range Dependencies,” Proc. Ninth Ann. Conf. Int'l Speech Comm. Assoc., pp. 597-600, 2008.
[46] M. Grimm and K. Kroschel, “Rule-Based Emotion Classification Using Acoustic Features,” Proc. Third Int'l Conf. Telemedicine and Multimedia Comm., p. 56, 2005.
[47] M. Grimm, K. Kroschel, H. Harris, C. Nass, B. Schuller, G. Rigoll, and T. Moosmayr, “On the Necessity and Feasibility of Detecting a Driver's Emotional State while Driving,” Proc. Second Int'l Conf. Affective Computing and Intelligent Interaction, pp. 126-138, 2007.
[48] M. Grimm, K. Kroschel, B. Schuller, G. Rigoll, and T. Moosmayr, “Acoustic Emotion Recognition in Car Environment Using a 3D Emotion Space Approach,” Proc. DAGA, pp. 313-314, 2007.
[49] R. Kehrein, “The Prosody of Authentic Emotions,” Proc. Speech Prosody, pp. 423-426, 2002.
[50] M. Grimm, K. Kroschel, E. Mower, and S. Narayanan, “Primitives-Based Evaluation and Estimation of Emotions in Speech,” Speech Comm., vol. 49, nos. 10/11, pp. 787-800, 2007.
[51] M. Grimm, K. Kroschel, and S. Narayanan, “Support Vector Regression for Automatic Recognition of Spontaneous Emotions in Speech,” Proc. IEEE Int'l Conf. Acoustics, Speech, and Signal Processing, 2007.
[52] M.A. Hall, “Correlation-Based Feature Selection for Machine Learning,” PhD dissertation, Univ. of Waikato, 1999.
[53] I.H. Witten and E. Frank, Data Mining—Practical Machine Learning Tools and Techniques, second ed. Morgan Kaufmann, 2005.
[54] D. Ververidis and C. Kotropoulos, “Fast Sequential Floating Forward Selection Applied to Emotional Speech Features Estimated on DES and SUSAS Data Collection,” Proc. European Signal Processing Conf., 2006.
[55] A.J. Smola and B. Schoelkopf, “A Tutorial on Support Vector Regression,” Statistics and Computing, vol. 14, no. 3, pp. 199-222, 2004.
[56] M. You, C. Chen, J. Bu, J. Liu, and J. Tao, “Emotion Recognition from Noisy Speech,” Proc. IEEE Int'l Conf. Multimedia and Expo, pp. 1653-1656, 2006.
[57] B. Schuller, S. Steidl, and A. Batliner, “The INTERSPEECH 2009 Emotion Challenge,” Proc. 10th Ann. Conf. Int'l Speech Comm. Assoc., pp. 312-315, 2009.
[58] A. Batliner, S. Steidl, B. Schuller, D. Seppi, T. Vogt, J. Wagner, L. Devillers, L. Vidrascu, V. Aharonson, and N. Amir, “Whodunnit—Searching for the Most Important Feature Types Signalling Emotional User States in Speech,” Computer Speech and Language, vol. 25, pp. 4-28, 2011.
[59] B. Schuller, B. Vlasenko, R. Minguez, G. Rigoll, and A. Wendemuth, “Comparing One and Two-Stage Acoustic Modeling in the Recognition of Emotion in Speech,” Proc. IEEE Workshop Automatic Speech Recognition and Understanding, pp. 596-600, 2007.
[60] S.K. Shevade, S.S. Keerthi, C. Bhattacharyya, and K.R.K. Murthy, “Improvements to the SMO Algorithm for SVM Regression,” IEEE Trans. Neural Networks, vol. 11, no. 5, pp. 1188-1193, Sept. 2000.
[61] H. Lodhi, J. Shawe-Taylor, N. Cristianini, and C. Watkins, “Text Classification Using String Kernels,” J. Machine Learning Research, pp. 419-444, 2002.
[62] U. Iurgel, “Automatic Media Monitoring Using Stochastic Pattern Recognition Techniques,” PhD dissertation, Technische Universität München, Germany, 2007.
[63] T. Athanaselis, S. Bakamidis, I. Dologlu, R. Cowie, E. Douglas-Cowie, and C. Cox, “ASR for Emotional Speech: Clarifying the Issues and Enhancing Performance,” Neural Networks, vol. 18, pp. 437-444, 2005.
[64] M. Wöllmer, F. Eyben, J. Keshet, A. Graves, B. Schuller, and G. Rigoll, “Robust Discriminative Keyword Spotting for Emotionally Colored Spontaneous Speech Using Bidirectional LSTM Networks,” Proc. IEEE Int'l Conf. Acoustics, Speech, and Signal Processing, pp. 3949-3952, 2009.
[65] S. Steidl, A. Batliner, D. Seppi, and B. Schuller, “On the Impact of Children's Emotional Speech on Acoustic and Language Models,” EURASIP J. Audio, Speech, and Music Processing, vol. 2010, pp. 1-15, 2010, doi:10.1155/2010/783954.
[66] D. Seppi, M. Gerosa, B. Schuller, A. Batliner, and S. Steidl, “Detecting Problems in Spoken Child-Computer Interaction,” Proc. First Workshop Child, Computer and Interaction, 2008.
[67] F. Metze, A. Batliner, F. Eyben, T. Polzehl, B. Schuller, and S. Steidl, “Emotion Recognition Using Imperfect Speech Recognition,” Proc. 11th Ann. Conf. Int'l Speech Comm. Assoc., pp. 478-481, 2010.
[68] M. Shaikh, H. Prendinger, and I. Mitsuru, “Assessing Sentiment of Text by Semantic Dependency and Contextual Valence Analysis,” Proc. Second Int'l Conf. Affective Computing and Intelligent Interaction, pp. 191-202, 2007.
[69] A.K. Seewald and F. Kleedorfer, “Lambda Pruning: An Approximation of the String Subsequence Kernel for Practical SVM Classification and Redundancy Clustering,” Advances in Data Analysis and Classification, vol. 1, no. 3, pp. 221-239, 2007.
[70] B. Schuller, R. Müller, F. Eyben, J. Gast, B. Hörnler, M. Wöllmer, G. Rigoll, A. Höthker, and H. Konosu, “Being Bored? Recognising Natural Interest by Extensive Audiovisual Integration for Real-Life Application,” J. Image and Vision Computing, vol. 27, pp. 1760-1774, 2009.
[71] J.B. Lovins, “Development of a Stemming Algorithm,” Mechanical Translation and Computational Linguistics, vol. 11, pp. 22-31, 1968.
[72] M. Porter, “Snowball Programming Language for Stemmers,” http://snowball.tartarus.org/, 4.4.2008, 2011.
[73] A. Batliner, J. Buckow, R. Huber, V. Warnke, E. Nöth, and H. Niemann, “Prosodic Feature Evaluation: Brute Force or Well Designed?,” Proc. 14th Int'l Congress Phonetic Sciences, vol. 3, pp. 2315-2318, 1999.
[74] J. Russell, J. Bachorowski, and J. Fernandez-Dols, “Facial and Vocal Expressions of Emotion,” Ann. Rev. of Psychology, vol. 54, pp. 329-349, 2003.
[75] N. Campbell, H. Kashioka, and R. Ohara, “No Laughing Matter,” Proc. Ann. Conf. Int'l Speech Comm. Assoc., pp. 465-468, 2005.
[76] K. Truong and D. van Leeuwen, “Automatic Detection of Laughter,” Proc. Ann. Conf. Int'l Speech Comm. Assoc., pp. 485-488, 2005.
[77] P. Pal, A. Iyer, and R. Yantorno, “Emotion Detection from Infant Facial Expressions and Cries,” Proc. IEEE Int'l Conf. Acoustics, Speech, and Signal Processing, pp. 809-812, 2006.
[78] S. Matos, S. Birring, I. Pavord, and D. Evans, “Detection of Cough Signals in Continuous Audio Recordings Using Hidden Markov Models,” IEEE Trans. Biomedical Eng., vol. 53, no. 6, pp. 1078-1083, June 2006.
[79] B. Schuller, F. Eyben, and G. Rigoll, “Static and Dynamic Modelling for the Recognition of Non-Verbal Vocalisations in Conversational Speech,” Proc. Fourth IEEE Tutorial and Research Workshop Perception and Interactive Technologies for Speech-Based Systems, pp. 99-110, 2008.
[80] B. Schuller, R. Müller, G. Rigoll, and M. Lang, “Applying Bayesian Belief Networks in Approximate String Matching for Robust Keyword-Based Retrieval,” Proc. IEEE Int'l Conf. Multimedia and Expo, vol. 3, pp. 1999-2002, 2004.
[81] S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. Liu, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland, The HTK Book (v3.4). Cambridge Univ. Press, 2006.
[82] G. Rigoll, “The ALERT-System: Advanced Broadcast Speech Recognition Technology for Selective Dissemination of Multimedia Information,” Proc. IEEE Workshop Automatic Speech Recognition and Understanding, pp. 301-306, 2001.
[83] F. Burkhardt, A. Paeschke, M. Rolfes, W. Sendlmeier, and B. Weiss, “A Database of German Emotional Speech,” Proc. Ann. Conf. Int'l Speech Comm. Assoc., pp. 1517-1520, 2005.
[84] L. Wu, S. Oviatt, and P.R. Cohen, “Multimodal Integration—A Statistical View,” IEEE Trans. Multimedia, vol. 1, no. 4, pp. 334-341, Dec. 1999.
[85] L. Gillick and S.J. Cox, “Some Statistical Issues in the Comparison of Speech Recognition Algorithms,” Proc. Int'l Conf. Acoustics, Speech, and Signal Processing, vol. 1, pp. 23-26, 1989.
[86] J. Pittermann, A. Pittermann, and W. Minker, Handling Emotions in Human-Computer Dialogues. Springer, 2008.
[87] B. Schuller, D. Seppi, A. Batliner, A. Maier, and S. Steidl, “Towards More Reality in the Recognition of Emotional Speech,” Proc. IEEE Int'l Conf. Acoustics, Speech, and Signal Processing, vol. 4, pp. 941-944, 2007.
[88] T. Wilson, J. Wiebe, and R. Hwa, “Just How Mad Are You? Finding Strong and Weak Opinion Clauses,” Proc. Conf. Am. Assoc. for Artificial Intelligence, pp. 761-769, 2004.
[89] A.-M. Popescu and O. Etzioni, “Extracting Product Features and Opinions from Reviews,” Proc. Human Language Technology Conf. and the Conf. Empirical Methods in Natural Language Processing, pp. 339-346, 2005.
[90] S.-M. Kim and E. Hovy, “Automatic Detection of Opinion Bearing Words and Sentences,” Proc. Companion Vol. to the Int'l Joint Conf. Natural Language Processing, pp. 61-66, 2005.
[91] M. Missen and M. Boughanem, “Using WordNet's Semantic Relations for Opinion Detection in Blogs,” Proc. 31st European Conf. IR Research on Advances in Information Retrieval, pp. 729-733, 2009.
[92] J. Yi, T. Nasukawa, R. Bunescu, and W. Niblack, “Sentiment Analyzer: Extracting Sentiments about a Given Topic Using Natural Language Processing Techniques,” Proc. IEEE Int'l Conf. Data Mining, pp. 427-434, 2003.
[93] Z. Fei, X. Huang, and L. Wu, “Mining the Relation between Sentiment Expression and Target Using Dependency of Words,” Proc. 20th Pacific Asia Conf. Language, Information and Computation, 2006.
[94] N. Godbole, M. Srinivasaiah, and S. Skiena, “Large-Scale Sentiment Analysis for News and Blogs,” Proc. Int'l Conf. Weblogs and Social Media, 2007.