Issue No. 01, January-March 2011 (vol. 2)
pp: 10-21
Chung-Hsien Wu , National Cheng Kung University, Tainan
Wei-Bin Liang , National Cheng Kung University, Tainan
This work presents an approach to emotion recognition of affective speech based on multiple classifiers using acoustic-prosodic information (AP) and semantic labels (SLs). For AP-based recognition, acoustic and prosodic features, including spectrum, formant, and pitch-related features, are extracted from the detected emotionally salient segments of the input speech. Three types of models, GMMs, SVMs, and MLPs, are adopted as base-level classifiers. A Meta Decision Tree (MDT) is then employed for classifier fusion to obtain the AP-based emotion recognition confidence. For SL-based recognition, semantic labels derived from HowNet, an existing Chinese knowledge base, are used to automatically extract Emotion Association Rules (EARs) from the recognized word sequence of the affective speech. A maximum entropy model (MaxEnt) is then used to characterize the relationship between emotional states and EARs for emotion recognition. Finally, a weighted product fusion method integrates the AP-based and SL-based recognition results into the final emotion decision. For evaluation, 2,033 utterances covering four emotional states (Neutral, Happy, Angry, and Sad) were collected. The speaker-independent experimental results reveal that MDT-based fusion achieves 80.00 percent recognition accuracy, outperforming each individual classifier, while SL-based recognition achieves an average accuracy of 80.92 percent. Combining acoustic-prosodic information and semantic labels raises accuracy to 83.55 percent, superior to either the AP-based or the SL-based approach alone. Moreover, when individual personality traits are considered for personalized applications, the recognition accuracy of the proposed approach improves further to 85.79 percent.
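The final decision step described in the abstract, a weighted product fusion of the AP-based and SL-based confidences, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the weight `alpha`, and the per-stream confidence values are all hypothetical.

```python
import numpy as np

EMOTIONS = ["Neutral", "Happy", "Angry", "Sad"]

def weighted_product_fusion(p_ap, p_sl, alpha=0.5):
    """Fuse two per-emotion confidence vectors by a weighted product.

    p_ap  -- AP-based confidences (e.g., the MDT fusion output)
    p_sl  -- SL-based confidences (e.g., the MaxEnt output over EARs)
    alpha -- weight on the acoustic-prosodic stream (illustrative value)
    """
    p_ap = np.asarray(p_ap, dtype=float)
    p_sl = np.asarray(p_sl, dtype=float)
    fused = (p_ap ** alpha) * (p_sl ** (1.0 - alpha))
    fused /= fused.sum()  # renormalize so the result is a distribution
    return fused

# Hypothetical confidences for one utterance from the two recognizers
p_ap = [0.10, 0.55, 0.25, 0.10]
p_sl = [0.05, 0.70, 0.15, 0.10]

fused = weighted_product_fusion(p_ap, p_sl, alpha=0.6)
decision = EMOTIONS[int(np.argmax(fused))]
print(decision)  # prints "Happy"
```

The product (rather than sum) form means a stream that assigns near-zero confidence to an emotion can effectively veto it, which is why both streams must agree for a class to win.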
Emotion recognition, acoustic-prosodic features, semantic labels, meta decision trees, personality trait.
Chung-Hsien Wu, Wei-Bin Liang, "Emotion Recognition of Affective Speech Based on Multiple Classifiers Using Acoustic-Prosodic Information and Semantic Labels", IEEE Transactions on Affective Computing, vol.2, no. 1, pp. 10-21, January-March 2011, doi:10.1109/T-AFFC.2010.16
[1] J. Liu, Y. Xu, S. Seneff, and V. Zue, “CityBrowser II: A Multimodal Restaurant Guide in Mandarin,” Proc. Int'l Symp. Chinese Spoken Language Processing, pp. 1-4, 2008.
[2] C.-H. Wu and G.-L. Yan , “Speech Act Modeling and Verification of Spontaneous Speech with Disfluency in a Spoken Dialogue System,” IEEE Trans. Speech and Audio Processing, vol. 13, no. 3, pp. 330-344, May 2005.
[3] N. Roy , J. Pineau , and S. Thrun , “Spoken Dialogue Management Using Probabilistic Reasoning,” Proc. Ann. Meeting Assoc. for Computational Linguistics, pp. 93-100, 2000.
[4] D. Jurafsky, R. Ranganath, and D. McFarland, “Extracting Social Meaning: Identifying Interactional Style in Spoken Conversation,” Proc. Human Language Technologies: The 2009 Ann. Conf. North Am. Chapter of the Assoc. for Computational Linguistics, pp. 638-646, 2009.
[5] C.-H. Wu , Z.-J. Chuang , and Y.-C. Lin , “Emotion Recognition from Text Using Semantic Label and Separable Mixture Model,” ACM Trans. Asian Language Information Processing, vol. 5, no. 2, pp. 165-182, June 2006.
[6] C. De Silva and P.C. Ng, “Bimodal Emotion Recognition,” Proc. IEEE Int'l Conf. Automatic Face and Gesture Recognition, pp. 332-335, 2000.
[7] R. López-Cózar , Z. Callejas , M. Kroul , J. Nouza , and J. Silovský , “Two-Level Fusion to Improve Emotion Classification in Spoken Dialogue System,” Lecture Notes in Artificial Intelligence, pp. 617-624, Springer-Verlag, 2008.
[8] B. Schuller , G. Rigoll , and M. Lang , “Speech Emotion Recognition Combining Acoustic Features and Linguistic Information in a Hybrid Support Vector Machine-Belief Network Architecture,” Proc. IEEE Int'l Conf. Acoustics, Speech, and Signal Processing, pp. 17-21, 2004.
[9] T. Vogt and E. André , “Exploring the Benefits of Discretization of Acoustic Features for Speech Emotion Recognition,” Proc. Int'l Speech Comm. Assoc., pp. 328-331, 2009.
[10] B. Schuller , S. Steidl , and A. Batliner , “The INTERSPEECH 2009 Emotion Challenge,” Proc. Int'l Speech Comm. Assoc., pp. 312-315, 2009.
[11] F. Yu , E. Chang , Y.-Q. Xu , and H.-Y. Shum , “Emotion Detection from Speech to Enrich Multimedia Content,” Proc. IEEE Pacific-Rim Conf. Multimedia, pp. 500-557, 2001.
[12] N. Amir , S. Ziv , and R. Cohen , “Characteristics of Authentic Anger in Hebrew Speech,” Proc. European Conf. Speech Comm. and Technology, pp. 713-716, 2003.
[13] C.-M. Lee and S.S. Narayanan , “Toward Detecting Emotions in Spoken Dialogs,” IEEE Trans. Speech and Audio Processing, vol. 13, no. 2, pp. 293-303, Mar. 2005.
[14] L. Devillers , L. Lamel , and I. Vasilescu , “Emotion Detection in Task-Oriented Spoken Dialogues,” Proc. IEEE Int'l Conf. Multimedia and Expo, pp. 549-552, 2003.
[15] I. Luengo, E. Navas, and I. Hernáez, “Combining Spectral and Prosodic Information for Emotion Recognition in the Interspeech 2009 Emotion Challenge,” Proc. Int'l Speech Comm. Assoc., pp. 332-335, 2009.
[16] C.-C. Lee , E. Mower , C. Busso , S. Lee , and S. Narayanan , “Emotion Recognition Using Hierarchical Binary Decision Tree Approach,” Proc. Int'l Speech Comm. Assoc., pp. 320-323, 2009.
[17] L. Todorovski and S. Dzeroski , “Combining Classifiers with Meta Decision Trees,” Machine Learning, vol. 50, no. 3, pp. 223-249, 2003.
[18] A. Terracciano , M.S. Merritt , A.B. Zonderman , and M.K. Evans , “Personality Traits and Sex Differences in Emotion Recognition among African Americans and Caucasians,” Ann. New York Academy of Sciences, vol. 1000, pp. 309-312, Dec. 2003.
[19] H.J. Eysenck and S.B.G. Eysenck , Manual of the Eysenck Personality Questionnaire. Hodder and Stoughton, 1975.
[20] Z. Dong and Q. Dong, HowNet, http:/, 2010.
[21] A. Berger , S. Della Pietra , and V. Della Pietra , “A Maximum Entropy Approach to Natural Language Processing,” Computational Linguistics, vol. 22, no. 1, pp. 39-71, 1996.
[22] M. Slaney and G. McRoberts , “A Recognition System for Affective Vocalization,” Speech Comm., vol. 39, pp. 367-384, 2003.
[23] A. Paeschke and W. Sendlmeier , “Prosodic Characteristics of Emotional Speech: Measurements of Fundamental Frequency Movements,” Proc. Int'l Speech Comm. Assoc. Tutorial and Research Workshop Speech and Emotion, pp. 75-80, 2000.
[24] T. Nwe , S. Foo , and L. De Silva , “Speech Emotion Recognition Using Hidden Markov Models,” Speech Comm., vol. 41, no. 4, pp. 603-623, 2003.
[25] C.-H. Wu and Z.-J. Chuang , “Emotion Recognition from Speech Using IG-Based Feature Compensation,” Int'l J. Computational Linguistics and Chinese Language Processing, vol. 12, no. 1, pp. 65-78, 2007.
[26] X. Huang , A. Acero , and H.-W. Hon , “Prosody,” Spoken Language Processing: A Guide to Theory, Algorithm, and System Development, first ed., ch. 15, Section 15.4.4, pp. 753-755, Prentice Hall PTR, 2005.
[27] C.-H. Wu and J.-H. Chen , “Automatic Generation of Synthesis Units and Prosodic Information for Chinese Concatenative Synthesis,” Speech Comm., vol. 35, pp. 219-237, 2001.
[28] E. Shriberg , A. Stolcke , D. Hakkani-Tur , and G. Tur , “Prosody-Based Automatic Segmentation of Speech into Sentences and Topics,” Speech Comm., vol. 32, nos. 1/2, pp. 124-154, 2000.
[29] V. Vapnik, The Nature of Statistical Learning Theory. Springer-Verlag, 2005.
[30] J.C. Platt , “Probabilities for SV Machines,” Advances in Large Margin Classifiers, pp. 61-74, MIT Press, 2000.
[31] V. Petrushin , “Emotion Recognition in Speech Signal: Experimental Study, Development, and Application,” Proc. Int'l Conf. Spoken Language Processing, pp. 222-225, 2000.
[32] J.R. Quinlan, C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[33] H. Soltau and A. Waibel , “Acoustic Models for Hyperarticulated Speech,” Proc. Int'l Conf. Spoken Language Processing, 2000.
[34] R.S. Lazarus and B.N. Lazarus , Passion and Reason: Making Sense of Our Emotions. Oxford Univ. Press, 1996.
[35] R. Agrawal , T. Imielinski , and A.N. Swami , “Mining Association Rules between Sets of Items in Large Databases,” Proc. ACM SIGMOD, pp. 207-216, 1993.
[36] Personality Test, 2010.
[37] Temperament: A Brief Survey, with Modern Applications, http://intraspec.catemper0.php, 2010.
[38] S.J. Young , G. Evermann , M.J.F. Gales , T. Hain , D. Kershaw , G. Moore , J. Odell , D. Ollason , D. Povey , V. Valtchev , and P.C. Woodland , The HTK Book, Version 3.4. Cambridge Univ. Press, http:/, 2010.
[39] P. Boersma and D. Weenink, Praat: Doing Phonetics by Computer (Version 5.1.05), http:/www., 2010.
[40] P. Boersma, “Accurate Short-Term Analysis of the Fundamental Frequency and the Harmonics-to-Noise Ratio of a Sampled Sound,” Proc. Inst. of Phonetic Sciences, vol. 17, pp. 97-110, 1993.
[41] C.-C. Chang and C.-J. Lin, LIBSVM: A Library for Support Vector Machines, 2010.
[42] Z. Le, Maximum Entropy Modeling Toolkit for Python and C++, 2010.
[43] P.A. Devijver and J. Kittler , Pattern Recognition: A Statistical Approach. Prentice-Hall, 1982.
[44] J. Tao and T. Tan, “Emotion Perception and Recognition from Speech,” Affective Information Processing, ch. 6, Section 6.2, pp. 96-97, Springer, 2009.