Issue No. 01 - Jan.-March 2012 (vol. 3), pp. 116-125
S. Ntalampiras, Electr. & Comput. Eng. Dept., Univ. of Patras, Patras, Greece
N. Fakotakis, Electr. & Comput. Eng. Dept., Univ. of Patras, Patras, Greece
ABSTRACT
During recent years, the field of emotional content analysis of speech signals has been gaining considerable attention, and several frameworks have been constructed by different researchers for the recognition of human emotions in spoken utterances. This paper describes a series of exhaustive experiments which demonstrate the feasibility of recognizing human emotional states by integrating low-level descriptors. Our aim is to investigate three different methodologies for integrating subsequent feature values. More specifically, we used the following methods: 1) short-term statistics, 2) spectral moments, and 3) autoregressive models. Additionally, we employed a newly introduced group of parameters based on wavelet decomposition. These are compared with a baseline set comprising descriptors commonly used for this task. Subsequently, we experimented with fusing these sets at the feature and log-likelihood levels. The classification step is based on hidden Markov models, while several algorithms which can handle redundant information were used during fusion. We report results on the well-known and freely available BERLIN database using data from six emotional states. Our experiments show the importance of including the information captured by the multiresolution-analysis-based set and the efficacy of merging subsequent feature values.
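To make the temporal-integration idea concrete, the Python/NumPy sketch below is purely illustrative and is not the authors' implementation: under assumed inputs (a window of frame-level descriptors such as MFCC frames), it shows how two of the integration schemes named in the abstract, short-term statistics and autoregressive modeling, map a sequence of low-level feature vectors to a single integrated vector. All function and variable names are hypothetical.

# Illustrative sketch only (not the paper's code): temporal integration of
# frame-level descriptors by (1) short-term statistics and (2) a least-squares
# autoregressive fit over an assumed texture window of consecutive frames.
import numpy as np

def integrate_statistics(frames):
    """frames: (T, D) array of D-dimensional low-level descriptors.
    Returns the per-window mean and standard deviation, concatenated."""
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])

def integrate_ar(frames, order=1):
    """Fits an AR(order) model per feature dimension by least squares;
    the stacked coefficients serve as the integrated feature vector."""
    T, D = frames.shape
    coeffs = []
    for d in range(D):
        x = frames[:, d]
        # Regression matrix of lagged samples: column k holds x[t-1-k].
        X = np.column_stack([x[order - k - 1:T - k - 1] for k in range(order)])
        y = x[order:]
        a, *_ = np.linalg.lstsq(X, y, rcond=None)
        coeffs.append(a)
    return np.concatenate(coeffs)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    window = rng.standard_normal((50, 13))     # e.g., 50 frames of 13 MFCCs
    print(integrate_statistics(window).shape)  # (26,)
    print(integrate_ar(window, order=2).shape) # (26,)

In the paper's pipeline, such integrated vectors would then be modeled by hidden Markov models and fused at the feature or log-likelihood level; this sketch stops short of that step.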
INDEX TERMS
wavelet transforms, emotion recognition, hidden Markov models, speech recognition, statistical analysis, temporal evolution modeling, acoustic parameters, speech emotion recognition, emotional content analysis, speech signals, human emotions, spoken utterances, low-level descriptor integration, short-term statistics, spectral moments, autoregressive models, wavelet decomposition, speech, feature extraction, computational modeling, databases, acoustic signal processing, temporal feature integration
CITATION
S. Ntalampiras, N. Fakotakis, "Modeling the Temporal Evolution of Acoustic Parameters for Speech Emotion Recognition", IEEE Transactions on Affective Computing, vol. 3, no. 1, pp. 116-125, Jan.-March 2012, doi:10.1109/T-AFFC.2011.31
REFERENCES
[1] C.M. Lee and S.S. Narayanan, “Toward Detecting Emotions in Spoken Dialogs,” IEEE Trans. Speech and Audio Processing, vol. 13, no. 2, pp. 293-303, Mar. 2005.
[2] M.E. Ayadi, M.S. Kamel, and F. Karray, “Survey on Speech Emotion Recognition: Features, Classification Schemes, and Databases,” Pattern Recognition, vol. 44, no. 3, pp. 572-587, Mar. 2011.
[3] G. Zhou, J.H.L. Hansen, and J.F. Kaiser, “Nonlinear Feature Based Classification of Speech under Stress,” IEEE Trans. Speech and Audio Processing, vol. 9, no. 2, pp. 201-216, Mar. 2001.
[4] Y. Li and Y. Zhao, “Recognizing Emotions in Speech Using Short-Term and Long-Term Features,” Proc. Int'l Conf. Spoken Language Processing, pp. 2255-2258, 1998.
[5] D.N. Jiang and L.-H. Cai, “Speech Emotion Classification with the Combination of Statistic Features and Temporal Features,” Proc. Int'l Conf. Multimedia and Expo, pp. 1967-1970, 2004.
[6] B. Vlasenko, B. Schuller, A. Wendemuth, and G. Rigoll, “Combining Frame and Turn-Level Information for Robust Recognition of Emotions within Speech,” Proc. Int'l Conf. Spoken Language Processing, pp. 2225-2228, 2007.
[7] B. Vlasenko, B. Schuller, A. Wendemuth, and G. Rigoll, “Frame vs. Turn-Level: Emotion Recognition from Speech Considering Static and Dynamic Processing,” Proc. Second Int'l Conf. Affective Computing and Intelligent Interaction, A. Paiva, ed., pp. 139-147, 2007.
[8] S. Wu, T.H. Falk, and W.-Y. Chan, “Automatic Recognition of Speech Emotion Using Long-Term Spectro-temporal Features,” Proc. Int'l Conf. Digital Signal Processing, pp. 205-210, 2009.
[9] A. Meng, P. Ahrendt, J. Larsen, and L.K. Hansen, “Temporal Feature Integration for Music Genre Classification,” IEEE Trans. Audio, Speech, and Language Processing, vol. 15, no. 5, pp. 1654-1664, July 2007.
[10] C. Joder, S. Essid, and G. Richard, “Temporal Integration for Audio Classification with Application to Musical Instrument Classification,” IEEE Trans. Audio, Speech, and Language Processing, vol. 17, no. 1, pp. 174-186, Jan. 2009.
[11] S. Ntalampiras, I. Potamitis, and N. Fakotakis, “Exploiting Temporal Feature Integration for Generalized Sound Recognition,” EURASIP J. Advances in Signal Processing, vol. 2009, Article ID 807162, 2009, doi:10.1155/2009/807162.
[12] G. Zhou, J.H.L. Hansen, and J.F. Kaiser, “Methods for Stress Classification: Nonlinear TEO and Linear Speech Based Features,” Proc. IEEE Int'l Conf. Acoustics, Speech, and Signal Processing, pp. 2087-2090, 1999.
[13] R. Fernandez and R.W. Picard, “Modeling Drivers' Speech under Stress,” Proc. Int'l Speech Comm. Assoc. Workshop Speech and Emotions, 2000.
[14] S. Ntalampiras, I. Potamitis, and N. Fakotakis, “A Multidomain Approach for Automatic Home Environmental Sound Classification,” Proc. 11th Ann. Conf. of the Int'l Speech Comm. Assoc., pp. 2210-2213, 2010.
[15] B. Scharf, “Critical Bands,” Foundations of Modern Auditory Theory, J.V. Tobias, ed., pp. 157-202, Academic Press, 1970.
[16] W.A. Yost, Fundamentals of Hearing, third ed., pp. 153-167. Academic Press, 1994.
[17] Torch Machine Learning Library, http://www.torch.ch, 2012.
[18] J.-J. Aucouturier, B. Defreville, and F. Pachet, “The Bag-of-Frames Approach to Audio Pattern Recognition: A Sufficient Model for Urban Soundscapes but Not for Polyphonic Music,” J. Acoustical Soc. of Am., vol. 122, no. 2, pp. 881-891, Aug. 2007.
[19] M.M.H. El Ayadi, M.S. Kamel, and F. Karray, “Speech Emotion Recognition Using Gaussian Mixture Vector Autoregressive Models,” Proc. IEEE Int'l Conf. Acoustics, Speech, and Signal Processing, vol. 4, pp. 957-960, 2007.
[20] L. Fu, X. Mao, and L. Chen, “Speaker Independent Emotion Recognition Based on SVM/HMMs Fusion System,” Proc. Int'l Conf. Audio, Language, and Image Processing, pp. 61-65, 2008.
[21] T. Schneider and A. Neumaier, “Algorithm 808: ARFIT—A Matlab Package for the Estimation of Parameters and Eigenmodes of Multivariate Autoregressive Models,” ACM Trans. Math. Software, vol. 27, no. 1, pp. 58-65, Mar. 2001.
[22] P. Delsarte and Y.V. Genin, “The Split Levinson Algorithm,” IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 34, no. 3, pp. 470-478, June 1986.
[23] F. Burkhardt, A. Paeschke, M. Rolfes, W. Sendlmeier, and B. Weiss, “A Database of German Emotional Speech,” Proc. Int'l Conf. Spoken Language Processing, pp. 1517-1520, 2005.
[24] I.H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, second ed. Morgan Kaufmann, 2005.
[25] N. Landwehr, M. Hall, and E. Frank, “Logistic Model Trees,” Proc. European Conf. Machine Learning, pp. 241-252, 2003.