Acoustics, Speech, and Signal Processing, IEEE International Conference on (2009)
Apr. 19, 2009 to Apr. 24, 2009
A. Gerald Friedland, International Computer Science Institute, 1947 Center Street Suite 600, Berkeley, CA, 94704, USA
B. Oriol Vinyals, International Computer Science Institute, 1947 Center Street Suite 600, Berkeley, CA, 94704, USA
C. Yan Huang, International Computer Science Institute, 1947 Center Street Suite 600, Berkeley, CA, 94704, USA
D. Christian Muller, German Research Center for AI, Stuhlsatzenhausweg 3, 66123 Saarbrücken, Germany
This article shows how a state-of-the-art speaker diarization system can be improved by combining traditional short-term features (MFCCs) with prosodic and other long-term features. First, we present a framework for studying the speaker discriminability of 70 different long-term features. Then, we show how the top-ranked long-term features can be combined with short-term features to increase the accuracy of speaker diarization. The results were measured on standardized data sets (NIST RT) and show a consistent relative improvement of about 30% in diarization error rate compared to the best system presented at the NIST evaluation in 2007. This result was also verified on a broader set of 21 meetings from previous evaluations, which we call CombDev. Since the prosodic and long-term features were selected using a diarization-independent speaker-discriminability study, we are confident that the same features can improve other systems that perform similar tasks.
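The "about 30% relative" figure refers to a relative change in diarization error rate (DER), not a drop of 30 percentage points. A minimal sketch of that arithmetic follows. The function name `relative_der_reduction` and the DER values are hypothetical, chosen only for illustration. They are not results from the paper.

```python
def relative_der_reduction(baseline_der: float, new_der: float) -> float:
    """Return the relative reduction in diarization error rate as a fraction.

    Both arguments are DER values in the same unit (here, percent).
    """
    if baseline_der <= 0:
        raise ValueError("baseline_der must be positive")
    return (baseline_der - new_der) / baseline_der


# Hypothetical numbers for illustration only; NOT taken from the paper.
baseline = 20.0  # DER of a short-term-feature baseline, in percent
fused = 14.0     # DER after adding long-term features, in percent

reduction = relative_der_reduction(baseline, fused)
print(f"{reduction:.0%} relative reduction")
```

With these assumed values, the absolute DER falls by 6 points while the relative reduction is 30%, which is the sense in which the abstract reports its improvement.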
C. Yan Huang, B. Oriol Vinyals, A. Gerald Friedland and D. C. Muller, "Fusing short term and long term features for improved speaker diarization," Acoustics, Speech, and Signal Processing, IEEE International Conference on (ICASSP), Taipei, Taiwan, 2009, pp. 4077-4080.