Speaker Diarization For Multiple-Distant-Microphone Meetings Using Several Sources of Information
September 2007 (vol. 56 no. 9)
pp. 1212-1224
Human-machine interaction in meetings requires the localization and identification of the speakers interacting with the system, as well as the recognition of the words spoken. A seminal step toward this goal is the field of rich transcription research, which includes speaker diarization together with the annotation of sentence boundaries and the elimination of speaker disfluencies. The sub-area of speaker diarization attempts to identify the number of participants in a meeting and to create a list of speech time intervals for each such participant. In this paper, we analyze the correlation between signals coming from multiple microphones and propose an improved method for carrying out speaker diarization for meetings with multiple distant microphones. The proposed algorithm makes use of acoustic information and of information from the delays between signals coming from the different sources. Using this procedure, we were able to achieve state-of-the-art performance in the NIST Spring 2006 Rich Transcription evaluation, improving the Diarization Error Rate (DER) by 15% to 20% relative to previous systems.
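The delay information mentioned in the abstract comes from estimating, for each pair of distant microphones, the time offset at which the two channel signals best align. As a minimal illustrative sketch (not the paper's actual implementation, which uses more robust cross-correlation processing), the delay between a reference channel and another channel can be found by maximizing their cross-correlation over a bounded lag range; the function and signal names below are hypothetical:

```python
# Illustrative sketch: time-delay estimation between two microphone channels
# by exhaustive cross-correlation over a bounded lag range. Real systems use
# more robust estimators (e.g., phase-transform-weighted correlation), but the
# principle -- pick the lag that maximizes channel similarity -- is the same.

def estimate_delay(ref, other, max_lag):
    """Return the lag (in samples) of `other` relative to `ref` that
    maximizes their cross-correlation; positive means `other` lags."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(len(ref)):
            j = i + lag
            if 0 <= j < len(other):
                score += ref[i] * other[j]
        if score > best_score:
            best_score, best_lag = score, lag
    return best_lag

# Toy example: `other` is `ref` delayed by 3 samples.
ref = [0.0, 0.0, 1.0, 0.5, -0.3, 0.0, 0.0, 0.0]
other = [0.0] * 3 + ref[:-3]          # delayed copy of the reference channel
print(estimate_delay(ref, other, 4))  # recovers the 3-sample delay
```

Per-pair delays computed this way vary with the active speaker's position, which is what makes them usable as features, alongside acoustic features, for clustering speech into speaker turns.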


Index Terms:
Speech source separation, speaker diarization, speaker segmentation, meetings recognition, rich transcription
Citation:
José Pardo, Xavier Anguera, Chuck Wooters, "Speaker Diarization For Multiple-Distant-Microphone Meetings Using Several Sources of Information," IEEE Transactions on Computers, vol. 56, no. 9, pp. 1212-1224, Sept. 2007, doi:10.1109/TC.2007.1077