Issue No. 05 - May (2012 vol. 34)
M. Hasegawa-Johnson, Dept. of Electr. & Comput. Eng., Univ. of Illinois at Urbana-Champaign, Urbana, IL, USA
S. M. Chu, Human Language Technol. Group, IBM T.J. Watson Res. Center, Yorktown Heights, NY, USA
Hao Tang, HP Labs., Palo Alto, CA, USA
T. S. Huang, Dept. of Electr. & Comput. Eng., Univ. of Illinois at Urbana-Champaign, Urbana, IL, USA
Content-based multimedia indexing, retrieval, and processing, as well as multimedia databases, demand the structuring of the media content (image, audio, video, text, etc.), one significant goal being to associate the identity of the content with the individual segments of the signals. In this paper, we specifically address the problem of speaker clustering, the task of assigning every speech utterance in an audio stream to its speaker. We offer a complete treatment of the idea of partially supervised speaker clustering, which refers to the use of our prior knowledge of speakers in general to assist the unsupervised speaker clustering process. By means of an independent training data set, we encode the prior knowledge at the various stages of the speaker clustering pipeline via 1) learning a speaker-discriminative acoustic feature transformation, 2) learning a universal speaker prior model, and 3) learning a discriminative speaker subspace, or equivalently, a speaker-discriminative distance metric. We study the directional scattering property of the Gaussian mixture model (GMM) mean supervector representation of utterances in the high-dimensional space, and advocate exploiting this property by using the cosine distance metric instead of the Euclidean distance metric for speaker clustering in the GMM mean supervector space. We propose to perform discriminant analysis based on the cosine distance metric, which leads to a novel distance metric learning algorithm, linear spherical discriminant analysis (LSDA). We show that the proposed LSDA formulation can be systematically solved within the elegant graph embedding general dimensionality reduction framework.
Our speaker clustering experiments on the GALE database clearly indicate that 1) our speaker clustering methods based on the GMM mean supervector representation and vector-based distance metrics outperform traditional speaker clustering methods based on the “bag of acoustic features” representation and statistical model-based distance metrics, 2) our advocated use of the cosine distance metric yields consistent increases in the speaker clustering performance as compared to the commonly used Euclidean distance metric, 3) our partially supervised speaker clustering concept and strategies significantly improve the speaker clustering performance over the baselines, and 4) our proposed LSDA algorithm further leads to state-of-the-art speaker clustering performance.
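To make the cosine-versus-Euclidean argument concrete, the following is a minimal sketch and not the paper's implementation. It assumes toy "supervectors" that scatter along one direction per speaker with widely varying magnitude, which is only a stand-in for the directional scattering property described above; the data, the function names, and the use of average-linkage agglomerative clustering are all illustrative assumptions.

```python
import numpy as np

def cosine_distance_matrix(X):
    """Pairwise cosine distance, 1 - cos(angle), between the rows of X."""
    U = X / np.linalg.norm(X, axis=1, keepdims=True)
    return 1.0 - np.clip(U @ U.T, -1.0, 1.0)

def euclidean_distance_matrix(X):
    """Pairwise Euclidean distance between the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.linalg.norm(diff, axis=2)

def agglomerative(D, n_clusters):
    """Average-linkage agglomerative clustering on a distance matrix D.

    An illustrative choice of clustering algorithm, not taken from the paper.
    """
    clusters = [[i] for i in range(D.shape[0])]
    while len(clusters) > n_clusters:
        best = (np.inf, 0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = D[np.ix_(clusters[a], clusters[b])].mean()
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    labels = np.empty(D.shape[0], dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels

# Hypothetical toy data: two "speakers", each with its own direction in a
# 50-dimensional space; utterances differ mainly in magnitude along it.
rng = np.random.default_rng(0)
dim = 50
directions = np.eye(dim)[:2]               # one unit direction per speaker
scales = np.array([0.5, 1.0, 2.0, 4.0])    # per-utterance magnitudes
X = np.vstack([s * d for d in directions for s in scales])
X = X + 0.01 * rng.standard_normal(X.shape)  # small perturbation

labels_cos = agglomerative(cosine_distance_matrix(X), n_clusters=2)
labels_euc = agglomerative(euclidean_distance_matrix(X), n_clusters=2)
print("cosine   :", labels_cos.tolist())
print("euclidean:", labels_euc.tolist())
```

In this toy setup the cosine-based clustering groups the first four utterances and the last four utterances separately, while the Euclidean-based clustering tends to group by magnitude and mixes the two speakers. Real GMM mean supervectors are far higher-dimensional and noisier, so this only illustrates the intuition, not the reported results.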
speaker recognition, content-based retrieval, Gaussian processes, graph theory, learning (artificial intelligence), multimedia databases, pattern clustering, statistical model-based distance metrics, partially supervised speaker clustering, content-based multimedia retrieval, content-based multimedia indexing, content-based multimedia processing, speech utterance, unsupervised speaker clustering process, speaker-discriminative acoustic feature transformation, universal speaker prior model, discriminative speaker subspace, speaker-discriminative distance metric, directional scattering property, Gaussian mixture model mean supervector representation, cosine distance metric, distance metric learning algorithm, linear spherical discriminant analysis, GALE database, GMM mean supervector representation, vector-based distance metrics, Measurement, Acoustics, Speech, Feature extraction, Pipelines, Training data, Training, distance metric learning, Speaker clustering, partial supervision
M. Hasegawa-Johnson, S. M. Chu, Hao Tang, T. S. Huang, "Partially Supervised Speaker Clustering", IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 34, no. 5, pp. 959-971, May 2012, doi:10.1109/TPAMI.2011.174