Speaker Verification via High-Level Feature Based Phonetic-Class Pronunciation Modeling
September 2007 (vol. 56, no. 9), pp. 1189-1198
It has recently been shown that the pronunciation characteristics of speakers can be represented by articulatory feature-based conditional pronunciation models (AFCPMs). However, these pronunciation models are phoneme-dependent, which may lead to speaker models with low discriminative power when the amount of enrollment data is limited. This paper proposes to mitigate this problem by grouping similar phonemes into phonetic classes and representing the background and speaker models as phonetic-class-dependent density functions. Phonemes are grouped by (1) vector quantizing the discrete densities in the phoneme-dependent universal background models, (2) using the phone properties specified in the classical phoneme tree, or (3) combining vector quantization and phone properties. Evaluations based on the NIST 2000 Speaker Recognition Evaluation (SRE) show that this phonetic-class approach effectively alleviates the data-sparseness problem encountered in conventional AFCPM, resulting in better performance when fused with acoustic features.
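
To make grouping method (1) concrete, the following is a minimal Python sketch, not the authors' code, of how phonemes might be clustered into phonetic classes by vector-quantizing their UBM densities. It assumes each phoneme-dependent UBM can be summarized as a discrete probability vector over articulatory-feature labels; the names (vq_phonetic_classes, phoneme_ubms) and the toy data are illustrative assumptions.

import numpy as np

def vq_phonetic_classes(phoneme_ubms, n_classes, n_iter=100, seed=0):
    """Cluster phoneme-dependent discrete densities into phonetic classes.

    phoneme_ubms: dict mapping phoneme -> 1-D probability vector (an
                  assumed summary of that phoneme's UBM discrete density).
    n_classes:    number of phonetic classes (the VQ codebook size).
    Returns a dict mapping phoneme -> phonetic-class index.
    """
    phonemes = sorted(phoneme_ubms)
    X = np.stack([phoneme_ubms[p] for p in phonemes])  # (n_phonemes, dim)

    # Initialize the codebook with randomly chosen phoneme densities.
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), n_classes, replace=False)]

    for _ in range(n_iter):
        # VQ assignment step: map each density to its nearest centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        # Codebook update: re-estimate centroids, keeping the old one
        # if a class becomes empty.
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(n_classes)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    return {p: int(k) for p, k in zip(phonemes, labels)}

# Toy usage: two vowel-like phonemes with similar densities should share
# a phonetic class, while the stop-like phoneme falls in another.
ubms = {
    "aa": np.array([0.70, 0.20, 0.10]),
    "ae": np.array([0.65, 0.25, 0.10]),
    "t":  np.array([0.10, 0.10, 0.80]),
}
print(vq_phonetic_classes(ubms, n_classes=2))

Under this sketch, phonemes assigned to the same cluster would pool their enrollment frames when estimating the phonetic-class-dependent background and speaker densities, which is the mechanism the abstract credits for relieving data sparseness.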

Index Terms:
Speaker verification, pronunciation modeling, articulatory features, phonetic classes, NIST speaker recognition evaluation
Citation:
Shi-Xiong Zhang, Man-Wai Mak, Helen Meng, "Speaker Verification via High-Level Feature Based Phonetic-Class Pronunciation Modeling," IEEE Transactions on Computers, vol. 56, no. 9, pp. 1189-1198, Sept. 2007, doi:10.1109/TC.2007.1081