Generation of Phonetic Units for Mixed-Language Speech Recognition Based on Acoustic and Contextual Analysis
September 2007 (vol. 56, no. 9), pp. 1225-1233
This work presents a novel approach to generating phonetic units for recognizing mixed-language or multilingual speech. Acoustic and contextual analyses are performed to characterize multilingual phonetic units for phone set creation. Acoustic likelihood is used to estimate the similarity between phone models, and the hyperspace analog to language (HAL) model is adopted for contextual modeling and contextual similarity estimation. A confusion matrix combining the acoustic and contextual similarities between every pair of phonetic units is then built for phonetic unit clustering, and the multidimensional scaling (MDS) method is applied to this matrix to reduce its dimensionality. Experimental results indicate that the generated phone set is compact and robust, incorporating both acoustic and contextual information for mixed-language or multilingual speech recognition.
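As a rough illustration of the pipeline the abstract describes, the Python sketch below combines two inter-phone similarity matrices into a confusion matrix, embeds the phones in a low-dimensional space with MDS, and clusters the embedded points into merged phonetic units. This is not the authors' implementation: the phone labels, the random stand-in similarity matrices, the weight alpha, and the cluster count are all illustrative assumptions, and scikit-learn's MDS and KMeans stand in for whatever the paper actually used.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import MDS

rng = np.random.default_rng(0)

# Hypothetical multilingual phone inventory and random symmetric
# stand-ins for the two similarity measures. In the paper, acoustic
# similarity comes from acoustic likelihoods of phone models and
# contextual similarity from HAL co-occurrence vectors.
phones = ["a", "i", "u", "p", "t", "k", "m", "n"]
n = len(phones)

def random_similarity(n):
    s = rng.random((n, n))
    s = (s + s.T) / 2          # make symmetric
    np.fill_diagonal(s, 1.0)   # each phone is maximally similar to itself
    return s

acoustic_sim = random_similarity(n)
contextual_sim = random_similarity(n)

# Combine the two measures into a single confusion matrix.
# The linear weighting (alpha) is an assumption for this sketch.
alpha = 0.5
confusion = alpha * acoustic_sim + (1.0 - alpha) * contextual_sim

# MDS expects dissimilarities, so convert similarity to distance
# (zero on the diagonal, as a precomputed MDS input requires).
dissimilarity = 1.0 - confusion

# Embed the phones in a low-dimensional space with metric MDS.
coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(dissimilarity)

# Cluster the embedded phones; each cluster becomes one merged
# phonetic unit in the reduced phone set.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(coords)
for phone, cluster in sorted(zip(phones, labels), key=lambda x: x[1]):
    print(f"cluster {cluster}: phone {phone!r}")

In practice the embedding dimensionality and the number of clusters would be tuned against recognition accuracy on mixed-language data rather than fixed in advance.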

Index Terms:
Mixed-language speech recognition, phonetic unit, hyperspace analog to language, multidimensional scaling
Citation:
Chien-Lin Huang, Chung-Hsien Wu, "Generation of Phonetic Units for Mixed-Language Speech Recognition Based on Acoustic and Contextual Analysis," IEEE Transactions on Computers, vol. 56, no. 9, pp. 1225-1233, Sept. 2007, doi:10.1109/TC.2007.1064