Architecture, User Interface, and Enabling Technology in Windows Vista's Speech Systems
September 2007 (vol. 56, no. 9)
pp. 1156-1168
Existing speech recognition systems have claimed high accuracy for specific tasks such as dictation. What is new in Windows Speech Recognition for Vista is the combination of high accuracy with high usability across the end-to-end speech experience. This paper describes the architecture, user interface, and key technologies that make up the speech system incorporated in Microsoft Windows Vista. It outlines some of the challenges encountered in providing a speech-based interface to a system as complex and extensible as the modern desktop PC, as well as the technology developments that have made this possible. In particular, the paper describes key elements of the speech user interface and how they preserve the user's ability to control the system despite limitations in the underlying recognition technology. The paper also explains how feedback and adaptation systems tailor the experience to each user's particular speaking style and use of language.
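The extensibility the abstract refers to is exposed to applications through the Speech API (SAPI 5.3) that ships with Vista. As an illustration only (this sketch is not taken from the paper), the following minimal C++ fragment shows how a desktop application might connect to the shared Vista recognizer and activate a command-and-control grammar; the file name commands.grxml is a hypothetical SRGS grammar, and error handling is omitted for brevity.

    // Minimal SAPI 5.3 sketch (illustrative, not from the paper):
    // activate an SRGS command grammar against the shared Vista
    // recognizer and print one recognition result.
    #include <stdio.h>
    #include <atlbase.h>   // CComPtr smart pointers
    #include <sapi.h>
    #include <sphelper.h>

    int main()
    {
        ::CoInitialize(NULL);
        {
            // The shared recognizer is the single engine instance that the
            // Vista speech UI and all speech-enabled applications share.
            CComPtr<ISpRecognizer> recognizer;
            recognizer.CoCreateInstance(CLSID_SpSharedRecognizer);

            CComPtr<ISpRecoContext> context;
            recognizer->CreateRecoContext(&context);

            // Receive recognition events through a Win32 event handle.
            context->SetNotifyWin32Event();
            context->SetInterest(SPFEI(SPEI_RECOGNITION),
                                 SPFEI(SPEI_RECOGNITION));

            // Load a W3C SRGS grammar ("commands.grxml" is hypothetical)
            // and activate all of its top-level rules.
            CComPtr<ISpRecoGrammar> grammar;
            context->CreateGrammar(1, &grammar);
            grammar->LoadCmdFromFile(L"commands.grxml", SPLO_STATIC);
            grammar->SetRuleState(NULL, NULL, SPRS_ACTIVE);

            // Block until the engine reports a recognition, then fetch
            // the recognized text from the result object.
            if (context->WaitForNotifyEvent(INFINITE) == S_OK)
            {
                SPEVENT evt = {0};
                while (context->GetEvents(1, &evt, NULL) == S_OK)
                {
                    if (evt.eEventId == SPEI_RECOGNITION)
                    {
                        CComPtr<ISpRecoResult> result;
                        result.Attach((ISpRecoResult*)evt.lParam);
                        LPWSTR text = NULL;
                        result->GetText(SP_GETWHOLEPHRASE, SP_GETWHOLEPHRASE,
                                        TRUE, &text, NULL);
                        wprintf(L"Heard: %s\n", text);
                        ::CoTaskMemFree(text);
                    }
                }
            }
        }
        ::CoUninitialize();
        return 0;
    }

Because the grammar is registered with the shared recognizer rather than a private engine instance, the application's commands coexist with the system-wide commands provided by the Vista speech user interface described in the paper.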

Index Terms:
Adaptation, Operating Systems, Speech recognition and synthesis, User interfaces
Citation:
Julian Odell, Kunal Mukerjee, "Architecture, User Interface, and Enabling Technology in Windows Vista's Speech Systems," IEEE Transactions on Computers, vol. 56, no. 9, pp. 1156-1168, Sept. 2007, doi:10.1109/TC.2007.1065