PRESENCE: A Human-Inspired Architecture for Speech-Based Human-Machine Interaction
September 2007 (vol. 56, no. 9)
pp. 1176-1188
Recent years have seen steady improvements in the quality and performance of speech-based human-machine interaction, driven by a significant convergence in the methods and techniques employed. However, the quantity of training data required to improve state-of-the-art systems seems to be growing exponentially, and performance appears to be asymptoting to a level that may be inadequate for many real-world applications. This suggests that there may be a fundamental flaw in the underlying architecture of contemporary systems, as well as a failure to capitalize on the combinatorial properties of human spoken language. This paper addresses these issues and presents a novel architecture for speech-based human-machine interaction inspired by recent findings in the neurobiology of living systems. Called PRESENCE (PREdictive SENsorimotor Control and Emulation), this new architecture blurs the distinction between the core components of a traditional spoken language dialogue system and instead focuses on a recursive hierarchical feedback-control structure. Cooperative and communicative behavior emerges as a by-product of an architecture founded on a model of interaction in which the system has in mind the needs and intentions of the user, and the user has in mind the needs and intentions of the system.
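
The abstract's central idea, a recursive hierarchy of predictive feedback-control loops in which each layer emulates the sensory consequences of its own output, can be sketched in a few lines of code. The paper itself gives no implementation, so the Python fragment below is purely illustrative: every name in it (ControlLayer, gain, subgoal, and so on) is invented for this example, and the scalar dynamics are a minimal stand-in for the sensorimotor machinery the architecture actually describes.

    # Illustrative sketch only: all names here are invented for this
    # example. It shows one way a recursive hierarchy of predictive
    # feedback-control loops might be wired together.

    class ControlLayer:
        """One layer of the hierarchy: it compares a goal with what it
        senses, predicts (emulates) the consequence of its response, and
        either passes a sub-goal to the layer below or acts directly."""

        def __init__(self, gain, child=None):
            self.gain = gain        # how strongly goal/percept error drives output
            self.child = child      # next layer down, or None at the bottom
            self.prediction = 0.0   # emulator's expected next sensory value

        def step(self, goal, sensed):
            # Mismatch between prediction and reality: a large "surprise"
            # is the kind of signal that could trigger model revision.
            surprise = sensed - self.prediction
            error = goal - sensed
            if self.child is not None:
                # Higher layers do not act on the world; they recursively
                # hand a sub-goal (a percept to aim for) to the layer below.
                subgoal = sensed + self.gain * error
                self.prediction = subgoal
                return self.child.step(subgoal, sensed)
            # The lowest layer emits a motor action and predicts its effect.
            action = self.gain * error
            self.prediction = sensed + action
            return action, surprise

    # Toy usage: a two-layer stack drives a scalar percept toward a goal.
    stack = ControlLayer(gain=0.5, child=ControlLayer(gain=0.8))
    percept = 0.0
    for _ in range(20):
        action, surprise = stack.step(goal=1.0, sensed=percept)
        percept += action   # trivial stand-in for the external world
    print(f"final percept: {percept:.3f}")   # converges toward the goal

In PRESENCE, the corresponding loops would operate over speech percepts and motor programs rather than a scalar, and a second, mirrored stack would emulate the interlocutor's needs and intentions; the sketch above shows only the control-and-emulation skeleton.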

Index Terms:
automatic speech recognition, speech synthesis, spoken language dialogue
Citation:
Roger Moore, "PRESENCE: A Human-Inspired Architecture for Speech-Based Human-Machine Interaction," IEEE Transactions on Computers, vol. 56, no. 9, pp. 1176-1188, Sept. 2007, doi:10.1109/TC.2007.1080