Issue No. 5, October 1995 (vol. 10)
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/64.464929
<p>Computers are fast becoming a ubiquitous part of our lives, brought on by their rapid increase in performance and decrease in cost. With their increased availability comes a corresponding increase in our appetite for information. This trend is reflected in the astronomical growth in the number of Internet hosts, the number of home pages on the World Wide Web, and the corresponding network traffic. For example, the 1994 collision of comet Shoemaker-Levy 9 with Jupiter increased the demand for Jupiter images at one host by 40,000 over a one-week period. Vast amounts of useful information are being made widely available, and people are using it routinely for education, decision-making, finance, and entertainment.</p> <p>The advent of the Information Age places increasing demands on the notion of universal access. For information to be truly accessible to all -- especially the technologically naive -- anytime, anywhere, one must seriously address the issue of the user interface. An interface based on a user's own language is particularly appealing, because language is the most natural, flexible, and efficient means of communication among humans.</p> <p>After many years of research, spoken input to computers is just beginning to pass the threshold of practicality. The last decade has witnessed dramatic improvements in speech recognition technology, to the extent that high-performance algorithms and systems are becoming available. In some cases, the transition from laboratory demonstration to commercial deployment has already begun. Speech input capabilities are emerging that can provide functions like voice dialing ("Call home"), call routing ("I would like to make a collect call"), simple data entry (entering a credit card number), and preparation of structured documents (dictating a radiology report).</p> <p>Speech recognition is a very challenging problem in its own right, with a well-defined set of applications. 
However, many tasks that lend themselves to spoken input -- such as making travel arrangements or selecting a movie -- are in fact exercises in interactive problem solving. The solution is often built up incrementally, with both the user and the computer playing active roles in the "conversation." Therefore, several language-based input and output technologies must be developed and integrated to reach this goal. On the input side, speech recognition must be combined with natural language processing so the computer can understand spoken commands (often in the context of earlier parts of the dialogue). On the output side, some of the information provided by the computer -- and any of the computer's requests for clarification -- must be converted to natural sentences, perhaps delivered verbally.</p> <p>In a typical conversational system, the spoken input is first processed by the speech recognition component. The natural language component, working in concert with the recognizer, produces a meaning representation. For the information retrieval applications illustrated in the figure, the system can use the meaning representation to retrieve the appropriate information in the form of text, tables, and graphics. If the information in the utterance is insufficient, the system may choose to query the user for clarification. Speech output can also be generated by passing the information through natural language generation and text-to-speech synthesis. Throughout the process, discourse information is maintained and fed back to the speech recognition and language understanding components.</p> <p>This article illustrates the usefulness of an intuitive, speech-based interface using Galaxy, a system under development at MIT's Laboratory for Computer Science that enables universal information access through spoken dialogue. Galaxy differs from current spoken language systems in a number of ways. 
First, it is distributed and decentralized: Galaxy uses a client-server architecture to share computationally expensive processes (such as large-vocabulary speech recognition) as well as knowledge-intensive processes. Second, it is multidomain: it is intended to provide access to a wide variety of information sources and services while insulating the user from the details of database location and format. Finally, it is extensible: users can add new knowledge-domain servers to the system incrementally.</p>
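The conversational pipeline described above -- recognition, understanding, retrieval or clarification, and response generation, with discourse state fed back between turns -- can be sketched in a few lines of Python. This is a hypothetical toy illustration of the control flow only: the flight table, the keyword-based "understanding," and all component names are stand-ins invented for this sketch, not the Galaxy implementation.

```python
# Toy conversational pipeline: each function is a stand-in for a real
# component (recognizer, language understanding, retrieval, generation).
FLIGHTS = {("boston", "denver"): "Flight 204, departing 9:15 a.m."}

def recognize(audio, discourse):
    # Stand-in for speech recognition: treat the "audio" as text.
    return audio.lower()

def understand(words, discourse):
    # Stand-in for language understanding: build a crude meaning
    # representation from keywords, carrying over prior discourse context.
    meaning = dict(discourse)
    tokens = words.split()
    if "from" in tokens:
        meaning["origin"] = tokens[tokens.index("from") + 1]
    if "to" in tokens:
        meaning["destination"] = tokens[tokens.index("to") + 1]
    return meaning

def respond(meaning):
    # Retrieve if the meaning representation is complete; otherwise
    # query the user for the missing piece (clarification subdialogue).
    if "origin" in meaning and "destination" in meaning:
        return FLIGHTS.get((meaning["origin"], meaning["destination"]),
                           "No flights found.")
    missing = "origin" if "origin" not in meaning else "destination"
    return f"What is your {missing}?"

def turn(audio, discourse):
    # One dialogue turn; updated discourse state feeds later turns.
    meaning = understand(recognize(audio, discourse), discourse)
    discourse.update(meaning)
    return respond(meaning)

discourse = {}
print(turn("I want a flight to Denver", discourse))  # -> What is your origin?
print(turn("from Boston", discourse))  # -> Flight 204, departing 9:15 a.m.
```

The second turn succeeds only because the destination extracted in the first turn persists in the discourse state -- the same feedback loop the article attributes to the conversational system.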
V. W. Zue, "Navigating the Information Superhighway Using Spoken Language Interfaces," in IEEE Intelligent Systems, vol. 10, no. 5, pp. 39-43, 1995.