Advances in Robust Multimodal Interface Design
September/October 2003 (vol. 23 no. 5)
pp. 62-68
Sharon Oviatt, Oregon Health and Science University

A well-designed multimodal interface that fuses two or more information sources can be an effective means of substantially reducing recognition uncertainty. Robustness advantages in multimodal systems have been demonstrated for different modality combinations, for varied tasks, and in different usage environments. Perhaps most importantly, the error suppression achievable with a multimodal system, compared with a unimodal one, can be in excess of 40 percent. In addition to improving overall recognition rates, a multimodal interface can perform in a more stable and effective manner across a variety of challenging user groups and real-world settings. This article reviews recent demonstrations of enhanced robustness for three types of multimodal interfaces, including ones that process speech and pen, speech and lip movement, and multibiometrics (physiological and behavioral) inputs. It concludes by discussing general design strategies for optimizing the robustness of future multimodal systems.
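As a rough illustration of the two ideas in this abstract, the sketch below shows how an error suppression figure can be computed as a relative error reduction, and how fusing two recognizers' ranked hypotheses can overrule a wrong top choice from one of them. It is a minimal sketch under stated assumptions, not the architecture described in the article: the function names, the product-of-scores fusion rule, and all scores and error rates are hypothetical and chosen only for illustration.

```python
def relative_error_reduction(unimodal_error: float, multimodal_error: float) -> float:
    """Fraction of the unimodal error rate eliminated by the multimodal system."""
    if unimodal_error <= 0:
        raise ValueError("unimodal_error must be positive")
    return (unimodal_error - multimodal_error) / unimodal_error


def fuse_nbest(speech: dict[str, float], pen: dict[str, float]) -> str:
    """Late fusion (assumed rule): choose the label with the highest score product.

    Only labels proposed by both recognizers are considered, so a top-ranked
    hypothesis in one modality can be overruled by evidence from the other.
    """
    shared = speech.keys() & pen.keys()
    if not shared:
        raise ValueError("no interpretation is supported by both modalities")
    return max(shared, key=lambda label: speech[label] * pen[label])


# Hypothetical scores: speech alone ranks "pen" first, but the pen
# recognizer's evidence moves the joint choice to "pan".
speech_scores = {"pen": 0.5, "pan": 0.4, "pin": 0.1}
pen_scores = {"pan": 0.7, "pin": 0.2, "pen": 0.1}
best = fuse_nbest(speech_scores, pen_scores)

# Hypothetical error rates: 20% unimodal versus 11% multimodal.
rer = relative_error_reduction(0.20, 0.11)
print(best, f"{rer:.0%}")
```

With these made-up numbers, the multimodal system removes a little under half of the unimodal errors, which is the kind of quantity the "in excess of 40 percent" figure refers to.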

Sharon Oviatt, "Advances in Robust Multimodal Interface Design," IEEE Computer Graphics and Applications, vol. 23, no. 5, pp. 62-68, Sept.-Oct. 2003, doi:10.1109/MCG.2003.1231179