Carnegie Mellon University
Pages: 34-35
With the advent of relatively cheap, large online storage capacities and advances in digital compression, comprehensive sources of text, image, video, and audio (TIVA) can be stored and made available for research and applications. The processing of a single medium has seen significant progress, especially for pure text sources. Images, too, are frequently processed and made available through query by example (that is, finding other images whose colors, textures, and shapes are similar to those of a given image).
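As a rough illustration of query by example, the sketch below ranks a small collection by color similarity alone, using quantized color histograms compared by histogram intersection. It is only one simple way to realize the idea, not a method taken from the articles in this issue; texture and shape are ignored, and the function names, the four-bin quantization, and the toy pixel data are all assumptions made for the example.

```python
from collections import Counter


def color_histogram(pixels, bins=4):
    """Normalized histogram over quantized (r, g, b) values in 0..255."""
    step = 256 // bins
    counts = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    total = len(pixels)
    return {cell: n / total for cell, n in counts.items()}


def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; higher means more similar color distributions."""
    return sum(min(weight, h2.get(cell, 0.0)) for cell, weight in h1.items())


def query_by_example(example, database):
    """Return (name, score) pairs sorted from most to least similar."""
    query_hist = color_histogram(example)
    scores = [
        (name, histogram_intersection(query_hist, color_histogram(pixels)))
        for name, pixels in database.items()
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Toy data (hypothetical): each "image" is a flat list of RGB pixels.
database = {
    "sunset": [(250, 120, 30)] * 8 + [(200, 60, 20)] * 2,
    "forest": [(20, 140, 40)] * 9 + [(90, 60, 30)] * 1,
    "beach": [(240, 220, 150)] * 5 + [(40, 120, 220)] * 5,
}
example = [(245, 110, 25)] * 7 + [(210, 50, 10)] * 3

ranking = query_by_example(example, database)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

In this toy collection the orange-toned example shares color cells only with the "sunset" entry, so that entry ranks first. A practical system would add texture and shape features and an index so that the whole collection need not be scanned for every query.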
However, the processing of a combination of multiple types of data has not been explored as thoroughly. Most TIVA sources were not produced with computer processing in mind. In contrast with text processing, few effective methods exist for understanding or even searching the content of combined TIVA sources. Intelligent, content-understanding systems can greatly improve the usefulness of the huge quantities of existing material from these sources. Collecting and intelligently integrating several of these media sources opens up opportunities for novel applications of existing AI techniques and for further development of intelligent technologies. Unfortunately, there is no clear categorization or organization of the various research efforts concerning mixed-media databases.
However, in recent years several workshops have focused on multimedia databases, learning, and their integration, thus spurring research. This special issue presents examples of current research and potential future contributions of intelligent, integrated systems using TIVA sources. These articles demonstrate exchange and cross-fertilization across the fields of vision, speech processing, natural-language processing, machine learning, and information retrieval. They all describe interesting combinations across multiple media, looking at how large amounts of data can be extracted, integrated into another system, and used in applications.
In "Named Faces: Putting Names to Faces," Ricky Houghton elegantly combines a variety of approaches—face recognition in images, OCR over the text on the screen, and Web spiders. The resulting application constructs a database and allows queries to that database.
In "Learning to Recognize Speech by Watching Television," Photina Jang and I describe a method that leverages the closed-captioned text to provide training data for any speech-recognition system.
"Image Retrieval Agent: Integrating Image Content and Text," by Jesus Favela and Victoria Meza, looks at ways to search for images found on the Web. The authors combine query by example in the visual domain with traditional text retrieval.
Finally, in "Retrieving Related TV News Reports and Newspaper Articles," Yasuhiko Watanabe, Yoshihiro Okada, Kengo Kaneji, and Yoshitaka Saka discuss a way to align television and newspaper articles on the same news item.
In terms of potential impact, fields ranging from medicine (mixed-media patient records and data, evolving over time), to entertainment (video, audio, and images accessed over the Web), to education (multimedia training materials, searching historical and scholarly collections), to business and military information gathering could all benefit from advances in the processing of combined text, image, video, and audio data.
Ample opportunities exist for cross-disciplinary work: digital signal processing is used for the basic processing of images, voice, and video. Research in very large databases provides clustering techniques and various tree-based access methods. AI and machine learning provide tools for classification and learning. Statistics provides tools to discover trends and analyze the data. Information retrieval provides time-tested fundamental text indexing and search techniques, which can be combined with visual and audio material.
Several generic long-term problems are open research questions: