Guest Editors' Introduction: Intelligent Information Retrieval

Yiming Yang, Carnegie Mellon University
Jan Pedersen, Infoseek

Pages: pp. 30-31

Intelligent information retrieval—finding information truly relevant to a user's need—has become increasingly important because of the dramatically growing availability of online documents. Addressing this task requires synergy between information-retrieval techniques and AI research because automatic learning and prediction about the relevance of a document's content to the user's information need are crucial for satisfactory solutions. This special issue presents a set of creative approaches to new and important challenges in a variety of applications, including detection and tracking of novel events from news stories, automatic assignment of subject categories to articles, customized routing of e-mail messages, search and navigation through the World Wide Web, and indexing and retrieval of multimedia documents.

Event detection and tracking is a relatively new task in the field of information retrieval. The objective is to automatically detect novel events from chronologically ordered streams of news stories and to track events of interest over time. For this task, Yiming Yang and her colleagues investigate the effective use of information-retrieval and machine-learning techniques (see "Learning Approaches for Detecting and Tracking News Events," this issue). They extended existing supervised-learning and unsupervised-clustering algorithms to allow document classification based on both the information content and the temporal aspects of events. Conducting a task-oriented evaluation using Reuters and CNN news stories, they found that agglomerative document clustering is highly effective for retrospective-event detection but that single-pass clustering with time windowing is better for online alerting of novel events. For event tracking under the difficult condition of an extremely small number of positive training examples, k-nearest neighbor classification and a decision-tree approach demonstrated robust learning behavior.
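As a rough illustration of single-pass clustering with time windowing, the Python sketch below assigns each incoming story to the most similar recently active cluster, or starts a new cluster and flags the story as novel. It is not the authors' implementation: the raw term-frequency vectors, the cosine threshold, the window measured in number of stories, and all function and parameter names are assumptions made for this example.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def single_pass_cluster(stories, threshold=0.3, window=3):
    """Cluster chronologically ordered stories in one pass.

    A story joins the best-matching cluster only if that cluster was
    updated within the last `window` stories (hypothetical time window)
    and the similarity reaches `threshold`; otherwise it is novel.
    """
    clusters = []  # each: {"centroid": Counter, "last": int, "members": [int]}
    novel = []     # indices of stories that started a new cluster
    for i, text in enumerate(stories):
        vec = Counter(text.lower().split())
        best, best_sim = None, 0.0
        for c in clusters:
            if i - c["last"] > window:
                continue  # cluster is outside the time window
            sim = cosine(vec, c["centroid"])
            if sim > best_sim:
                best, best_sim = c, sim
        if best is not None and best_sim >= threshold:
            best["centroid"].update(vec)  # add term counts to the centroid
            best["last"] = i
            best["members"].append(i)
        else:
            clusters.append({"centroid": vec, "last": i, "members": [i]})
            novel.append(i)
    return clusters, novel
```

In this sketch, a narrow window makes an old topic count as novel again once its cluster has gone quiet, while a wide window lets late stories rejoin the earlier cluster.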

For searching, filtering, and navigating on the Web and the Internet, Dunja Mladenic surveys a broad range of ongoing research on intelligent agents employing information retrieval, machine learning, natural-language processing, and other related methods (see "Text Learning and Related Intelligent Agents"). She compares two frequently used approaches to developing intelligent agents: content-based and collaborative. In the first approach, the content (for example, text) plays an important role, while in the second approach, several knowledge sources (for example, several users) are required. Mladenic gives examples of intelligent agents for locating information on the Web, filtering Usenet news, and browsing with WebWatcher, a user-customized agent.

For hypertext browsing, Francis Crimmins and his colleagues propose a two-stage analytic tool for accessing and researching Web content (see "TétraFusion: Information Discovery on the Internet"). The first stage is a metasearcher that employs pseudofeedback for automatic query expansion. The second is a data-mining tool adapted from the bibliographic-record domain. This system visually presents results of various analyses for user inspection. Such a system creates an information structure through which users can navigate and lets them specify a reduced domain of interest for further browsing.
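The general idea of pseudofeedback for query expansion can be sketched in a few lines of Python: treat the top-ranked results of an initial search as if they were relevant, and add their most frequent new terms to the query. This is a minimal sketch of the generic technique, not the TétraFusion metasearcher; the function name, the whitespace tokenization, and the `top_k` and `n_terms` parameters are assumptions.

```python
from collections import Counter


def expand_query(query, ranked_docs, top_k=2, n_terms=2):
    """Expand a query with frequent terms from the top-ranked documents.

    `ranked_docs` is assumed to be the result list of an initial search,
    best match first. No user judgment is involved: the top `top_k`
    documents are simply assumed to be relevant.
    """
    query_terms = query.lower().split()
    counts = Counter()
    for doc in ranked_docs[:top_k]:
        # Count only terms that are not already in the query.
        counts.update(t for t in doc.lower().split() if t not in query_terms)
    extra = [term for term, _ in counts.most_common(n_terms)]
    return query_terms + extra


results = ["jaguar car engine", "jaguar car dealer", "jaguar cat habitat"]
expanded = expand_query("jaguar", results, top_k=2, n_terms=1)
```

A practical system would typically also remove stop words and weight candidate terms rather than rely on raw counts.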

Text categorization is the problem of automatically assigning predefined categories to documents. For this domain, Sholom Weiss and his colleagues present a new text-mining approach that uses an adaptive-resampling (boosting) strategy to train their decision-tree classifiers (see "Maximizing Text-Mining Performance"). This approach performed significantly better than the conventional decision-tree approach without boosting. Of all the classifiers evaluated on the Reuters-Apte collection, their approach has the best results reported so far, establishing the benchmark for state-of-the-art text-categorization systems. Applying these techniques to online-banking applications shows strong potential for automated routing of customer e-mail to the appropriate responder.
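To make the boosting idea concrete, here is a small Python sketch in the AdaBoost style. It departs from the approach described above in two labeled ways: it reweights training documents instead of resampling them, and it uses one-term decision stumps instead of full decision trees. All names, the set-of-terms document representation, and the toy routing data are assumptions for illustration.

```python
import math


def train_boosted_stumps(docs, labels, rounds=3):
    """Train an ensemble of one-term stumps on binary labels (+1 / -1).

    `docs` is a list of term sets. Each round picks the stump with the
    lowest weighted error, then raises the weight of misclassified docs.
    """
    n = len(docs)
    weights = [1.0 / n] * n
    vocab = sorted(set().union(*docs))
    ensemble = []  # list of (alpha, term, polarity)
    for _ in range(rounds):
        best = None
        for term in vocab:
            for polarity in (1, -1):
                # Stump: predict `polarity` if the term is present.
                preds = [polarity if term in d else -polarity for d in docs]
                err = sum(w for w, p, y in zip(weights, preds, labels) if p != y)
                if best is None or err < best[0]:
                    best = (err, term, polarity, preds)
        err, term, polarity, preds = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, term, polarity))
        weights = [w * math.exp(-alpha * y * p)
                   for w, p, y in zip(weights, preds, labels)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, doc):
    """Weighted vote of all stumps; returns +1 or -1."""
    score = sum(a * (pol if term in doc else -pol) for a, term, pol in ensemble)
    return 1 if score >= 0 else -1


# Toy e-mail routing data: +1 = lending desk, -1 = account support.
docs = [{"loan", "rate"}, {"loan", "mortgage"},
        {"password", "reset"}, {"login", "password"}]
labels = [1, 1, -1, -1]
model = train_boosted_stumps(docs, labels)
```

On real text collections the weak learners rarely separate the classes perfectly, which is where combining many reweighted rounds pays off.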

For multimedia information retrieval, Madirakshi Das, Raghavan Manmatha, and Edward M. Riseman describe a system to retrieve patent-application images of flowers using either natural language (color names) or image-similarity queries (image examples). (This article will appear in the Sept./Oct. issue.) They propose a new approach using the color and spatial domain knowledge for a specialized database. They describe in depth their method to segment flower regions from an image. The final segmented flower region is represented by its color name as well as quantitative color features (average color) for querying by natural language or by similarity.

About the Authors

Yiming Yang is an associate professor at Carnegie Mellon University's Language Technologies Institute and Computer Science Department. Her research interests include information retrieval, text classification, and statistical learning algorithms. She received her PhD in computer science from Kyoto University, Japan. Contact her at the Language Technologies Inst., Carnegie Mellon Univ., Pittsburgh, PA 15213-3072.
Jan Pedersen works for Infoseek. Contact him at the Infoseek Corp., 1399 Moffett Park Dr., Sunnyvale, CA 94089-1134.