A Foundational Perspective for Visual Information Retrieval

Qi, University of Texas at San Antonio

Pages: pp. 90-92

X.S. Zhou, Y. Rui, and T.S. Huang, Exploration of Visual Data, Kluwer Academic Publishers, 2003, $106, 187 pp., ISBN 1-4020-7569-3.

One of the most challenging and fast-growing research areas is the content extraction, indexing, and retrieval of multimedia data. New applications—digital libraries, education and training, media commerce, entertainment, and bio-computing, for example—have created a global need for new paradigms and techniques to browse, search, and summarize multimedia collections.

To meet this need, computer science and electrical engineering curricula increasingly need to offer courses in visual information retrieval (VIR). The fundamental components of a VIR system include visual content description and representation, similarity and distance measures (or classifiers), indexing schemes, user interaction, and system performance evaluation.


In their textbook, Exploration of Visual Data, the authors have successfully developed an integrated framework for VIR. They based the book on their experience and contributions to multimedia content analysis research and development.

Zhou et al. argue that the semantic gap between low-level visual features and high-level semantic concepts is the primary cause behind inadequate VIR. The textbook's focus is on bridging the semantic gap by seeking better content descriptors for images and video analysis, representation, and indexing, and by developing advanced learning algorithms for user- or perception-guided information access and exploration.

Discussion of chapters

Chapter 1 explains the challenges and briefly reviews state-of-the-art techniques in visual information retrieval. The book is then organized into four parts: low-level visual features (chapters 2, 3, and 4); content-sensitive segmentation, indexing, and nonlinear access of videos (chapters 5 and 6); relevance feedback and small-sample learning algorithms (chapter 7); and the mixed use of textual and visual features (chapter 8).

Low-level visual features

Chapter 2 presents an overview of visual features, classified as either general or domain-specific. General features include color, texture, shape, and structure; domain-specific features are application-dependent and may include, for example, human faces. Visual feature (content) extraction is the basis of content-based exploration techniques for visual data. This chapter and its associated reference papers show readers the current research and development frontier in visual information representation.

Chapter 3, which discusses an edge-based structural feature for image representation, provides a detailed algorithm for effectively extracting such features. Specifically, the authors propose a water-filling algorithm to extract features from the edge map directly, without edge linking or shape representation. The idea is that this highly efficient graph-traversal algorithm yields measures of edge length, edge structure, and complexity. The algorithm, in effect, simulates the "flooding of connected canal systems" (p. 19). This structural feature for image content representation is a necessary complement to existing color, texture, and shape descriptors.
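To make the flooding idea concrete, here is a minimal sketch of one way such a traversal could work. It is not the authors' algorithm: the function name `water_fill`, the use of 8-connectivity, the raster-order starting pixel, and the two reported measures (a "filling time" and a "water amount") are all assumptions made for illustration.

```python
from collections import deque


def water_fill(edge_map):
    """Flood each connected set of edge pixels in a binary edge map.

    Returns one (filling_time, water_amount) pair per connected component:
    filling_time is the number of breadth-first steps needed to reach the
    farthest pixel from the starting pixel, and water_amount is the number
    of edge pixels filled. Both measures are illustrative assumptions.
    """
    rows, cols = len(edge_map), len(edge_map[0])
    seen = [[False] * cols for _ in range(rows)]
    results = []
    for r in range(rows):
        for c in range(cols):
            if not edge_map[r][c] or seen[r][c]:
                continue
            # Start pouring "water" at the first unvisited edge pixel.
            seen[r][c] = True
            queue = deque([(r, c, 0)])
            amount, time = 0, 0
            while queue:
                y, x, t = queue.popleft()
                amount += 1
                time = max(time, t)
                # Spread to all 8-connected neighboring edge pixels.
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and edge_map[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx, t + 1))
            results.append((time, amount))
    return results
```

Statistics over these per-component pairs (for example, the longest filling time) could then serve as a compact structural descriptor, with no edge linking required.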

In contrast to the global feature discussed in chapter 3, chapter 4 introduces a probabilistic appearance and structure model to capture local information for images and objects. The joint distribution of k-tuple salient-point feature vectors is factored by components after an independent component analysis, and is used to model the objects' appearance. Experiments yield promising results in image retrieval as well as in robust object localization in cluttered scenes.

Video indexing and access

Of all the media types (text, image, graphic, audio, and video), video is the most challenging to researchers because it combines all other media information into a single data stream.

Chapter 5 brings the table-of-contents (ToC) concept, so familiar from books, to the video domain. The authors review and evaluate video parsing techniques at various levels (shots, groups, and scenes) and present an effective scene-level ToC construction technique based on intelligent, unsupervised clustering. Because the clustering models time locality and scene structure, it performs better than approaches without scene-level construction.

Examples in this chapter demonstrate use of the scene-based ToC to facilitate users' access to video. A video is summarized into a tree-structured ToC. The video ToC contains three layers of abstraction: shots, groups of visually similar shots, and semantic scenes, each of which contains one or more groups that are intertwined in time. Each of these units is represented by an automatically selected key-frame. A user can quickly navigate the video by clicking on any unit in the ToC. The authors' proposed approach provides an open framework for analyzing the video structure. Features other than those described in chapter 5, such as audio (speech and background music) and text (closed captions), can be readily incorporated for constructing a video ToC. An appropriate fusion of these modalities should result in a more semantically correct video ToC.
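The three-layer tree can be pictured as a small data structure. The sketch below is only an illustration of that shape, not the book's implementation: the class names `Shot`, `Group`, and `Scene`, their fields, and the choice of a group's first shot as its key-frame are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Shot:
    start_frame: int
    end_frame: int
    key_frame: int  # frame index chosen to represent the shot


@dataclass
class Group:
    """Visually similar shots, which need not be adjacent in time."""
    shots: List[Shot] = field(default_factory=list)

    @property
    def key_frame(self) -> int:
        # Assumption: represent the group by its first shot's key-frame.
        return self.shots[0].key_frame


@dataclass
class Scene:
    """One or more groups whose shots are intertwined in time."""
    groups: List[Group] = field(default_factory=list)

    def shots_in_time_order(self) -> List[Shot]:
        # Flatten the groups and restore playback order for navigation.
        all_shots = [shot for group in self.groups for shot in group.shots]
        return sorted(all_shots, key=lambda shot: shot.start_frame)
```

A dialog scene, for instance, might hold two groups (one per speaker) whose shots alternate; `shots_in_time_order` recovers the playback sequence when a user clicks on the scene.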

In considering channel and buffer constraints for streaming of stored videos over low bit-rate channels, existing commercial solutions are as primitive as the manual trial-and-error approach, which is time-consuming and unsuitable for mass production. In commercial solutions, for example, key-frames are manually selected for streaming. Users need a channel simulator to detect unstreamable frames. Based on the simulation errors, users must manually revise the streaming plan and simulate streaming again until the simulator returns no errors. The contributions of chapter 6 include

  • novel schemes for modeling the channel and buffer in the video temporal sampling problem;
  • analysis and development of the corresponding efficient algorithms for finding the global optimal solution; and
  • the extension and analysis of these algorithms for practical application scenarios.

The chapter's proposed algorithms have enabled the automated production of a new form of video streaming over low-bit-rate channels for devices with limited storage.
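The kind of check a channel simulator performs in the manual workflow described above can be sketched in a few lines. This is a toy model under stated assumptions, not the chapter's formulation: it assumes a constant-bit-rate channel, frames sent back to back in display order, and no buffer limit at the receiver. The function name `unstreamable_frames` and its parameters are hypothetical.

```python
def unstreamable_frames(frames, bits_per_second, startup_delay=0.0):
    """Return indices of key-frames that arrive after their display time.

    frames: list of (display_time_seconds, size_bits), sorted by display time.
    Assumes a constant-rate channel and ignores receiver buffer capacity.
    """
    late = []
    sent_bits = 0
    for index, (display_time, size_bits) in enumerate(frames):
        sent_bits += size_bits
        # Time at which the last bit of this frame has been received.
        arrival = sent_bits / bits_per_second
        if arrival > display_time + startup_delay:
            late.append(index)
    return late
```

In the trial-and-error workflow, a user would drop or replace the reported frames and run the check again until it returns an empty list; the chapter's contribution is to replace that loop with algorithms that find an optimal selection directly.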

Learning in relevance feedback

Chapter 7, on relevance feedback and small-sample learning algorithms, is the book's largest chapter, and reviewers consider it the book's most important contribution. To bridge the semantic gap, the authors were among the first to introduce the relevance feedback framework to VIR. Relevance feedback was originally developed in the text information retrieval community, but it has attracted more attention in the image domain and is still an active research field.

It's a formidable task to explain the wide range of techniques involved in so many different schemes of relevance feedback. Drawing on their long-term research and development experience and their understanding of learning techniques in relevance feedback, the authors have nonetheless presented relevance feedback variations in an easily accessible and logical way. The chapter focuses on relevance feedback's evolution from heuristics to an optimal scheme, from feedback with positive examples to feedback with both positive and negative examples, and from two-class classification to a biased (1 + χ)-class classification.
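For readers new to the idea, a heuristic scheme using both positive and negative examples can be illustrated with the classic Rocchio query update from text retrieval. This is offered only as background and is not one of the book's algorithms; the function name and the default weights `alpha`, `beta`, and `gamma` are illustrative choices.

```python
def rocchio(query, positives, negatives, alpha=1.0, beta=0.75, gamma=0.25):
    """Move a query feature vector toward relevant examples and away from
    non-relevant ones: q' = alpha*q + beta*mean(pos) - gamma*mean(neg)."""

    def mean(vectors):
        # An empty feedback set contributes a zero vector.
        if not vectors:
            return [0.0] * len(query)
        return [sum(column) / len(vectors) for column in zip(*vectors)]

    pos_mean = mean(positives)
    neg_mean = mean(negatives)
    return [alpha * q + beta * p - gamma * n
            for q, p, n in zip(query, pos_mean, neg_mean)]
```

Each round of user feedback produces a refined query vector, which is then used to rank the image collection again.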

One of chapter 7's most significant contributions is a proposed optimal learning algorithm specifically designed for relevance feedback during visual information retrieval, including the BiasMap algorithm. The depth of explanation and high level of detail, accompanied by figures and mathematical equations, guide readers through concepts and algorithms. The basic ideas in the relevance feedback framework have inspired many exciting research projects that are ongoing in multimedia information retrieval.

Learning semantic relations

Chapter 8 concentrates on unifying keywords and low-level contents in image retrieval. It introduces a pseudoclassification algorithm, word association via relevance feedback (WARF), for learning the term-similarity matrix during user interaction. Users can apply this learned similarity matrix, specific to the data set and users, to keyword semantic grouping, thesaurus construction, and soft-query expansion during intelligent image retrieval.


The book comprehensively covers the authors' recently developed techniques in VIR. These techniques are presented gradually, from low-level visual feature representation to mid-level learning in relevance feedback, and then to semantic retrieval in a hybrid (keyword and visual content) feature space. One of the book's primary accomplishments is its visual presentation of the content: the many figures make the material more enjoyable to read and easier to understand.

This book does, however, leave something to be desired. First, the book offers an overly condensed view of multiple topics in related fields, and in some cases doesn't provide an adequate background and introduction (this is the case with the vast field of video indexing and retrieval). Therefore, it's not suitable as a textbook for undergraduate or early graduate courses—although it makes an excellent reference book for graduate students in related fields of study. A second drawback is that some aspects of visual data representation aren't covered in depth. Examples include shape descriptors, color space and representations, and motion analysis and representation.

Overall, however, this is a well-written book by authors with a good grasp of the subject (they're experienced researchers, and are practicing in many of the subfields of multimedia). I recommend the book as a good reference source on the state-of-the-art technologies for, as the book's preface states, those

practitioners in the field of image and video computing, … [and] … graduate students and senior undergraduate students of computer science or electrical engineering, working in the areas of image processing, computer vision, or machine learning toward image and video applications.