Guest Editors' Introduction: Bridging the Semantic Gap with Computational Media Aesthetics

Chitra Dorai, IBM T.J. Watson Research Center
Svetha Venkatesh, Curtin University, Australia

Pages: pp. 15-17

Content processing and analysis research in multimedia systems has one central objective: develop technologies that help sift and easily access useful nuggets of information from media data streams. A fundamental need exists to analyze, cull, and categorize information automatically and systematically from media data and to manage and exploit it effectively despite rapidly accumulating digital media collections.

However, user expectations of such systems are far from being met, despite continued research for nearly a decade. Currently, only simple, generic, low-level content metadata is made available from analysis. This metadata isn't always useful because it deals primarily with representing the perceived content rather than its semantics.

In the last few years, we've seen much attention given to the semantic gap problem in automatic content annotation systems. The semantic gap is the gulf between the rich meaning and interpretation that users expect systems to associate with their queries for searching and browsing media and the shallow, low-level features (content descriptions) that the systems actually compute. For more information on this dilemma, see Smeulders et al. [1], who discuss the problem at length and lament that while "the user seeks semantic similarity, the database can only provide similarity on data processing."


To address this fundamental issue of the data-meaning gulf and to build innovative, high-level, semantics-based content description tools for reliable media location, access, and navigation services, we proposed an approach called computational media aesthetics [2]. We define this approach as the algorithmic study of a variety of image, space, and aural elements employed in media. This study is based on the media elements' usage patterns in production and the associated computational analysis of the principles that have emerged while clarifying, intensifying, and interpreting some event for the audience. We advocate here that if we're going to create tools for automatically understanding video, it's usually best to interpret the data with its maker's eye.

Numerous stakeholders are engaged in this endeavor. They represent the whole multimedia value chain and are involved in content design, authoring, production, archiving, management, distribution, and delivery. Each of these fields brings with it different facets of the semantic issue, thus emphasizing the need and importance of media semantics research in the broadest sense.

The general knowledge-guided semantic analysis in media is exciting to many researchers who are frustrated with the continued focus on low-level features that can't answer high-level queries from real users. They're applying this principled approach to interpreting diverse video domains such as movies, instructional media, surveillance, and so on, with well-grounded research. For this special issue of IEEE MultiMedia, our goal is to show some of the different aspects of this growing research. We attempt to broadly paint a picture of emerging themes and show the influence of computational media aesthetics. In what follows, we briefly describe the contributions of each of the four articles appearing in the current issue.


In "Where Does Computational Media Aesthetics Fit?" Adams provides a comprehensive survey of existing approaches to multimedia content management and examines them according to the tenets of computational media aesthetics. He highlights two types of indices generated as a result of general content processing—structural elements and content entities—and groups popular techniques accordingly. He raises important questions evaluating the effectiveness of different approaches, data sets to benchmark, and semantic inference validation mechanisms. Finally, he positions computational media aesthetics as a viable framework addressing some of the questions he raises.

The second article, "Pivot Vector Space Approach for Audio-Video Mixing," illustrates computational media aesthetics in practice. Here Kankanhalli et al. automate audio-video mixing of home videos. Their approach includes exploiting aesthetic principles used in mixing music and moving images to guide the decision-making process and to adeptly match audio and video clips. They correlate the video shots with audio clips using a set of high-level perceptual audiovisual descriptors extracted and matched on the basis of aesthetic heuristics with pivot space mapping.

In "Sounding Objects," Rocchesso et al. take on the issue of sound design for interactive multimedia systems, describing the need for designing sounds that richly convey information about the environment while simultaneously providing aesthetically interesting interface elements. This article explains how a perception-guided sound design can help decipher ecologically relevant auditory phenomena and expressively deliver faithful environmental information. The article argues for the use of cartoon-like physical models of sound (simplified sound descriptions with specific features exaggerated), thus realizing computational efficiency and sharpness in the sounds created.

In "Editing out Video Editing," Davis makes the case for a new computational model for media production that can enable mass production of video for consumers. At its core, media production is a computational process which, based on input media and parameters, can produce new content-exploiting capabilities. This model transforms media creation from an expensive, craft-based production into a standardized process with reusable parts that users can combine for mass customization. The article describes the research issues involved in such a transformation and provides examples of connectable and reusable media structures.


Together, these articles begin to address the fundamental issues spanning the data-meaning gulf by offering a systematic understanding and application of media production methods. However, the efforts toward building computational frameworks to bridge the semantic gap are only beginning. We still need to examine production principles for

  • manipulation of affect and meaning;
  • the representation, extraction, and synthesis of expressive elements in movies and video; and
  • metrics to assess automatic extraction techniques and representational power of expressive elements.

Solutions to these issues will spur the development of novel production practices that will blur the distinction between content annotation and production. Computationally understanding expressive elements will in turn allow new and exciting modes of capture and artistic manipulation of media.

We hope that readers will find this special issue an enjoyable mix and a spotlight on new themes emerging in the field. We're grateful to all the reviewers for carefully poring over the submissions. We also want to thank the IEEE MultiMedia staff for helping us produce this issue.


About the Authors

Chitra Dorai is a member of the research staff at the IBM T.J. Watson Research Center, New York, where she leads the e-learning content management and media semantics projects. Her current research focuses on developing technologies for content management and media analysis in various domains that are useful in content-based structuralization, annotation and search, and smart browsing. Dorai received a BTech from the Indian Institute of Technology, Madras, an MS from the Indian Institute of Science, Bangalore, and a PhD from Michigan State University. She's a senior member of the IEEE and a member of the ACM.
Svetha Venkatesh is the codirector for the Center of Excellence in Intelligent Operations Management and a professor at the School of Computing at Curtin University of Technology, Perth, Australia. Her research focuses on large-scale pattern recognition, image understanding, and applications of computer vision to image and video indexing and retrieval.