Issue No. 01 - January-March (2012 vol. 19)
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/MMUL.2012.9
John R. Smith, IBM Research
"Mind the gap." Few safety warnings are as important. After all, what is worse than falling into an abyss between here and there? Perhaps only getting run over by a train after doing so. Still, decades of overuse have turned what was once a real call for attention into a watered-down stock phrase. This describes perfectly another faded saying related to gaps: "bridging the semantic gap," which has worn out its welcome in the multimedia community. This is the real gap to mind.
It started with a noble cause of realizing meaning from obscurity or, in essence, finding light from darkness, by simply bridging the gap between them. Because multimedia consists of a myriad of low-level signals across audio and visual modalities, it is natural to want them to be understood or retrieved by computers. Today, multimedia is readily represented, stored, and transmitted, but extracting meaning by machine is difficult. Hence, the gap.
Things go wrong when this multimedia semantic gap problem is turned into a real proposition or, worse, when it is used to describe a working solution. A simple example is when the development of a small set of visual semantic classifiers from a collection of training images or video data supposedly bridges the semantic gap [1]. Similarly, work on learning to associate textual tags with images [2], inferring a user's semantic intent for visual content-based search [3], and extracting and matching visual semantic concepts for video retrieval [4] is too often described today as bridging the semantic gap. At best, this and other related work in multimedia research bridges small, individual gaps.
The path to redemption begins with calling each what it is: bridging a semantic gap. Progress shall ensue by better characterizing the overall multimedia semantic gap problem and calling out the separate subproblems. Clearly, bridging the semantic gap includes challenges related to audio-visual feature extraction, machine learning, concept detection, multimedia retrieval, ontologies, and context exploitation. Many of these individual topics are the focus of multimedia research today. However, of these, significant new work is most needed to understand where the ultimate bridge should lead [5]. Although it is bad to have "a bridge too far," it is worse to have a "bridge to nowhere." And there has been too little progress on characterizing the required size and shape of the semantic spaces for describing multimedia content.
The Large-Scale Concept Ontology for Multimedia (LSCOM) attempts to fill out some of the semantic space by defining thousands of semantic concepts for news video [6]. Numerous other efforts have created small, academically oriented annotated image data sets related to Web photos, faces, human actions, multimedia events, and so on (see the sidebar for details). More traditional resources, such as the US Library of Congress Thesaurus for Graphic Materials I (TGM I), which provides approximately 7,000 subject terms for cataloging visual works by libraries, are a poor fit for describing scenes in today's digital photos or video. Taken individually, each of these efforts is inadequate for describing all the aspects in which audio-visual material could be of interest across facets related to objects, places, scenes, activities, events, and people. However, all these resources can be put together and, with some effort on harmonization, can create the beginning of a practical multimedia semantic ontology.
If we can make progress on this, perhaps then bridging the semantic gap will return to vogue. Only next time we'll mean it.
John R. Smith is a senior manager of Intelligent Information Management at IBM T.J. Watson Research Center. Contact him at firstname.lastname@example.org.