IBM Research
Pages: pp. 2-3
Abstract—"Bridging the semantic gap" is an expression often used to describe work on multimedia content understanding. At best, research today is bridging a semantic gap, of which there are many. Better characterizing the overall size and shape of the semantic space for multimedia will help define what is on the other side and ensure that we make progress on bridging the gap.
Keywords—Multimedia, semantics, ontology, benchmarks, data sets
"Mind the gap." Few safety warnings are as important. After all, what is worse than falling into an abyss between here and there? Perhaps only getting run over by a train after doing so. Still, decades of overuse have turned what was once a real call for attention into a watered-down stock phrase. This describes perfectly another faded saying related to gaps: "bridging the semantic gap," which has worn out its welcome in the multimedia community. This is the real gap to mind.
It started with a noble cause of realizing meaning from obscurity or, in essence, finding light from darkness, by simply bridging the gap between them. Because multimedia consists of a myriad of low-level signals across audio and visual modalities, it is natural to want them to be understood or retrieved by computers. Today, multimedia is readily represented, stored, and transmitted, but extracting meaning by machine is difficult. Hence, the gap.
Things go wrong when this multimedia semantic gap problem is turned into a real proposition or, worse, when it is used to describe a working solution. A simple example is when the development of a small set of visual semantic classifiers from a collection of training images or video data supposedly bridges the semantic gap [1]. Similarly, work on learning to associate textual tags with images [2], inferring a user's semantic intent for visual content-based search [3], and extracting and matching visual semantic concepts for video retrieval [4] is too often described today as bridging the semantic gap. At best, this and other related work in multimedia research bridges small, individual gaps.
The path to redemption begins with calling each what it is: bridging a semantic gap. Progress shall ensue by better characterizing the overall multimedia semantic gap problem and calling out the separate subproblems. Clearly, bridging the semantic gap includes challenges related to audio-visual feature extraction, machine learning, concept detection, multimedia retrieval, ontologies, and context exploitation. Many of these individual topics are the focus of multimedia research today. However, of these, significant new work is most needed to understand where the ultimate bridge should lead [5]. Although it is bad to have "a bridge too far," it is worse to have a "bridge to nowhere." And there has been too little progress on characterizing the required size and shape of the semantic spaces for describing multimedia content.
The Large-Scale Concept Ontology for Multimedia (LSCOM) attempts to fill out some of the semantic space by defining thousands of semantic concepts for news video [6]. Numerous other efforts have created small, academically oriented annotated image data sets related to Web photos, faces, human actions, multimedia events, and so on (see the sidebar for details). Meanwhile, more traditional resources such as the US Library of Congress Thesaurus for Graphic Materials I (TGM I), which provides approximately 7,000 subject terms for cataloging visual works by libraries, are a poor fit for describing scenes in today's digital photos or video. As individual efforts, each is inadequate for describing all the aspects in which audio-visual material could be of interest across facets related to objects, places, scenes, activities, events, and people. However, all these resources can be put together and, with some effort on harmonization, can create the beginning of a practical multimedia semantic ontology.
If we can make progress on this, perhaps then bridging the semantic gap will return to vogue. Only next time we'll mean it.
ImageNet – Tens of millions of images indexed according to tens of thousands of WordNet synsets (www.image-net.org)
Labeled Faces in the Wild – 13,000 face images collected from the Web, including 1,680 named people with two or more images each (http://vis-www.cs.umass.edu/lfw)
TinyImages – Millions of photos corresponding to tens of thousands of English nouns (http://groups.csail.mit.edu/vision/TinyImages)
International Association for Pattern Recognition (IAPR) TC-12 Benchmark – Tens of thousands of natural images depicting sports, actions, people, animals, cities, and landscapes (www.imageclef.org/photodata)
Large-Scale Concept Ontology for Multimedia – 2,000 semantic concepts related to events, objects, locations, people, and programs (www.lscom.org)
Human Motion Database – Large video database for human motion recognition (http://serre-lab.clps.brown.edu/resources/HMDB/index.htm)
Hollywood – Human actions in movies (www.irisa.fr/vista/actions)
Please welcome Ian Burnett to the IEEE MultiMedia editorial board. He is currently a professor and head of school in the School of Electrical and Computer Engineering (SECE) at RMIT University, Melbourne, Australia. Before that, he was a lecturer and associate professor at the University of Wollongong, School of Electrical, Computer, and Telecommunications Engineering (SECTE). His research interests focus on multimedia systems, media content description, multimedia semantics (such as MPEG-7 and tagging), the MPEG-21 Multimedia Framework, media delivery and adaptation, social media, and spatial audio and 3D audio reproduction. He has published more than 150 journal and conference papers in these fields and serves on the program committees of major conferences in multimedia, signal processing, and information retrieval. Burnett has been a significant contributing member of the International Organization for Standardization MPEG Working Group, making contributions to MPEG-4 and MPEG-7 Audio, MPEG-21, MPEG-A, and MPEG-B, and has served as co-project editor for several MPEG standards. He was also chair of the MPEG Multimedia Description Schemes (MDS) subgroup from 2004 to 2007. Burnett has a PhD in electrical and electronic engineering from the University of Bath, UK. Contact him at email@example.com or visit his website, www.rmit.edu.au/browse;ID=6ib8o0jg1k2t, for a more complete curriculum vitae and a list of publications.