Issue No.02 - April/June (2005 vol.12)
pp: c2, 1
Published by the IEEE Computer Society
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/MMUL.2005.32
The field of multimedia is about cross-cutting analysis of digital objects from all media types. Looking at recent developments, we can see that the multimedia community has come a long way in achieving this goal. For example, we now have algorithms that operate in real time for video analysis and understanding, including detection of objects, recognition of events, and identification of people along with their ethnicity, gender, and age.
Such techniques succeed because a range of other multimedia information from a variety of sources (such as the context, environment, sensory devices, and other applications) is used to infer semantic details of the video stream contents. The drive behind these advances comes from several sources, but the fuel for most of these developments comes from the entertainment and security industries.
Considering areas for success
Let's consider the success behind the security industry. By correlating information extracted from scanners, points of sale, voice recorders, computer files, and automated transactions (possibly through a predefined ontology), we can generate a better "picture" of what's contained in a video file.
In multimodal biometrics systems, particularly those that operate under synchronous multimodality, the centerpiece is the ability to integrate information from different media types. On that note, we should acknowledge the recognition techniques that have raised the accuracy levels of biometrics-based identification systems. Although not as accurate as laboratory-based techniques, facial scans, voice patterns, and fingerprints have become increasingly effective.
Meanwhile, for laboratory-based techniques—such as DNA—a controlled lab environment is essential. Although DNA is just another biometric, it does present some major differences. For DNA testing, an actual sample is needed, instead of an impression (such as a picture or a recording). This makes DNA processing an invasive procedure. Another major difference is that it doesn't employ feature extraction and template matching, which are common components of multimedia object processing. Instead, DNA matching represents the comparison of an actual sample with those in the database.
We have numerous other application areas where our work may make a substantial impact. Some are obvious (such as asset management and medicine) and some are new discoveries. For example, we could automate the task of patent examination to a great extent by correlating text, optical character recognition (OCR) data, diagrams, images, numbering, and record-based (meta)data. Because patent disclosures follow a specific format, fusing invention information is a relatively simple task.
Structurally, a patent write-up contains a barcode at the top, followed by some record-based data, an abstract, the drawings, and the textual description of the invention. The text part also has an overall organization. Background information is followed by the description of the drawing, then a description of the preferred embodiment, and finally the claims.
Drawings may be photos, diagrams, flowcharts, signatures, or hand-written notes. These may contain characters, words, numbers, or special symbols. Text, as well as information extracted from images and diagrams, may be structured by means of an ontology that's most applicable to the area of invention.
The correlation between text and drawings comes from many sources, most notably the numbering of captions, which provides correspondence to the diagrams, and the numbers appearing in the textual descriptions, which correspond to the numbered components of the drawings. For example, the textual information about a diagram may be a set of itemized descriptions that we'd find in a section called "Description of the Preferred Embodiment."
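To make the numbering idea concrete, here is a minimal Python sketch of how numerals in itemized descriptions might be linked to labeled components in the drawings. It is an illustration only, not a method described in this column: the names Drawing, reference_numerals, and correlate are hypothetical, the drawing labels are assumed to have been recovered already (for instance by OCR), and the sample sentences are invented.

```python
import re
from dataclasses import dataclass, field

# Hypothetical pattern for figure references such as "FIG. 1" or "Figure 2A".
# These are stripped first so the figure number is not mistaken for a component numeral.
FIGURE_REF = re.compile(r"\bFIG(?:URE|S)?\.?\s*\d+[A-Za-z]?", re.IGNORECASE)
NUMERAL = re.compile(r"\b\d+\b")


@dataclass
class Drawing:
    """One figure of a disclosure; labels are the numerals found in it (assumed given)."""
    figure_id: str
    labels: set = field(default_factory=set)


def reference_numerals(item: str) -> list:
    """Return the sorted, de-duplicated component numerals mentioned in one description item."""
    without_figures = FIGURE_REF.sub("", item)
    return sorted({int(m.group()) for m in NUMERAL.finditer(without_figures)})


def correlate(items: list, drawings: list) -> dict:
    """Link each numeral to the description items and the figures in which it appears."""
    links = {}
    for item in items:
        for numeral in reference_numerals(item):
            links.setdefault(numeral, {"text": [], "figures": []})["text"].append(item)
    for numeral, entry in links.items():
        entry["figures"] = [d.figure_id for d in drawings if numeral in d.labels]
    return links


# Invented sample data, for illustration only.
items = [
    "FIG. 1 shows a housing 10 enclosing a sensor 12.",
    "The sensor 12 is coupled to a controller 14.",
]
drawings = [Drawing("FIG. 1", {10, 12}), Drawing("FIG. 2", {12, 14})]

links = correlate(items, drawings)
for numeral in sorted(links):
    print(numeral, links[numeral]["figures"])
```

A real system would also need to cope with OCR errors, numerals that are not reference numerals (dimensions, dates, claim numbers), and labels such as "12a", all of which this sketch ignores.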
Given such an ontological representation of the invention information, we could examine the patent application against prior art in an automated way. We could then search, recognize, and retrieve patents and invention disclosures based on metadata, text, and drawings. We could find drawings, images, graphs, and text elements by concept, similarity, sketch, colors, incorporated shapes, and other visual attributes.
At the moment, patent examiners do all of this manually, and they've mastered this art through many years of experience. Imagine the savings in time and effort when someone finally automates the process of examining patent disclosures by correlating components of different media types.
To my colleagues in the field of multimedia, I say keep up the good work. Your research impacts these sample markets and many others. The best is yet to come! As always, your notes and ideas are welcome. You may send them to me at firstname.lastname@example.org.
Editorial board updates
Now I'd like to switch our focus and discuss some of the changes coming our way at IEEE MultiMedia. After many years of dedicated service, three of our editorial board members are retiring. Nikolaos Bourbakis, Amit Sheth, and HongJiang Zhang have completed their terms with the board, having served with dedication and enthusiasm. I wish these colleagues the best in their current and new endeavors.
We also have two new additions to our editorial board: Daniel Ellis and Jane Hunter. As you can see from their short biographies, they bring a great deal of experience. I look forward to continuing the quality and improvement of this magazine with their help.
Daniel Ellis is an assistant professor of electrical engineering at Columbia University. His research interests are in signal processing and machine learning for analysis of general audio and music, automatic speech recognition, computational models of human sound processing and organization, and visualization and browsing tools for audio and speech databases. Ellis received his MS and PhD degrees from the Massachusetts Institute of Technology, both of which were in electrical engineering. He's a member of the IEEE, the Acoustical Society of America, the International Speech Communications Association, and the Audio Engineering Society.
Jane Hunter is a distinguished research fellow at the Distributed Systems Technology (DSTC) Cooperative Research Center at the University of Queensland, Australia. Her research interests are in multimedia metadata models/ontologies, standards, and schemas; indexing, search, browse, retrieval, and filtering tools for multimedia; and semantic interoperability, digital libraries, and collections management for cultural, educational, and scientific institutions. Hunter has a BEng in metallurgy and a BSc in materials from the University of Queensland, as well as a PhD in computer science from Cambridge University, UK. She's on the editorial boards of MPEG-7 ISO/IEC 15938-2 Information Technology - Multimedia Content Description Interface - Part 2: Description Definition Language and the Journal of Web Semantics.