University of Florence, Italy
Pages 18-21
For many research scientists, multimedia still denotes a specific area of information technology research, concerned with the processing, management, distribution, and consumption of composite information available through different media. In a very short time, however, we can expect the term multimedia to inform all research concerned with computer systems and their applications. The availability of digital information through different media, its processing and combination into more meaningful documents, and its distribution through high-speed networks will come to be taken for granted rather than seen as a special capability of computer systems.
The Internet, on the other hand, is a new and powerful communication medium. It is changing how we retrieve information and how we carry out traditional activities like commercial transactions, contacting people, sending data, making presentations, or viewing TV. The Internet provides the natural infrastructure for multimedia and is its principal vehicle of diffusion.
Multimedia will soon mature as a research field. Both hardware and software now make it possible to implement multimedia research achievements in real products and applications. Traditional research lines in signal processing, data communication, computer science, and engineering are now challenged by the new requirements raised by this model. Quality of service (QoS), cooperative processing, retrieval by content, standards, multimodal interaction, and usability are just a few of the major research lines.
Operating systems are still designed around time-sharing workloads. Real-time distributed multimedia applications such as video on demand, which deal with continuous media, instead require new architectures that use the network channel efficiently, organize objects on disk effectively, and provide QoS guarantees such as minimum network and disk bandwidth. New research focuses on resource reservation, disk scheduling algorithms, and effective indexing of multimedia documents to support different real-time requirements. Distributed object platforms like the Common Object Request Broker Architecture (Corba), Java, or the Distributed Component Object Model (DCOM) have already been developed.
Retrieval by content from large archives is one of the research areas that has received the most attention. It includes several distinct subjects of investigation, such as annotation, querying, similarity matching, and visualization. Content is expressed either in terms of perceptual features or in terms of semantic primitives and their combinations. Sound, image, and video analysis and pattern recognition can automatically extract the most characteristic features and save time in annotating documents. The ultimate goal is to retrieve sound, images, or video whose content resembles one or several user-provided examples.
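A minimal sketch of this kind of example-based matching, using a global color histogram and the histogram-intersection measure; the function names, bin count, and ranking scheme here are illustrative assumptions, not taken from any specific system:

```python
import numpy as np

def color_histogram(image, bins=8):
    """Quantize an RGB image (H x W x 3, values 0-255) into a joint
    color histogram, normalized so its entries sum to 1."""
    quantized = (image // (256 // bins)).reshape(-1, 3).astype(int)
    index = (quantized[:, 0] * bins + quantized[:, 1]) * bins + quantized[:, 2]
    hist = np.bincount(index, minlength=bins ** 3).astype(float)
    return hist / hist.sum()

def histogram_intersection(h1, h2):
    """Swain-Ballard histogram intersection: 1.0 for identical color
    distributions, 0.0 for disjoint ones."""
    return np.minimum(h1, h2).sum()

def retrieve(query, database, top_k=5):
    """Rank database images by color similarity to a query example."""
    q = color_histogram(query)
    scores = [(name, histogram_intersection(q, color_histogram(img)))
              for name, img in database.items()]
    return sorted(scores, key=lambda s: -s[1])[:top_k]
```

Real systems combine many such features (texture, shape, motion) and index them for sublinear search, but the query-by-example principle is the same.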
The true novelty of this research subject is that it suggests a new processing model where processing doesn't fully occur on the computer system, but is realized through an interaction loop where the user's knowledge and feedback are valuable elements in obtaining a satisfactory solution. Since the user is in the processing loop, the true challenge lies in developing new support for effective human-computer dialog. This shifts interest from the problem of putting intelligence in the system, as in traditional recognition systems, to interface design and effective indexing and modeling of users' similarity perception and cognition. Indexing on the World Wide Web poses additional problems concerned with the development of metadata for efficient retrieval and filtering.
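One common way to realize this interaction loop is relevance feedback: the query representation is refined from the examples the user marks as relevant or non-relevant at each iteration. A minimal sketch, assuming documents are feature vectors and using a Rocchio-style update (the weights are conventional defaults, not from the source):

```python
import numpy as np

def rocchio_update(query, relevant, nonrelevant,
                   alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query vector toward examples the user marked relevant
    and away from those marked non-relevant (Rocchio's formula)."""
    q = alpha * np.asarray(query, dtype=float)
    if len(relevant):
        q += beta * np.mean(relevant, axis=0)
    if len(nonrelevant):
        q -= gamma * np.mean(nonrelevant, axis=0)
    return q
```

Each pass through the loop retrieves with the current query, collects the user's judgments, and calls the update; the user's perception of similarity, not a fixed system model, steers the search.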
Access to large multimedia archives has also raised the problem of copyright infringement and of digital watermarking of audio, video, and images. Hundreds of solutions are now available, exploiting different techniques such as signal-dependent embedding, restricted keys, watermarks embedded in transform domains, and perceptual masking.
MPEG-4, MPEG-7, and JPEG 2000 open new opportunities for standardizing the transmission, access, and compression of multimedia objects. The Moving Picture Experts Group's MPEG-4 employs object coding and concurrency for efficient transmission and presentation of multimedia objects and will become a real operating standard (see the Standards article in this issue). MPEG-7 aims to standardize multimedia content description interfaces; it will provide descriptors that let users efficiently search, browse, and retrieve multimedia data. The Joint Photographic Experts Group's JPEG 2000 offers interesting features like progressive transmission by pixel accuracy and by resolution for different multimedia applications. Additional standards are also needed that focus on user-centered requirements, such as compression of video streams that retains only the parts of the stream in which the user is interested.
Models and tools for multimedia authoring are also under investigation. Well-established multimedia document models like Hypertext Markup Language (HTML), the Multimedia and Hypermedia Experts Group (MHEG) standard, and so on support the spatial and temporal aspects of multimedia presentation and the modeling of interaction. Advanced multimedia applications require tools that can reuse multimedia content in different presentations and adapt to user preferences.
Multimedia will presumably make the human-computer experience closer to the current real-world experience, where multiple senses and cognitive activity form the basis for decisions and behaviors. Because they can support multimodal interaction, interfaces for multimedia applications have the further challenge of integrating different media to convey information. They can also integrate other technologies to make the interaction experience more natural. Through artificial intelligence techniques, interfaces can learn about users, their tasks and their objectives, and customize the interaction to their needs.
Computer vision and pattern recognition can play a key role in multimedia interfaces. Cameras connected to the computer can capture gestures, which are then interpreted as commands to applications, opening new frontiers for natural interaction and for computer use by disabled people. Such systems require hand, head, and eye tracking; trajectory generation; key gesture recognition; and the definition of a task-oriented vocabulary of gestures.
Expression detection and tracking is another promising area of investigation. Expressions can be captured either from cameras or from recognition of human speech and applied to computer agents that play character roles. Augmented reality can produce new displays of information that merge synthetic sensory information into a user's perception of a 3D environment, producing both visual and haptic augmentation. Collaborative environments have become possible and represent an important area of research and application. They will eventually change the way multiple users work together. Visual communication can be enhanced with the expression of emotional information—through the reproduction of gestures and facial expressions—and with collaborative annotations.
Usability is a fundamental concern in multimedia applications design. Generally, usability characterizes an application as being usable by the intended users, with reference to intended functions and use. With multimedia applications, usability requires more complex analysis under new viewpoints and needs new measurements of efficiency—in terms of time required to complete a task—and productivity—in terms of pleasure and satisfaction. Structure of the Web site and navigation design are important issues of investigation. The challenge for Web designers is to structure visual and auditory stimuli to maximize visitors' ability to construct meaningful conceptual and navigational paths.
The IEEE International Conference on Multimedia Computing and Systems, which was held in Florence, Italy, from 7 to 11 June 1999, was one of the major events on multimedia in 1999. This issue includes some of the most innovative presentations at the conference.
The five articles in this issue cover different application contexts, from teleconferencing and cooperative working to audio, image, and video retrieval by content. Almost all of these contributions focus on the natural cooperation between the system and the user.
Kiyokawa, Takemura, and Yokoya from the Nara Institute of Science and Technology address the design of collaborative environments for multiple users who share the same task. In particular, the article discusses the problem of rapid prototyping of 3D objects. The authors take on the problem of combining affinity with human perception and immersive 3D graphics modelers that support the creation of both geometric appearance and behaviors. Seamless design represents a new means to provide this capability in an interactive and collaborative way. Multiple users share a virtual workspace and can see their partners as they are, using see-through head-mounted displays (in the see-through mode), or as realistic avatars. Partners' viewing directions appear as line segments coming out of their heads. All users share the same virtual-workspace coordinates, allowing precise calibration and collaboration.
Valente and Dugelay of the Eurécom Institute in Sophia-Antipolis, France, address the subject of face cloning for virtual teleconferencing systems. Typically, in such systems participants are provided with a common meeting space and have their own point of view depending on their position in the virtual space. Participants' faces are cloned to represent their expressions during the session. In this system, facial animation is obtained by morphing a wire-frame model over a number of predefined configurations of facial expressions, once the performer's facial expression has been related to the most similar example. Unlike previous approaches, the authors enforce realism by using person-dependent textured face models and a visual feedback loop that makes analysis and synthesis cooperate, solving the problems of lighting, scale, and geometric deformation implied by head motion. The predicted appearance of the real face is generated by the synthetic face model based on Kalman filter estimation. Patterns of contrasted facial features like the eyes or mouth are extracted from the synthesized image and matched with the real user's facial features. The 2D coordinates of the matched positions are then used to estimate the head position, closing the visual loop.
Pachet, Roy, and Cazaly present a joint research project by Sony France and the University of Paris VI on a relatively unexplored subject in content-based retrieval—music selection. The article shows how human preferences can be exploited to improve the quality of a content-based retrieval system. Instead of displaying sets of titles considered in isolation, the authors propose building sequences of music titles (music programs) that satisfy particular properties. This exploits the fact that certain titles create atmospheres that raise expectations for other titles. To achieve this, the authors consider the user's goals of repetition and surprise and the content provider's goal of optimal exploitation of the catalog. They created a taxonomy of styles together with technical and content attributes of individual titles. Although attributes are currently input by hand, some can be extracted automatically from input signals while others can be inferred. Coherent sequences of music titles are extracted by solving a combinatorial pattern generation problem.
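The combinatorial flavor of such music-program generation can be sketched as a small backtracking search. The catalog, the no-repetition rule, and the limited same-style-run constraint below are simplified stand-ins for the richer taxonomy and properties the authors actually use:

```python
def build_program(catalog, length, max_same_style_run=2):
    """Backtracking search for a music program of the given length:
    no title repeats, and no more than max_same_style_run consecutive
    titles of the same style. catalog maps title -> style."""
    def ok(seq, candidate):
        if candidate in seq:          # each title used at most once
            return False
        run = seq[-max_same_style_run:]
        if len(run) == max_same_style_run and all(
                catalog[t] == catalog[candidate] for t in run):
            return False              # would extend a same-style run too far
        return True

    def search(seq):
        if len(seq) == length:
            return seq
        for title in catalog:
            if ok(seq, title):
                result = search(seq + [title])
                if result:
                    return result
        return None                   # dead end: backtrack

    return search([])
```

Real catalogs make exhaustive search infeasible, which is why the authors cast the task as a constraint-satisfaction problem and rely on dedicated solving techniques.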
Assfalg and Pala from the University of Florence, Italy, suggest using interactive 3D graphic interfaces for effective access by content to databases of real-world landscape images. In their article, they present the possibility of using the photographer metaphor to define the examples used as references for queries by content. The user navigates freely in a 3D virtual space and takes pictures of its content through a virtual camera. This permits changing the view, selecting content, and arbitrarily including details in the camera's field. The resulting images are matched against database images by the local similarity of colors and textures. Since the system's effectiveness depends on the richness of the details in the virtual world and their adherence to the details in the images, textures and colors of retrieved images can be extracted and attached to the graphic elements of the scene. The scene is populated interactively from a database of 3D models.
The article by Christel, Olligschlaeger, and Huang of Carnegie Mellon University represents an interesting contribution to the development of new interfaces for accessing video libraries of thousands of segments. As a development of the Informedia project, the authors exploit the richness of information contained in a video stream to spare the user from traversing a huge list of segments returned by a query. Geographic maps are built that localize the subject of the video in countries, cities, and places around the world. Video segments are represented through maps that serve both as a means of presenting clustered video and as a way to access the video library through spatial queries. Words that refer to geographic locations are automatically extracted from the video's audio and text. Although not intended for general use, this solution can greatly improve the effectiveness of retrieval for news, documentaries, and sports, and suggests the possibility of mixed-modal queries.
I hope that this selection helps researchers to develop new proposals and advanced implementations. I am grateful to William Grosky for his support in the development of this issue, the authors for their effort and help, and all the reviewers for their collaboration.