Guest Editor's Introduction: Integrating and Using Large Databases of Text, Images, Video, and Audio
• What is the best way to pose multimedia queries to a system? This question has implications for both mixed media and human factors. Can users pose queries more effectively with multimedia than with the existing text-only interfaces?
• Can we do data mining on mixed-media databases? How can we exploit advances in data mining?
• What kind of information can we learn or extract from multimedia databases? Interesting opportunities exist for research on cross-media training and learning. Can we obtain better performance on a task by leveraging information from another source? For example, we want to be able to correlate speech segments to text, or speech to images, faces, or specific persons.
• How can we deal with more data? Too many techniques look good on paper for small data sets but do not scale up to larger, real-life databases. For example, if comparing the similarity of two faces takes one-tenth of a second and the process is linear, the system will not scale beyond a few thousand faces in an application. A second, related issue concerns the quality of the process: the retrieval precision of an image-matching process might be quite good for 500 images, but for a 500,000-image database, the results could be unusable.
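The scaling arithmetic above can be made concrete with a back-of-envelope sketch. The per-comparison cost is the one-tenth-of-a-second figure from the text; the database sizes are illustrative assumptions:

```python
# Back-of-envelope sketch: query time for a linear-scan face matcher,
# assuming each face-to-face comparison costs 0.1 s (as in the text).
COMPARISON_TIME_S = 0.1

def linear_scan_query_time(n_faces: int) -> float:
    """Seconds to compare one probe face against every face in the database."""
    return n_faces * COMPARISON_TIME_S

for n in (500, 5_000, 500_000):
    t = linear_scan_query_time(n)
    print(f"{n:>7} faces -> {t:>8.0f} s per query ({t / 3600:.2f} h)")
```

At 5,000 faces a single query already takes over eight minutes, and at 500,000 faces it takes nearly 14 hours, which is why linear techniques that look fine on small test sets fail on real-life databases.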
• How can a mixed-media database be processed to allow query by concept, rather than the query by keyword, pixel, or image statistics that we currently use? For example, if the user gives a sample image of a football game (human-like blobs in a green background), current systems will find images with similar amounts of green, and so on. The true goal is to find all images related to the concept of football, even if the colors and shapes are completely different.
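The "similar amounts of green" behavior described above can be illustrated with a minimal color-histogram matcher, the kind of image-statistics query current systems rely on. The pixel data, bin count, and intersection measure here are illustrative assumptions, not any particular system's method:

```python
# Minimal sketch of query-by-image-statistics: rank images by the
# similarity of their quantized RGB color histograms. Such a matcher
# scores any green-dominated image highly for a "football field" query,
# regardless of whether the image has anything to do with football.
from collections import Counter

def color_histogram(pixels, bins=4):
    """Quantize (r, g, b) pixels into bins**3 buckets; return normalized counts."""
    step = 256 // bins
    counts = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    total = len(pixels)
    return {bucket: n / total for bucket, n in counts.items()}

def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; 1.0 means identical color distributions."""
    return sum(min(h1.get(k, 0.0), h2.get(k, 0.0)) for k in set(h1) | set(h2))

# Toy "images" as flat pixel lists (assumed data):
query = [(30, 180, 40)] * 95 + [(200, 200, 200)] * 5   # green blobs: "football"
grass = [(40, 170, 50)] * 90 + [(100, 100, 100)] * 10  # unrelated green scene
ocean = [(20, 60, 200)] * 100                           # blue scene

hq = color_histogram(query)
print(histogram_intersection(hq, color_histogram(grass)))  # high: same colors
print(histogram_intersection(hq, color_histogram(ocean)))  # low: different colors
```

The unrelated green scene outscores the blue one purely on color statistics, which is exactly the gap between such matching and true query by concept.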