Issue No. 03 - July-September (2010 vol. 17)
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/MMUL.2010.65
John R. Smith, IBM Research
I have often wondered what kinds of answers are locked in my digital photo collection. It has long been said that a picture is worth a thousand words, but when the picture is our own, surely it's also worth a thousand memories. A large collection, therefore, must hold a correspondingly large number of memories. But what exactly does a collection know? Does it know how many trips I've taken? Has it learned my family's preferred vacation activities? Can it help recall special moments? Can it remind me of birthdays and anniversaries? Can it tell me my wife's favorite color?
Speaking of wives, my wife and I share a common memory block. We always forget if we were married on August 28th or 29th. This is embarrassing. But, because we have so many good memories of our wedding, we don't sweat the date. Each year I settle it by looking at the date from our photos. But what would be really great is if I could recite the color of the dress she wore on each of our anniversaries, or if I was automatically made aware that we were behind on our beach trips this summer, or if there was provable evidence I haven't been to a ballgame in a while. All of this could be possible with help from our large photo collection.
Picture taking is becoming so inexpensive and easy that people can create high-fidelity accounts of their lives through digital photos. IDC has forecast that we will take 500 billion photos in 2010 [1]. If the current growth rate continues, by 2015 an average of 200 pictures will be taken per year for each person on the planet. Given this density, our digital photo collections should be able to deliver all kinds of insight about our lives, serve as an oracle of our routines, behaviors, and preferences, and provide a digital episodic memory.
Digital photo metadata is a key part of this knowledge puzzle. Various content descriptors extracted from digital photos provide valuable information as well. Together they can indicate the who, what, when, and where for each photo—the basic building blocks for learning and pattern discovery.
Knowing the date and time of each picture is usually straightforward. Most digital cameras today embed Exchangeable Image File Format (EXIF) tags with the date and time of capture. When a digital photo is transferred to a computer, additional date and time fields are created that can serve as a backup to EXIF. Knowing the location of each photo is also becoming common. Many cameras today are GPS-enabled or use other methods of spatial localization, such as cell-tower triangulation, to create geotags. Photo-management software also now makes it easy to author geotags by dragging photos to maps. Taking the time to do this is a worthwhile investment for improving search and retrieval alone. Knowing the people captured in each photo generally requires human input. But today's software makes it easy by automatically extracting faces and learning to recognize recurring people.
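As a minimal sketch of the date-and-time fallback described above, the following Python function parses an EXIF-style timestamp (EXIF stores capture time as text in the form `YYYY:MM:DD HH:MM:SS`) and falls back to a file timestamp when the tag is missing or malformed. The function name `capture_time` and its parameters are hypothetical, and reading the raw tag out of an image file is left to whatever EXIF library is at hand.

```python
from datetime import datetime
from typing import Optional

EXIF_FORMAT = "%Y:%m:%d %H:%M:%S"  # text layout used by EXIF date/time tags


def capture_time(exif_value: Optional[str], file_mtime: float) -> datetime:
    """Return the capture time from an EXIF string, else the file time.

    exif_value: raw text of a date/time tag, or None if the tag is absent.
    file_mtime: file modification time in seconds since the epoch,
                used here as the (assumed) backup source.
    """
    if exif_value:
        try:
            return datetime.strptime(exif_value.strip(), EXIF_FORMAT)
        except ValueError:
            pass  # malformed or placeholder tag: fall through to the backup
    return datetime.fromtimestamp(file_mtime)
```

The backup is less trustworthy than the embedded tag, since file times can change when photos are copied or edited, so a real tool would probably record which source each timestamp came from.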
Having a collection know other information is more of a mixed bag. Automatic recognition of semantic content of photos across scenes, objects, people, activities, and locations is challenging, but is an active area of research in the multimedia community. Even as image-classification capabilities improve, there are already benefits to using today's technology. For example, automatically detecting various scene categories, such as city, nature, park, ball field, and beach, can be done reliably these days. And popular landmarks, such as the Statue of Liberty, can be automatically recognized.
Progress on image classification has been aided by the development of large annotated photo and video data sets that provide training data for recognizing different semantic categories. For example, the Large-Scale Concept Ontology for Multimedia (see http://www.lscom.org/) has created a taxonomy and annotated video data set for more than 1,000 semantic categories. The ImageNet project (see http://www.image-net.org/) has used crowd-sourcing to amass a large, tagged collection of more than 11 million photos for 15,000 semantic categories (WordNet synsets, see http://wordnet.princeton.edu/). At the time of this writing, ImageNet provides more than 2,000 photos with wedding-related tags, which could be useful for teaching computers to recognize wedding scenes automatically.
Knowing the who, what, when, and where for each photo can allow a large photo collection to deliver tremendous insights. For example, photos can be grouped using this information to identify events (when and where), social clusters (who and when), activities (what and where), and so on. Other detected correlations could reveal new insights and discoveries, for example, that a cluster of people (family) gets together on the same day each year (birthday party), that all trips to Asia have been in the summer, or that someone always wears blue on special occasions. It can also reveal what is missing, like not having traveled to Nova Scotia or Venice yet. However, knowing is of course only part of the equation. A smart photo collection won't be able to provide extra resources and time. Not yet at least.
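The grouping idea can be sketched in a few lines. Assuming each photo is reduced to its capture time, the Python below splits a collection into events wherever the gap between consecutive photos exceeds a threshold, then reports calendar days on which events recur across years (a possible birthday or anniversary). The six-hour gap, the function names, and the sample dates are illustrative assumptions, not values from any particular system.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def group_into_events(times, max_gap=timedelta(hours=6)):
    """Split capture times into events separated by gaps longer than max_gap."""
    events = []
    for t in sorted(times):
        if events and t - events[-1][-1] <= max_gap:
            events[-1].append(t)   # close in time: same event
        else:
            events.append([t])     # long gap (or first photo): new event
    return events


def recurring_days(events, min_years=2):
    """Return (month, day) pairs on which events start in at least min_years distinct years."""
    years_by_day = defaultdict(set)
    for event in events:
        start = event[0]
        years_by_day[(start.month, start.day)].add(start.year)
    return sorted(day for day, years in years_by_day.items() if len(years) >= min_years)


# Made-up sample: photos on August 28 in three years, plus one unrelated outing.
photos = [
    datetime(2008, 8, 28, 18, 0), datetime(2008, 8, 28, 19, 30),
    datetime(2009, 8, 28, 12, 15),
    datetime(2010, 8, 28, 20, 5), datetime(2010, 8, 28, 21, 0),
    datetime(2010, 7, 4, 10, 0),
]
events = group_into_events(photos)
print(len(events))             # 4
print(recurring_days(events))  # [(8, 28)]
```

Adding location or recognized faces to each record would allow the same pattern to surface social clusters or activities, at the cost of needing a similarity measure richer than a simple time gap.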
Contact John R. Smith at firstname.lastname@example.org.