Video-Based Detection Methods: What Can We Know from Watching You?
Guest Editor's Introduction • Dorée Duncan Seligmann • April 2012
It's very likely that you've been on camera from the moment you left home today -- recorded as you rode in the elevator, walked on the street, bought coffee at the local deli, withdrew money, and moved through your office building. While you're at work, cameras might be recording the events in your home, capturing the nanny's interaction with your children and when your cat drinks from her water bowl. Your image is part of the crowd scene in the camera advertisement on a billboard in Times Square, passersby are looking at you on the video display at an electronics store, the game system in your living room is analyzing your gestures, and your face is being analyzed as you go through security at the airport.
This Was Not Always So
Cameras were first installed on city streets in the 1960s, and today's prevalent security systems in buildings and public places first took root in the 1970s when two technologies were combined: closed-circuit television (CCTV) systems and videocassette recorders (VCRs). Expensive systems were first put to use by governments and law enforcement to monitor buildings, public areas, railways, highways, and sports arenas. Soon after, people began to expect to find such video surveillance systems in even the most modest local stores. Advances in several technologies, including the digital multiplexer, digital video recording, and camera microchips, along with the Internet and wireless networking, would lead to a vast expansion of video-based solutions ranging in sophistication and cost. By the 1990s, home security systems allowed users to remotely control a camera's zoom and orientation through a web interface, and the systems themselves could be set to detect changes in the scene or in a particular region and send out notifications by phone or email. More expensive systems used face recognition to detect unknown people using computer systems. What began with manually monitored video displays has evolved into a variety of systems with automated processes that scan multiple video streams, detect events or objects of interest, and act on them. For example, some smart home systems can alert services if a person falls down or stops eating, and security systems can alert authorities if a guard is no longer in view.
New research is going even further, developing methods to predict various types of undesirable events and help prevent them. With cameras pointed at us all the time, from those on our smartphones, tablets, and laptops to the game systems in our homes and all the cameras in our environment, we may be the subject of a lot of analysis that can be used to our benefit. For example, smarter home systems might detect when we’re about to fall down and try to intervene. Such systems can not only differentiate between people and objects but also recognize patterns of behavior — so, rather than just recognizing faces, they can recognize subtle revelatory facial expressions. These methods have the potential to detect actions in public places that are precursors to crimes, a poker player’s “tell,” or a job applicant’s lie or false intent. The combined efforts of several disciplines have led us to this point, and still more are involved in making better sense and use of the video images that are captured.
In this Computing Now special theme section, the collection of articles from the IEEE Computer Society Digital Library highlights how interdisciplinary the broad research area of detection through video really is. We start off with two general articles that serve as introductions to the state of the art and point toward what is yet to come.
The first article, "Affect Detection: An Interdisciplinary Review of Models, Methods, and Their Applications," is a survey in which authors Rafael Calvo and Sidney D'Mello explore what can be detected by monitoring a person's expression, tone, and actions -- some of which can be detected through the use of video.
Our second article, "Multimedia Analysis + Visual Analytics = Multimedia Analytics," is fashioned as a tutorial. In it, Nancy Chinchor and her colleagues focus (as the title suggests) on the potential of combining multimedia analysis techniques with visual analytics to deal with large data sets from multiple sources to solve new problems in multimedia analytics.
From there, we move to a sampling of recent work in detection that applies some of the techniques described in the earlier articles. We begin with two pieces related to video surveillance: "Surveillance-Oriented Event Detection in Video Streams," by Claudio Piciarelli and Gian Luca Foresti, and "Identifying Rare and Subtle Behaviors: A Weakly Supervised Joint Topic Model," by Timothy Hospedales and his colleagues. Both tackle the problem of detecting unusual events, which by definition prohibits the use of much a priori knowledge (such as what characterizes those events or what they look like) and is subject to data sparsity. These works illustrate very different approaches. In the first, the authors apply a method that detects anomalous trajectories -- for example, snipers in a window, people exchanging baggage in a parking lot, and drivers making illegal U-turns. In the second, a data set of traffic footage is subjected to a weakly supervised learning and classification method to classify events such as U-turns and near collisions.
It is also important to be able to evaluate the detection systems we will come to rely on. I recently test-drove the new Volvo S60 in New York City. To show off the car's new pedestrian-detection capabilities, the dealership had set up a test course on its rooftop. A plastic dummy was placed at the edge of the roof, and I was instructed to drive the car steadily toward it from the other end of the roof without applying the brakes. Sure enough, the car detected the "pedestrian" and stopped of its own accord rather than continuing over the edge of the building. In "Pedestrian Detection: An Evaluation of the State of the Art," Piotr Dollár and his colleagues present their evaluation of 16 pretrained pedestrian detectors using their methodology on an extensive data set they've compiled.
The last two articles discuss the use of detection to predict (and, hopefully, eventually prevent) undesirable outcomes. In "Automatically Analyzing Facial-Feature Movements to Identify Human Errors," Maria Jabon, Sun Joo Ahn, and Jeremy Bailenson describe the results of their performance-prediction methods, which link facial expressions to human performance. Using footage of people fitting screws into holes, they extracted facial points and evaluated them over time to identify the most valuable pre-error intervals. In "Facial-Expression Analysis for Predicting Unsafe Driving Behavior Activity," Maria Jabon and her colleagues apply similar techniques to active driver safety systems by recognizing facial expressions specifically related to driving behavior that could lead to accidents.
This is, of course, a small sampling of detection methods with a focus on video. We're just at the start of the wave of what's to come, which is all the more exciting because of the massive amounts of rich data that will be available. Much more can and will be monitored about all of our movements -- as reported through various sensors, our mobile phones, and even our status and location postings on social media. Such data will provide more insight for detection and prediction, which will, in turn, feed into new applications. At the same time, there will be new ways to display what is detected, such as augmented-reality systems that incorporate the results of real-time analytics in graphics overlaid on what we see. This is a good time for interdisciplinary collaboration: future systems will depend on the combination of a multitude of technologies from a wide variety of areas, and the best analyses will likely come from bringing together lots of data of very different types.
Dorée Duncan Seligmann is director of collaborative applications research at Avaya Labs. Her research interests include social media, collaborative systems, online analytics, rich media, communication-enabled social networks, context-aware communications, intelligent user interfaces, augmented reality, and knowledge-based graphics. Seligmann has a PhD in Computer Graphics from Columbia University and holds more than 40 patents. She is associate EIC of IEEE MultiMedia and Computing Now, as well as a member of the IEEE Computer Society and ACM. Contact her at firstname.lastname@example.org.