Pages: pp. 5-8
Video clips represent a fast-growing part of the Web, but potential viewers face a daunting problem: A wide range of video clips is useful only if you have an effective way to search them. Video search, however, is proving a tougher technology problem to crack than text search, for a variety of reasons.
Today, most video search utilizes text-based keywords or metadata that's supplied with user-generated clips. This approach isn't terribly accurate. However, some AI researchers and search experts are using AI methods to try to improve the accuracy of video search results.
While at the University of Oxford, researcher Mark Everingham (now at the University of Leeds) collaborated with Josef Sivic and Andrew Zisserman to create a video search system using more than simple face or speech recognition. Using video from the television series Buffy the Vampire Slayer, the team pursued an ambitious goal—to move from an interface of "find me more of the person in this picture," to automatically annotating every video frame with the names of the characters present.
"The key to this work is combining information from both video and text to learn a representation of a person's appearance such that they can be recognized visually and assigned their proper name rather than an anonymous tag," Everingham notes.
Why is it tough to index video clips using the people that appear in them? Effortless tasks for humans remain extremely difficult for computer vision, Everingham says—in particular, determining that a person is in the picture in the first place and determining whether two images are of the same person.
The Buffy video search project uses statistical machine learning—specifically, computer vision methods for face detection and facial-feature localization. These are learned from training data, rather than built by hand.
Using machine learning for tasks such as face detection or object recognition has become common among the computer vision community, Everingham says. But typically, the work uses supervised learning methods, which require tight coupling between the input and desired output (for example, a class label or a face's identity). In contrast, this team's research involves "weakly supervised" methods (a growing research area), where the coupling between input and labels is imprecise or incomplete, says Everingham.
"Two aspects of our work might be considered particularly novel," he says. First, the approach uses two texts—subtitles extracted from a DVD and scripts found on fan Web sites. Neither source by itself provides enough data for the system to learn to recognize a character. But by borrowing a sequence alignment method that's been applied to applications such as gene sequence alignment and speech recognition, the team created a system that automatically aligns the texts.
Second, the technique incorporates data from the video. "The subtitles tell us when a line is spoken. By aligning with the script, we determine who says that line, and in combination we now know who speaks when. A computer vision method for visual speaker detection then separates the cases where someone is speaking, but off-screen, and when they are visible on-screen," Everingham says. "Finally, the cases for which we have a name and know they are on-screen give us training data with which to name the remaining people in the video."
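The text-alignment idea can be illustrated with a short sketch. This is not the team's system: it uses Python's standard difflib.SequenceMatcher as a stand-in for the sequence alignment method, and the subtitle lines, script lines, and function names are invented for illustration. Subtitles supply the timing, the script supplies the speaker, and aligning the two word streams yields who speaks when.

```python
from difflib import SequenceMatcher

# Hypothetical sample data. Subtitles carry timing but no speaker;
# the script carries speakers but no timing.
subtitles = [
    (12.0, "where have you been"),
    (14.5, "out looking for trouble"),
    (17.0, "did you find any"),
]
script = [
    ("WILLOW", "where have you been"),
    ("BUFFY", "out looking for trouble as usual"),
    ("WILLOW", "did you find any"),
]

def flatten(lines):
    """Turn [(tag, text), ...] into parallel lists of words and their tags."""
    words, tags = [], []
    for tag, text in lines:
        for word in text.lower().split():
            words.append(word)
            tags.append(tag)
    return words, tags

def who_speaks_when(subtitles, script):
    """Align the two word streams and return [(time, speaker), ...]."""
    sub_words, sub_times = flatten(subtitles)
    scr_words, scr_speakers = flatten(script)
    matcher = SequenceMatcher(None, sub_words, scr_words, autojunk=False)
    votes = {}  # time -> {speaker: number of matched words}
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            time = sub_times[block.a + k]
            speaker = scr_speakers[block.b + k]
            per_time = votes.setdefault(time, {})
            per_time[speaker] = per_time.get(speaker, 0) + 1
    # For each subtitle line, keep the speaker with the most matched words.
    return [(t, max(v, key=v.get)) for t, v in sorted(votes.items())]

print(who_speaks_when(subtitles, script))
```

The words "as usual" appear only in the script, and the alignment simply skips them. That tolerance for mismatches is the reason to use an alignment method rather than exact line matching.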
According to Everingham, the most challenging problem remaining is deciding whether two images are of the same person. Even in settings where lighting and facial expression are controlled, that's tough. But in television and movies, it's even harder, because changes in appearance due to factors such as varying lighting or expression are often far larger than the differences in appearance between individuals, he says. Machine learning methods will likely provide the solution to this problem, he believes, because the factors influencing appearance are so complex to model physically.
"Our ultimate goal is to provide automatic annotation of video with information about all the content of the video—not just who is in the frame, but where they are, what other objects are present, what they are doing, and how they might be feeling," says Everingham. Annotation like this for movies, news clips, or home video opens up many possibilities for easy searching, efficient use of archives, and automated narration for people with visual impairments, he says.
Blinkx makes a video search engine that it licenses to the likes of Lycos and FoxNews.com. "We use AI in a couple of interesting ways," says Suranga Chandratillake, Blinkx founder and CTO. His company's search technology tries to objectively analyze video content by using speech recognition and matching the spoken words to context gleaned from a massive database (built on some 1.5 billion Web pages).
The speech recognition technology is important, Chandratillake says, because relying on metadata tags is risky: they can be inaccurate. Most speech recognition research, though, has previously focused on applications such as dictation software that involve one or just a few speakers, or on applications such as automated customer service that involve a limited set of words. "Blinkx can't do that," Chandratillake says. "We have to listen to everything on the Net."
"As well as indexing the voice content, we index textual content from the Web," he says, such as news stories, blogs, product descriptions, and online encyclopedia material.
"We use that to build probabilistic modeling of ideas in that world," Chandratillake says. "We analyze the phonetic transcript in the context around it. The probabilistic analysis helps us better guess what the phonetics are."
For example, on a purely sound level, "recognize speech" sounds a lot like "wreck a nice beach," Chandratillake says. Blinkx uses the modeling to decide whether the context is speech recognition or a tsunami.
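A toy sketch shows how context can break a tie between two transcripts that sound alike. Everything in it is an assumption made for illustration: the topic names, the word probabilities, and the scoring rule (average log-probability per word) are invented and do not describe Blinkx's actual models.

```python
import math

# Hypothetical context models: how likely each word is in text about a topic.
# All numbers are invented for illustration.
CONTEXT_MODELS = {
    "speech technology": {"recognize": 0.02, "speech": 0.05, "a": 0.05,
                          "wreck": 0.0001, "nice": 0.001, "beach": 0.0001},
    "tsunami news":      {"recognize": 0.001, "speech": 0.001, "a": 0.05,
                          "wreck": 0.01, "nice": 0.002, "beach": 0.03},
}
UNSEEN = 1e-6  # fallback probability for words missing from a model

def context_score(text, topic):
    """Average log-probability per word, so longer candidates are not penalized."""
    model = CONTEXT_MODELS[topic]
    words = text.lower().split()
    return sum(math.log(model.get(w, UNSEEN)) for w in words) / len(words)

def best_transcript(candidates, topic):
    """Pick the candidate transcript that best fits the surrounding topic."""
    return max(candidates, key=lambda text: context_score(text, topic))

candidates = ["recognize speech", "wreck a nice beach"]
for topic in CONTEXT_MODELS:
    print(topic, "->", best_transcript(candidates, topic))
```

A production system would combine such a context score with the acoustic evidence instead of using context alone; the sketch isolates the context step.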
Blinkx's technology has also begun to use some visual analysis—for example, reading characters on the screen, such as a name on a sports jersey or a ticker on a news program, he says. The company has also begun to amass a database of famous faces. Probabilistic modeling, using the same large database used for speech recognition, makes these two visual techniques more effective, says Chandratillake.
Among the group's AI challenges, Chandratillake cites adapting the modeling to applications such as speech and visual recognition. The speech data set is so large that you must make trade-offs to make the modeling practical, he says. But the trade-offs can cause problems when you apply the modeling to a new type of recognition. So, his team is focusing on making smarter tactical trade-offs and on deriving an equally good set of assumptions from a smaller set of data, using weighted probabilistic modeling techniques.
Looking ahead, Chandratillake would like to add ever more context into video searching to make it more accurate. "We'd love to do object analysis, so you could say, pick out the Golden Gate Bridge," in the context of a news story.
In a world that already distrusts photographs owing to the sophistication of image-editing tools, similar concerns have arisen regarding video. At Dartmouth College, Hany Farid is working on technology to show whether images or video clips have been doctored. Although it's not a specific goal of Farid's, his research results could inform work on video search as well: Given a plethora of video search results, users will need help to know whether a video clip was actually filmed by Fox News or concocted from imagination in a teenager's basement.
One approach Farid's team uses is the expectation-maximization (EM) algorithm to detect tampering in de-interlaced video. Video software will often remove interlacing artifacts. (One frame of interlaced video actually travels in two parts, called fields, with horizontal lines split between them. Horizontal visual effects and blips sometimes remain when mismatches between fields exist.) This de-interlacing process gives rise to specific statistical patterns that you can estimate using the EM algorithm, Farid says. (For more on this work, see www.cs.dartmouth.edu/farid/research/tampering.html and www.cs.dartmouth.edu/farid/research/cgorphoto.html.)
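To give a feel for how EM can expose such patterns, here is a one-dimensional toy, not Farid's video algorithm. It assumes a signal in which some samples were filled in by averaging their two neighbors, much as a simple de-interlacer fills in missing lines. Each sample is modeled as either interpolated (a multiple alpha of its neighbors' sum plus Gaussian noise) or an outlier (uniformly distributed), and EM alternates between estimating alpha and estimating which samples fit. The function names, the model, and all parameter values are assumptions made for this sketch.

```python
import math
import random

def make_signal(n_originals=101, seed=0):
    """Toy data: even indices are original samples, odd ones are interpolated."""
    rng = random.Random(seed)
    originals = [rng.uniform(0, 255) for _ in range(n_originals)]
    signal = []
    for a, b in zip(originals, originals[1:]):
        signal += [a, 0.5 * (a + b)]  # an original, then the average of its neighbors
    signal.append(originals[-1])
    return signal

def em_interpolation(signal, value_range=256.0, iterations=50, min_sigma=0.5):
    """Fit y[i] ~ alpha * (y[i-1] + y[i+1]); return alpha and per-sample posteriors."""
    idx = range(1, len(signal) - 1)
    sums = {i: signal[i - 1] + signal[i + 1] for i in idx}
    post = {i: 1.0 for i in idx}         # start by trusting every sample equally
    outlier_density = 1.0 / value_range  # uniform model for non-interpolated samples
    alpha = 0.0
    for _ in range(iterations):
        # M-step: weighted least squares for alpha, then the residual spread.
        num = sum(post[i] * signal[i] * sums[i] for i in idx)
        den = sum(post[i] * sums[i] ** 2 for i in idx)
        alpha = num / den
        resid = {i: signal[i] - alpha * sums[i] for i in idx}
        var = sum(post[i] * resid[i] ** 2 for i in idx) / sum(post.values())
        sigma = max(math.sqrt(var), min_sigma)  # floor keeps the Gaussian finite
        # E-step: posterior probability that each sample was interpolated.
        for i in idx:
            g = math.exp(-resid[i] ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))
            post[i] = g / (g + outlier_density)
    return alpha, post

signal = make_signal()
alpha, post = em_interpolation(signal)
odd = [post[i] for i in post if i % 2 == 1]
even = [post[i] for i in post if i % 2 == 0]
print("estimated alpha:", round(alpha, 3))
print("mean posterior, interpolated samples:", round(sum(odd) / len(odd), 3))
print("mean posterior, original samples:", round(sum(even) / len(even), 3))
```

In this toy, a region where the expected pattern is missing would show up as a run of low posteriors. The real problem involves two-dimensional frames, motion, and compression, none of which the sketch addresses.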
Farid's research also employs support vector machines, a popular concept in the machine learning community, he notes. An SVM is a classifier for categorizing inputs into two or more categories. Farid's team first trains an SVM on a set of inputs and then applies it to novel data. His research, in progress for about five years, has trained SVMs on wavelet statistics to differentiate computer-generated images from photographic images. (Wavelets decompose a signal into components at different scales and orientations; statistics computed from those components can then summarize an image's properties.)
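The train-then-apply pattern can be sketched with a tiny linear SVM. The article does not describe the team's SVM setup, so everything here is an assumption: the two made-up feature values per image stand in for wavelet statistics, the training method is a simple subgradient descent on the hinge loss (in the style of the Pegasos algorithm), and the bias is handled as an extra constant feature to keep the code short.

```python
def train_linear_svm(samples, labels, lam=0.01, epochs=200):
    """Return weights (last entry is the bias) trained on hinge loss."""
    w = [0.0] * (len(samples[0]) + 1)
    t = 0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            t += 1
            eta = 1.0 / (lam * t)              # decaying step size
            xb = list(x) + [1.0]               # constant feature acts as the bias
            margin = y * sum(wi * xi for wi, xi in zip(w, xb))
            w = [(1 - eta * lam) * wi for wi in w]   # shrink: the regularization step
            if margin < 1:                     # inside the margin: push the point out
                w = [wi + eta * y * xi for wi, xi in zip(w, xb)]
    return w

def predict(w, x):
    """Classify a feature vector as +1 or -1."""
    score = sum(wi * xi for wi, xi in zip(w, list(x) + [1.0]))
    return 1 if score >= 0 else -1

# Hypothetical training data: +1 = photographic, -1 = computer generated.
samples = [(2.0, 1.5), (2.5, 2.2), (1.8, 2.9), (3.0, 1.6),
           (-2.0, -1.7), (-1.5, -2.4), (-2.8, -1.9), (-1.6, -3.0)]
labels = [1, 1, 1, 1, -1, -1, -1, -1]

w = train_linear_svm(samples, labels)
print(predict(w, (2.2, 2.0)))    # a novel image with photo-like features
print(predict(w, (-2.1, -2.5)))  # a novel image with CG-like features
```

Real feature vectors would have many more dimensions and would rarely separate this cleanly, which is where kernels and careful tuning come in.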
The biggest difficulty with this research is the massive amount of data that must be processed, according to Farid. "On the other hand, this also makes it harder for the forger to create a convincing forgery, so that should help us as well. In all of our work, we think about how a forger might specifically tamper with a video or image, formulate what type of statistical pattern this manipulation would produce, and then devise techniques for detecting these patterns."
His team's ultimate goal: develop a suite of image, video, and audio tools for detecting tampering. These tools won't stop the tampering but will make it more difficult, Farid adds. "We currently have two tools completed and several more that we are actively working on," he says. "I expect to continue work on video, image, and audio forensics for several years to come. Our primary audience is law enforcement and media outlets, but I am learning that there are many more potential applications of this work."
The historic victory of IBM's Deep Blue over chess grand master Garry Kasparov in 1997 had the unintended effect of boosting interest in Go. This ancient Asian board game has become a challenge for AI researchers around the world. Go resists the kind of brute-force game-tree search that Deep Blue used; the number of possible moves is too large. This has inspired researchers to develop hybrid approaches that combine several algorithms.
"After chess it is a logical next step—a game with simple rules, and a completely controlled environment," says leading computer Go researcher Martin Mueller about Go's growing popularity in the AI community. "Any real-life problem that consists of many loosely connected smaller components can profit. That covers just about any interesting real-life problem."
Go, which originated in China, is frequently described as "deceptively simple." Two players take turns putting black and white stones on a board with a 19 × 19 grid. The aim is to capture and hold territory. A stone or group of stones surrounded by the opponent's stones is captured and removed from the board. The game ends when both players agree that neither side can improve its position.
More than 100 computer Go programs are on the market; examples include SmartGo, Go++, and Many Faces of Go (see the sidebar for related links). The programs employ heuristics, selective search, pattern matching, and hand-crafted rules but are no match for even amateur players.
Part of the reason is that the Go board's large size leads to an astronomically large number of possible positions. A chess player can choose from about 25 moves, but a Go player has more than 200 options. Commercial chess programs can evaluate about 300,000 positions per second (Deep Blue did 200 million per second), but midway through a game of Go, computer programs can evaluate a few dozen positions per second at best.
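A little arithmetic shows why the gap between 25 and 200 options per turn matters so much. The sketch below uses the figures quoted above and assumes, as a simplification, a full game tree in which every position offers the same number of moves; the function names are illustrative.

```python
def positions_to_search(branching_factor, depth):
    """Leaf positions in a full game tree that looks `depth` moves ahead."""
    return branching_factor ** depth

def seconds_to_search(branching_factor, depth, positions_per_second):
    """Time to evaluate every leaf at a given evaluation speed."""
    return positions_to_search(branching_factor, depth) / positions_per_second

DEEP_BLUE_SPEED = 200_000_000  # positions per second, as quoted above

for depth in (2, 4, 6):
    chess = positions_to_search(25, depth)
    go = positions_to_search(200, depth)
    go_time = seconds_to_search(200, depth, DEEP_BLUE_SPEED)
    print(f"depth {depth}: chess {chess:,}, Go {go:,}, "
          f"Go at Deep Blue speed: {go_time:,.2f} s")
```

Looking four moves ahead already means 390,625 positions in chess but 1.6 billion in Go, and the ratio grows by a factor of eight with every additional move.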
Moreover, capturing Go board positions in algorithms is extremely difficult. "It's really hard to evaluate positions," says SmartGo developer Anders Kierulf. "You have to determine which stones are connected, which stones are alive or dead, and use that information to map the territorial balance as well as the influence." Experienced players are said to recognize patterns in strings of stones early in the game that might have strategic implications later in the game, but they rely on intuition and are often unable to explain why they made certain moves.
These complexities are what attract AI researchers to the game. "Go is particularly appealing for researchers, as it is well defined and constrained by a set of simple rules," says Thore Graepel of Microsoft Research in Cambridge, England. "We are fascinated by a problem that is simple to state but extremely difficult to solve."
Graepel is coauthor of the 2006 paper "Bayesian Pattern Ranking for Move Prediction in the Game of Go" (www.icml2006.org/icml_documents/camera-ready/110_Bayesian_Pattern_Ran.pdf). "In this paper we focus on the problem of modeling the play of human players by using machine learning techniques to learn from records of historical games," Graepel explains. "The model was trained from 180,000 records of games between expert Go players and aims at mimicking their way of playing in new, as-yet-unseen situations arising in a game. The resulting system gives the best currently published results for expert Go move prediction with a success rate of 34 percent on average compared to previous results of around 25 percent."
Another group of researchers is using the Monte Carlo method to improve computer Go. This statistical-sampling approach, which dates to the 1940s and was first applied to Go in the early 1990s, is widely used in computational physics. Monte Carlo responds to a game situation by playing the game out at random thousands of times and then selecting the move that has produced the best result on average. The Monte Carlo method has become popular in recent years because researchers now have low-cost computers that can run many simulations.
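The selection rule is easy to demonstrate on a game far simpler than Go. The sketch below uses a stone-taking game as a stand-in: players alternately remove one to three stones, and whoever takes the last stone wins. The game, the function names, and the number of playouts are assumptions chosen to keep the example small; Go playouts must also handle captures, legality, and scoring.

```python
import random

def legal_moves(stones):
    """A player may take 1, 2, or 3 stones, but no more than remain."""
    return [n for n in (1, 2, 3) if n <= stones]

def random_playout(stones, my_turn, rng):
    """Play random moves to the end; return True if 'I' take the last stone."""
    while True:
        stones -= rng.choice(legal_moves(stones))
        if stones == 0:
            return my_turn
        my_turn = not my_turn

def monte_carlo_move(stones, playouts=2000, seed=0):
    """Return (best_move, win_rates) after random playouts of each legal move."""
    rng = random.Random(seed)
    win_rate = {}
    for move in legal_moves(stones):
        remaining = stones - move
        if remaining == 0:
            win_rate[move] = 1.0  # taking the last stone wins outright
            continue
        # After my move it is the opponent's turn, so my_turn starts as False.
        wins = sum(random_playout(remaining, False, rng) for _ in range(playouts))
        win_rate[move] = wins / playouts
    return max(win_rate, key=win_rate.get), win_rate

best, rates = monte_carlo_move(5)
print("best move with 5 stones:", best)
print({move: round(rate, 2) for move, rate in rates.items()})
```

No hand-written evaluation of positions is needed: the program only has to know the rules and be able to finish a game, which is what makes the approach attractive for Go.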
At the 2006 Computer Olympiad, French researcher Remi Coulom won a gold medal with Crazy Stone, a computer Go program that employs Monte Carlo. Crazy Stone won its medal playing on a 9 × 9 board, which is commonly used by both beginning players and computer Go programmers.
Crazy Stone combines Monte Carlo with min-max tree pruning and upper-confidence bounds applied to trees (UCT). The alpha-beta pruning heuristic used in min-max search assigns values to game tree nodes to stop evaluations of moves that are worse than the previously examined move, thus reducing processing time without affecting the final result. UCT, an algorithm developed by Levente Kocsis and Csaba Szepesvari in 2006 (http://zaphod.aml.sztaki.hu/papers/ecml06.pdf), chooses the move with the highest upper-confidence bound, which is the sum of the move's average value and the size of its confidence interval.
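The upper-confidence-bound rule can be written down in a few lines. The sketch below uses the widely used UCB1 form, in which the confidence interval shrinks as a move is tried more often; the exploration constant c = 1.4 and the function names are assumptions for illustration, not values taken from Crazy Stone or from the UCT paper.

```python
import math

def ucb_score(total_value, visits, parent_visits, c=1.4):
    """Average value of a move plus the size of its confidence interval."""
    if visits == 0:
        return math.inf  # always try an unvisited move first
    average = total_value / visits
    interval = c * math.sqrt(math.log(parent_visits) / visits)
    return average + interval

def select_move(stats, c=1.4):
    """stats maps move -> (total_value, visits); return the move to try next."""
    parent_visits = sum(visits for _, visits in stats.values())
    return max(stats, key=lambda move: ucb_score(*stats[move], parent_visits, c))

# Move A has the better average (0.6 vs. 0.5) but has been tried 100 times;
# move B has been tried only 10 times, so its interval is much wider.
stats = {"A": (60.0, 100), "B": (5.0, 10)}
print(select_move(stats))         # favors the less-explored move B
print(select_move(stats, c=0.0))  # with no exploration term, picks A
```

The rule balances exploiting moves that have done well against exploring moves that have barely been sampled, which is how it steers the Monte Carlo simulations toward the promising parts of the tree.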
Most experts believe a computer Go program that can beat the world's best players is decades away. Yet researchers will keep trying, largely because of Go's complexity. As Thore Graepel of Microsoft Research puts it, "Since Go is one of many tasks in which humans can rapidly learn to outperform computers, it does seem likely that the techniques which eventually produce a strong Go playing program will offer insights into machine intelligence in general. For example, one could speculate that methods which are successful at determining the value of Go positions might prove useful for image processing, as the analysis of Go positions is a very visual task."
A Beginner's Introduction to Go: www.cs.umanitoba.ca/~bate/BIG/BIG.contents.html
The Computer Go Ladder (an informal competition between computer Go programs): www.cgl.ucsf.edu/go/ladder.html
Ego (Bruce Wilcox's computer Go program that can play with seven different personalities): http://webpages.charter.net/suewilcox
Many Faces of Go: www.smart-games.com
OpenGo (a workbench for programmers writing automated Go opponents): www.inventivity.com/OpenGo
World Computer Go Championship 2006: http://computer-go.softopia.or.jp/gifu2006/English
World Web Go (Japanese Java Go server): https://home.wwgo.jp