NOVEMBER/DECEMBER 2006 (Vol. 8, No. 6) pp. 4-8
1521-9615/06/$31.00 © 2006 IEEE
Published by the IEEE Computer Society
Neural Networks Show New Promise for Machine Vision
Twenty years ago, Geoffrey Hinton had an idea that was ahead of its time: to help computers learn from their mistakes. He wanted to create artificial neural networks that could learn to "see" images and recognize patterns through a kind of trial and error in a way that mimicked some models of human brain function. He and his colleagues developed the first practical method for training neural networks, but there was just one problem.
"We could never make it work properly," Hinton says. At least, not the way they wanted to. Obstacles such as the lack of computing power stood in their way, and controversy erupted when studies in neuroscience suggested that the human brain probably didn't function like their model network.
Flash ahead to 2006, and computing power is no longer a problem. Neuroscience has discovered much about how the brain works, but much of how we see is still a mystery. Hinton, a computer scientist at the University of Toronto and the Canadian Institute for Advanced Research, has discovered some creative strategies to help neural networks fulfill their potential in pattern recognition and artificial intelligence, which he reported in a recent issue of Science (vol. 313, no. 5786, 2006, pp. 504–507). Machine vision is his near-term goal, but the real prize could be insight into the human brain.
Although Hinton couldn't meet his original goal in 1986, neural networks have since become computational workhorses that perform a wide variety of statistical analyses in research and commercial software. Their model "neurons" are arranged in successive layers: a bottom layer receives the input data, and each layer processes the data before passing it to the layer above. The network can encode data, folding data sets of many variables or dimensions down into a smaller number of dimensions that are computationally more manageable. It can also decode data, unfolding an encoded representation so that the higher-dimensional data re-emerges. High-dimensional data requires many folding or unfolding steps, and thus several neuron layers, to do the job.
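The folding and unfolding can be sketched in a few lines of NumPy. Everything here is illustrative: the 784-pixel input, the layer sizes, and the logistic units are assumptions for the sketch, and the weights are random, so the "reconstruction" is meaningless until the network is trained. The point is only how successive layers fold 784 dimensions down to a 30-number code and unfold it again.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(x, w, b):
    """One model neuron layer: a linear map followed by a logistic nonlinearity."""
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

# Hypothetical layer sizes: fold 784 input dimensions (e.g., a 28x28 image)
# through a 250-unit layer down to a 30-dimensional code, then unfold back.
sizes = [784, 250, 30]
enc = [(rng.normal(0, 0.1, (a, b)), np.zeros(b)) for a, b in zip(sizes, sizes[1:])]
dec = [(rng.normal(0, 0.1, (b, a)), np.zeros(a)) for a, b in zip(sizes, sizes[1:])][::-1]

x = rng.random(784)          # a fake "image" of 784 pixel values
h = x
for w, b in enc:             # encoder: each layer folds the data further
    h = layer(h, w, b)
code = h                     # the compact 30-dimensional representation
for w, b in dec:             # decoder: unfold the code back to pixel space
    h = layer(h, w, b)
reconstruction = h

print(code.shape, reconstruction.shape)   # (30,) (784,)
```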
Face images are a prime example of high-dimensional data that researchers would like to be able to analyze with machine vision systems. A computer that could identify people by their faces would be an invaluable security tool, but it hasn't proved easy to design. Because each pixel in a digital image is a variable, the average megapixel camera could take a portrait that would require many layers of neurons to process.
The training method Hinton and his colleagues developed 20 years ago—backward propagation of error, or backpropagation for short—is now a standard procedure for training neural networks. When a network tries to identify a known object but comes up with the wrong answer, users feed the right answer back into the network. The process repeats, and if all goes well, the network eventually learns to provide the right answer. However, backpropagation isn't an effective way to train networks with many layers because the correction (for example, "this is the right answer") only penetrates the outermost neuron layers. The deeper layers are unreachable, so they function as a kind of black box. To keep the deeper layers from getting stuck on the wrong answer, the user must design the network algorithmically to be close to the right answer from the start. That's an especially tricky job when the network has several layers or the right answers aren't well known.
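A minimal training loop makes the "feed the right answer back in" step concrete. The toy parity task, layer sizes, and learning rate below are assumptions for illustration, not details from Hinton's work; note how the correction reaches the hidden layer only by being propagated backward through the output weights, which is exactly what becomes ineffective when many layers are stacked.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy task (an assumption for illustration): map 4-bit inputs to their parity.
X = rng.integers(0, 2, (64, 4)).astype(float)
y = (X.sum(axis=1) % 2).reshape(-1, 1)

# A small network with one hidden layer of 8 units.
w1, b1 = rng.normal(0, 0.5, (4, 8)), np.zeros(8)
w2, b2 = rng.normal(0, 0.5, (8, 1)), np.zeros(1)

def forward(x):
    h = sigmoid(x @ w1 + b1)
    return h, sigmoid(h @ w2 + b2)

_, out = forward(X)
loss0 = np.mean((out - y) ** 2)   # error before training

lr = 0.5
for _ in range(2000):
    h, out = forward(X)
    # The gap between the network's answer and the right answer...
    d_out = (out - y) * out * (1 - out)
    # ...is propagated backward to correct the hidden layer.
    d_hid = (d_out @ w2.T) * h * (1 - h)
    w2 -= lr * (h.T @ d_out) / len(X)
    b2 -= lr * d_out.mean(axis=0)
    w1 -= lr * (X.T @ d_hid) / len(X)
    b1 -= lr * d_hid.mean(axis=0)

_, out = forward(X)
loss1 = np.mean((out - y) ** 2)   # the error shrinks as training repeats
print(loss0, loss1)
```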
What makes Hinton's latest work special is that it shows how to reach the deepest layers of a neural network: he and doctoral student Ruslan Salakhutdinov found an algorithm that can pretrain each layer as the network is built. Once a network is pretrained, backpropagation becomes an effective method for fine-tuning it.
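The layer-at-a-time idea can be sketched as follows, with one important hedge: Hinton and Salakhutdinov's paper pretrains each layer as a restricted Boltzmann machine, whereas this simplified stand-in trains each layer as a shallow, tied-weight autoencoder on the activations of the layer below. The data and layer sizes are made up for the sketch.

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pretrain_layer(data, n_hidden, lr=0.5, steps=200):
    """Train one layer to reconstruct its own input (a shallow autoencoder
    with tied weights) -- a simplified stand-in for the restricted-Boltzmann-
    machine pretraining used in the Science paper."""
    n_in = data.shape[1]
    w = rng.normal(0, 0.1, (n_in, n_hidden))
    b_h, b_v = np.zeros(n_hidden), np.zeros(n_in)
    for _ in range(steps):
        h = sigmoid(data @ w + b_h)        # encode
        v = sigmoid(h @ w.T + b_v)         # decode with the same weights
        d_v = (v - data) * v * (1 - v)     # output-layer error
        d_h = (d_v @ w) * h * (1 - h)      # error backpropagated one step
        w -= lr * (data.T @ d_h + d_v.T @ h) / len(data)
        b_h -= lr * d_h.mean(axis=0)
        b_v -= lr * d_v.mean(axis=0)
    return w, b_h

# Build a deep encoder greedily: each new layer is pretrained on the
# activations of the layer below, before any global fine-tuning.
data = rng.random((100, 32))          # 100 fake samples of 32 "pixels"
layers, acts = [], data
for n_hidden in (16, 8, 4):           # hypothetical layer sizes
    w, b = pretrain_layer(acts, n_hidden)
    layers.append((w, b))
    acts = sigmoid(acts @ w + b)
print(acts.shape)                     # (100, 4): the pretrained deep code
```

Once every layer has been pretrained this way, the whole stack is close enough to a sensible solution that ordinary backpropagation can fine-tune it.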
Hinton and Salakhutdinov used their method to construct networks for three visual tasks: finding compact representations of curved lines, handwritten numbers, and faces in photographs. They joined an encoder network for each task to a decoder and tested whether an image fed into the encoder would successfully re-emerge from the decoder. The networks learned their tasks on thousands of training images, but were asked to encode-decode new images that they had never "seen" before. Even so, this method outperformed two other common methods, and did so while more closely emulating what Hinton suspects is happening in the human brain. It's the realization of the system that he wanted to build back in '86.
So Many Dimensions, So Little Time
Hinton studied psychology before computer science, so even as he began to work on backpropagation, he knew that it alone didn't make a good model for human learning. Neurons send chemical signals forward through adaptive connections in milliseconds, but they can't send rapid signals backward through the same connections.
"When we're learning to see, nobody's telling us what the right answers are—we just look," Hinton says. "Every so often, your mother says 'that's a dog,' but that's very little information. You'd be lucky if you got a few bits of information—even one bit per second—that way. The brain's visual system requires 10^14 [neural] connections. And you only live for 10^9 seconds. So it's no use learning one bit per second. You need more like 10^5 bits per second. And there's only one place you can get that much information—from the input itself."
That idea, known in psychology as generative learning, led him to his encoder-decoder. In this scheme, every pixel in an inputted image becomes an opportunity to learn, so having a megapixel image isn't a bad thing. "If you make mistakes, you make mistakes in millions of pixels, so you get lots of error information," he says.
Twenty years ago, computers weren't powerful enough to run such a system, nor were there data sets large enough to train it. Compounding this was the problem of training deep layers in the neural network. Today, powerful CPUs are plentiful, as are large data sets. You could have predicted that they would both catch up to Hinton's plan, but the last obstacle—finding a way to train the network—wasn't a given. As to how he and Salakhutdinov came up with their pretraining algorithm, he explains it this way: "If you keep thinking hard about anything for 20 years, you'll go a long way." The entire code for the pretraining method is available on Hinton's Web site (www.cs.toronto.edu/~hinton/).
Robert P.W. Duin, associate professor of electrical engineering, mathematics, and computer science at the Delft University of Technology in the Netherlands, develops learning strategies for both neural networks and a competing methodology called the support vector machine (SVM). He's impressed that Hinton demonstrated his technique's effectiveness for a series of applications, but suspects that matching the right model to a given problem isn't straightforward and might only be feasible for an expert. For his part, Hinton says that adapting his algorithm for different applications doesn't involve a lot of tweaking so much as changing the network's size.
Yiannis Aloimonos, a computer scientist at the University of Maryland, thinks that Hinton's work will be "quite influential," and not just because it opens the door to a variety of new, powerful learning techniques. He points to the April 2006 Columbia Theory Day conference (www.cs.columbia.edu/theory/sp06.html), where Princeton University scientist Bernard Chazelle gave a presentation on data-driven algorithm design. "In that presentation, Chazelle argued that algorithmic design as we know it has reached its limitations and now moves into a new phase, where as we design new algorithms, we also have access to gargantuan amounts of data. We do statistical analysis on this data, and we use the results to gain intuition for better modeling of our original problem," Aloimonos remembers.
To him, Hinton's ideas couple well with Chazelle's to form a new methodology, one in which large data sets let researchers map relationships among relevant variables. "We let the data itself tell us how things are related," he says. "Of course, the data cannot tell us everything—we still need to do some modeling, but now our modeling will be guided by the statistical analysis. In this new era, Hinton's deep auto-encoders and dimensionality reducers will become a basic tool for anyone developing new algorithms."
In his own work with Kwabena Boahen of Stanford University, Aloimonos designs chips that he hopes will one day integrate computer vision, hearing, and language capabilities into one cognitive system. It's a difficult enterprise that is tied to our understanding of how those capabilities are entwined in the human brain.
Hinton's strategy harkens back to the very early days of neural networks—the 1950s—when people wanted to train one neural layer at a time. His Science paper marks the first time anyone has penetrated the black box to show that this can indeed be done, even for very deep networks. Encoding and decoding a complex image such as a human face is an important first step toward developing machines that can see in useful ways beyond handwriting analysis or simple shape recognition.
Still, he'd like to eventually develop machines that do even more—ones that see the way we do and therefore act as tools to help us better understand ourselves. "I'm interested in two things. One is how to solve tough problems in artificial intelligence like shape recognition and speech recognition," he says. "But the second is answering the question, how does the brain actually do it? And the new algorithm we've developed for pretraining these neural nets is probably much more like what the brain is using."
Within the machine vision community, there is much discussion about whether neural networks are the best platform for this kind of research. The SVM is a notable competitor because it uses statistical algorithms to identify visual features in a single processing layer. SVMs offer simpler computation and—until now—improved performance over neural networks, as well as transparency. Hinton says with some pride that for the handwritten digit-recognition task, the best reported error rate for SVMs was 1.4 percent, but his pretrained neural network with backpropagation achieved 1.2 percent. Other researchers have suggested that both SVMs and neural networks have a place in machine vision because some tasks seem better suited to one or the other.
Aloimonos agrees that, historically, there has been tension between researchers who use computer models such as neural networks and those who use purely statistical methods such as SVMs to simulate learning. "In my view, the future will bring the two groups together—the modelers will use the tools of the statistical learners in order to do better modeling." He predicts that the joining of the two concepts will be the hottest topic in the discipline for years to come, especially as data sets grow even larger. "Somehow it is the spirit of the times," he says. "Something felt on a daily basis by the use of Google."