Issue No. 01 - January/February (2006 vol. 21)
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/MIS.2006.10
Practical Pattern Matching
Humans are fascinated by patterns, and they can spot them well—in fact, that's one area where humans excel over computers. But research is producing interesting competition as scientists discover and employ new methods of automated pattern recognition. Practical applications include finding genes, detecting cancer-causing chemicals in molecules, searching out potential terrorists, and predicting terrorist threat levels, as well as recognizing speech patterns and creating a nanotechnology resource library.
No new genes?
At the University of Toronto, Brendan Frey is leading a group of scientists who are using AI techniques to analyze molecular-biology data. One of their projects involves using a factor graph they developed called GenRate to discover and evaluate genes in mouse tissues. Factor graphs let researchers describe a system with complex variables, such as gene location in DNA as well as gene length and function.
"What a factor graph is useful for," says Frey, "is describing a scoring function that tells you how good each setting of the variables is."
Working with samples from more than 1 million probes along DNA in 37 different mouse tissues, the scientists used their factor graph to determine which bits of DNA are expressed, or activated to make protein. A given stretch of DNA might be expressed in some tissues but not in others. DNA parts that have no function are never activated.
In the factor graph, each variable is a node. The scoring function comprises many local scoring functions, each of which looks at a small number of variables and assigns a score to each configuration of those variables. The sum of the local scores is the total score. "It's a nice way to decompose a very complex problem into a whole bunch of simpler problems," Frey says. The scientists then compare the factor graph data to known gene patterns.
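The decomposition Frey describes can be sketched in a few lines of Python. This is a toy illustration, not GenRate: the three probe variables, the two local scoring functions, and their numeric scores are all made up for the example, and the brute-force search over every configuration stands in for the message-passing methods normally used on large factor graphs.

```python
from itertools import product

# Hypothetical variables: whether each of three adjacent probes is expressed.
variables = ["probe_a", "probe_b", "probe_c"]

def prefers_expressed(a):
    # Local score on one variable: mild reward if the probe is expressed.
    return 1.0 if a else 0.0

def agree(x, y):
    # Local score on two variables: reward neighbouring probes in the same state.
    return 2.0 if x == y else 0.0

# Each factor pairs the variables it looks at with a local scoring function.
factors = [
    (("probe_a",), prefers_expressed),
    (("probe_a", "probe_b"), agree),
    (("probe_b", "probe_c"), agree),
]

def total_score(assignment):
    # The total score is the sum of the local scores.
    return sum(f(*(assignment[v] for v in scope)) for scope, f in factors)

def best_assignment():
    # Brute force over all True/False settings (fine for three variables only).
    candidates = (
        dict(zip(variables, values))
        for values in product([False, True], repeat=len(variables))
    )
    return max(candidates, key=total_score)

if __name__ == "__main__":
    best = best_assignment()
    print(best, total_score(best))
```

Each local function sees only one or two variables, yet together they rank every configuration of the whole system.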
Because the factor graph provides a computational framework for vetting the best configuration of variables as well as discovering them, the team came up with surprising results that led to a major revision of the view of the mammalian genome. Although some research claims many genes are left to discover, Frey's team has shown that might not be true. "Beyond the genes we found," Frey says, "we don't believe there exists many new protein-coding genes."
At the University of Texas at Arlington, Lawrence Holder has developed Subdue, another pattern recognition system based on graphs. A data mining system that represents data as a collection of nodes and links between the nodes, Subdue works by searching through graph data using a heuristic based on the notion of compression.
After researchers input a big graph into the system and run a search, Subdue finds a pattern that has several instances in the graph. The system then replaces all of those instances by a single node, making the graph smaller. The larger the pattern, and the more instances it has, the more compression you get. "The more it compresses, the more we're interested in it," Holder says.
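The compression idea can be illustrated with simple arithmetic. The sketch below is an assumption-laden simplification, not Subdue's actual measure: it counts a graph's size as nodes plus links, assumes the pattern's instances do not overlap, and charges the cost of storing the pattern once. The function names are hypothetical.

```python
def graph_size(num_nodes, num_edges):
    # Simplified size: one unit per node and per link (an assumption).
    return num_nodes + num_edges

def compression(graph_nodes, graph_edges, pattern_nodes, pattern_edges, instances):
    """Ratio of original size to compressed size; larger means more compression."""
    original = graph_size(graph_nodes, graph_edges)
    # Each instance collapses to a single node, so it removes
    # (pattern_nodes - 1) nodes and all of the pattern's internal links.
    compressed_nodes = graph_nodes - instances * (pattern_nodes - 1)
    compressed_edges = graph_edges - instances * pattern_edges
    # The pattern itself must be stored once alongside the compressed graph.
    pattern_cost = graph_size(pattern_nodes, pattern_edges)
    compressed = graph_size(compressed_nodes, compressed_edges) + pattern_cost
    return original / compressed

if __name__ == "__main__":
    # A graph with 100 nodes and 150 links, scored against three candidate patterns.
    print(compression(100, 150, 3, 3, 10))   # small pattern, 10 instances
    print(compression(100, 150, 6, 7, 10))   # larger pattern, 10 instances
    print(compression(100, 150, 3, 3, 20))   # small pattern, 20 instances
```

Under these assumptions, both the larger pattern and the more frequent pattern score higher than the small, less frequent one, which matches Holder's description of what makes a pattern interesting.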
A practical application of Subdue examines a chemical structure to determine whether it causes cancer. The system represents the chemical in terms of its atoms and the bonds between them (the atoms are nodes in the graph, and the bonds are links). For the system to learn, researchers input many cancer-causing chemicals as graphs. The system then searches through the space of subgraphs to find a pattern that recurs in many of them. That pattern is then matched against a new chemical's structure.
"The interpretation would be if this submolecule shows up in 90 percent of these chemicals that cause cancer, then it may be predictive," Holder says. So, if a new chemical contains the subgraph, he says, "you might predict that this chemical may cause cancer, and you may want to go off and test it in the laboratory."
Holder has tested the system in the American Cancer Institute's predictive-toxicology challenge. The Institute releases information on two sets of chemicals: one that it has determined to be cancer causing and one that it has determined is not. Participants speculate which chemicals cause cancer, and the one with the most correct guesses wins the challenge. Holder won the competition in 2000.
Subdue is also useful for detecting patterns of potential terrorist activity and locating potential terrorist networks. Holder trained his system on simulated data that the US Air Force's Evidence Assessment, Grouping, Linking, and Evaluation program created. The domain simulates the evidence available about terrorist groups and their plans before they put them into action.
Following a general plan of starting a group, recruiting members, acquiring resources, communicating, visiting targets, and transferring resources between actors, groups, and targets, the domain contains numerous concepts. The concepts include threat and nonthreat actors and threat and nonthreat groups.
Trained on patterns that give examples of threat potential, Subdue searched the simulated data to find similar types of patterns. The system achieved 78 to 93 percent accuracy discriminating threat from nonthreat groups.
Cornell University professor Shimon Edelman, in collaboration with colleagues at Tel Aviv University, has created a program that can discover patterns in languages, learn them as grammars, and then generate sentences of its own in that language. The system is called Adios (automatic distillation of structure), and it has been tested on both natural languages, such as English and Chinese, and artificial grammars, such as those in DNA and music.
"You can only recognize patterns if you have the right primitives, the right features," Edelman says. "It's like having the right glasses."
Language contains patterns that on their face are invisible, and it's generally thought to possess structure beyond just the serial order of words, a grammar. "The true structure of the sentence is a kind of tree," Edelman says. Adios combines statistics and rules applied to a body of text in a language to discover the grammar. The system can then generate sentences in that language.
"It can do things like assign structure to a new sentence," Edelman says. "It's not such a big deal to recognize a pattern on which you've trained your system." The team is patenting the technology, and Edelman wants to put it to commercial use. One possible arena is speech recognition technologies.
Patterns in theory
Research on the theoretical aspect of pattern discovery is also generating useful applications. At the University of California, Davis, Jim Crutchfield has leveraged his interest in "what a pattern is" to apply a pattern's abstract definition—which he calls a causal state—to different kinds of processes. He defines causal states as groups of histories that lead to the same knowledge about the future.
The mathematical theory that defines causal states also points to a small number of ways to find them. Two histories are in the same state of predictability when they give the same knowledge about the future, so by looking at data you can estimate which histories, observed at different points in time, lead to the same predictions. From that definition, Crutchfield derived an algorithm that groups histories that are predictively equivalent. "We just apply this in these different domains," he says, "whether it's a spatial pattern, like cellular automata, or time series, or looking at complex materials."
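Grouping histories by what they predict can be sketched on a toy symbol sequence. This is a rough illustration under stated assumptions, not Crutchfield's algorithm: it uses fixed-length histories and futures in place of full pasts and futures, estimates probabilities from counts, and merges histories greedily when their estimated future distributions agree within a hand-picked tolerance. The test process (binary, never two 1s in a row) and all function names are chosen for the example.

```python
import random
from collections import Counter, defaultdict

def golden_mean_sequence(length, seed=0):
    # Toy process: after a 1 the next symbol is always 0;
    # after a 0 the next symbol is 0 or 1 with equal probability.
    rng = random.Random(seed)
    symbols = ["0"]
    while len(symbols) < length:
        if symbols[-1] == "1":
            symbols.append("0")
        else:
            symbols.append("1" if rng.random() < 0.5 else "0")
    return "".join(symbols)

def future_distributions(sequence, history_len, future_len):
    # For each observed history, estimate the distribution over what follows it.
    counts = defaultdict(Counter)
    for i in range(history_len, len(sequence) - future_len + 1):
        history = sequence[i - history_len:i]
        future = sequence[i:i + future_len]
        counts[history][future] += 1
    return {
        h: {f: n / sum(c.values()) for f, n in c.items()}
        for h, c in counts.items()
    }

def close(p, q, tol):
    # Two distributions match if every probability differs by at most tol.
    return all(abs(p.get(k, 0.0) - q.get(k, 0.0)) <= tol for k in set(p) | set(q))

def group_histories(distributions, tol=0.05):
    # Greedy grouping: join the first group whose representative matches.
    groups = []
    for history in sorted(distributions):
        for representative, members in groups:
            if close(distributions[history], representative, tol):
                members.append(history)
                break
        else:
            groups.append((distributions[history], [history]))
    return [members for _, members in groups]

if __name__ == "__main__":
    seq = golden_mean_sequence(20000)
    print(group_histories(future_distributions(seq, 2, 2)))
```

For this process, histories ending in 0 should land in one group and the history ending in 1 in another, since only the last symbol affects what comes next.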
Crutchfield has applied his causal-state definition to examining dynamic systems, irregular crystals, hidden Markov models, and cellular automata. One field of application is quantum computation.
"A current proposal for implementing molecular computers is to look at very long chain molecules and to design the interactions between the atoms in the molecular chain so that they implement various of these cellular-automata rules," Crutchfield says. "So this pattern discovery system that we have for cellular automata is making a catalog of all the possible kinds of interactions and what sorts of information storage structures they can produce, and how those information storage structures can be moved around and interacted, and how they interact to process information."
Crutchfield is working on developing a library called the Encyclopedia of Cellular Automata. "It will be a resource for people working in nanotechnology," he says, "to look at how to design molecular systems that have only local interactions but that will produce in their behavior large-scale structures that can be used for doing computations."
According to Nello Cristianini, associate professor of statistics at UC Davis, people have always been attracted to patterns and pattern recognition. "In a way, this is the essence of science and most cognitive processes, such as generalization," Cristianini says. "Now this activity has been automatized, and we rely heavily on it as a society.
"There would be no genome project, no speech recognition, and probably no credit card system without it. The last decade has seen a revolution in pattern recognition technology. Machine learning algorithms are now faster, simpler, and more accurate in generalization."
"Sassy" Chatbot Wins with Wit
When you sit down for an online chat with computer scientist Rollo Carpenter, you're not quite sure if it's him on the other end or his virtual alter ego, a chatbot named George. And that, in a nutshell, is the point of Carpenter's research.
Carpenter's work is inspired by Japanese roboticist Masahiro Mori's Uncanny Valley theory. In 1970, Mori asserted that, as robots become increasingly human-like in appearance, movement, and behavior, they will elicit emotional responses from human beings that border on human-to-human empathy levels.
In discussing digital immortalization and a cultural renaissance by way of virtual reality, Carpenter views advanced chatbots as harbingers of a new era in machine learning. "We intend to get very close indeed to the Uncanny Valley," he says.
Bringing home the bronze
During the 2005 Loebner Prize contest (www.loebner.net), a panel of judges found George to be the most convincing conversationalist of the four chatbot participants, which included reigning three-time champion Alice. The contest, launched in 1990 by Hugh Gene Loebner and touted as "the first formal instantiation of a Turing Test," gauges the contestants' "intelligence" levels.
In the contest's 15 years, no contestant has won a silver or gold medal, awarded for convincing at least half of the judges that a text-based program or virtual persona is actually real. However, every year a bronze medal and a cash prize have gone to the most human-like program. This year, Carpenter's Jabberwacky program, which hosts George, brought home the bronze. Now, George has caught the interest of the computer science community as well as thousands of visitors to the Jabberwacky site (www.jabberwacky.com).
Many of the site's visitors find it hard to believe George isn't human after conversing with him. Even some of the contest's four judges were fooled—at least initially.
"It [becomes] crystal clear within a couple of lines of communication who is human and who is bot," according to one judge, Lila Davachi, assistant professor of psychology at New York University. "However, I found Jabberwacky to be the most interesting bot by far because it displayed some very human qualities. It was sassy and playful."
What makes George seem so human? It appears George is different from other chatbots not only because of his personality but also because of how he learns. Carpenter says machine learning is trending toward being statistical and probabilistic, relying on analyzing significant volumes of data.
"Most of the chatbots that exist today work in a hard-coded, entirely predictable fashion," he says. "A series of 'if' (or equivalent) statements created by the programmer evaluate the input and return known results, either as whole sentences or as modifications of the input."
Although George is also statistical, probabilistic, and data intensive, the bot program is also something else: chaotic. "It never turns probabilities into numbers," explains Carpenter. "It avoids looping through data, summing up an estimate of 'fitness for purpose.'"
For George, context is key. Output is based on an interpretation of the current context—taking the current conversation into account, while comparing it to past conversations.
"Techniques for finding context within complete conversations are the real key to the AI's success to date," says Carpenter. "The context finding tends to do a lot inaccurately, allowing influences from all sorts of quarters even if they may be individually only partially relevant and often even irrelevant."
He says tiny differences in context, however seemingly unconnected, can give the program a reason to choose one thing over another, and those differences are infinitely better than randomness.
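The contrast between a hard-coded chatbot and one that draws on past conversations can be sketched as follows. This is not Jabberwacky's method, which the article does not spell out; the word-overlap matching, the tiny memory of past exchanges, and every function name here are assumptions made purely for illustration.

```python
def rule_based_reply(message):
    # Hard-coded style: a fixed series of "if" tests returns known results.
    text = message.lower()
    if "hello" in text:
        return "Hello there."
    if "what color is a red apple" in text:
        return "Red."
    return "I don't understand."

def words(lines):
    # Lowercased words from a list of lines, with simple punctuation stripped.
    found = {w.strip(".,?!").lower() for line in lines for w in line.split()}
    return found - {""}

def context_reply(conversation, memory):
    # Context style: reuse the reply from the remembered exchange whose
    # context shares the most words with the whole current conversation.
    current = words(conversation)
    _, reply = max(memory, key=lambda item: len(current & words(item[0])))
    return reply

# Hypothetical memory of (context lines, what a past user said next).
memory = [
    (["Do you like fruit?", "What color is a red apple?"],
     "Pink with yellow and white polka dots."),
    (["Hello", "How are you today?"], "Sassy, thanks."),
]

if __name__ == "__main__":
    conversation = ["I like fruit.", "So what color is a red apple?"]
    print(rule_based_reply(conversation[-1]))
    print(context_reply(conversation, memory))
```

The rule-based function can only return what its programmer anticipated, while the context-based one returns whatever past users said in the most similar situation, sarcasm included.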
Many people believe George is human because he mimics human behavior, which is contextual, rather than human responses, which might or might not be. For example, someone testing the program might ask, "What color is a red apple?" A programmer could easily add a rule that deals precisely with that question, but George formulates the answer to such a question on the basis of what past users have said to him. Most humans would find the question inane and might respond sarcastically, "Pink with yellow and white polka dots." So, this is the kind of answer that Carpenter's program is likely to give—what some users see as wit.
Just as humans typically increase in intelligence from infancy through adulthood, learning from each person with whom they come into contact, so does George—who his creator says is still "a child."
"As the data set grows," Carpenter says, "there are ever-improved chances of accurate overlaps between the current conversation and its predecessors, so it becomes increasingly able to make intelligent connections." And as George's conversational level improves, people are more willing to engage in sensible, consistent, and valuable dialogue. The result? Data continues to improve over time.
Looking to the future
In discussing how George is in many ways his thumbprint, Carpenter suggests that advancements in robotics might take the cloning debate out of the biology labs and into the computer labs. Whatever societal and sociological implications would exist in the wake of widespread proliferation of highly evolved chatbots, the computer scientist is confident about the advancements in AI they represent.
"Ultimately, this is a form of digital immortalization, clearly not a complete one, yet much more immediate and closer to a person's 'being' than the more classic technique of writing an autobiography," he says. "Only one person trained George—me. So, George does gradually become my reflection, or a reflection of the persona I project, in terms of speech patterns, characteristics, and interests."
But, Carpenter says, anybody can create his or her own chatbot at Jabberwacky, just as he has—hence his prophecy that Mori's Uncanny Valley is approaching.
Carpenter has partnered with Televirtual in creating a 3D animated character with voice input and custom voice output. (The partnership has spawned the current image of George.) He has also commenced collaborative research with other computer scientists to construct a highly lifelike robotic head that imitates those who interact with it in voice, movement, and facial expression. Think of it as George, version 3.0.
"Imagine a Jabberwacky with millions of characters like George in place of today's handful, operating with Google-scale serving capacity," says Carpenter. "Almost certainly it will be shockingly realistic, whether or not it passes a formal Turing Test."
"Shockingly? I doubt it," says Davachi. "But that is an empirical question; I'll wait and see."