Viewed through a microscope, protein molecules can resemble a pile of tangled fishing line. Yet their complex arrangement is precisely what allows proteins to catalyze the untold billions of chemical reactions needed to sustain life. Now, some scientists armed with AI techniques such as pattern recognition and an extension of Lisp believe they've found a new way to map and even create proteins along with other biological constructs in cells. They do this by employing rules akin to the grammar and syntax used to decipher human language—and computer code as well.
The work goes back several decades, with papers published by bioresearchers such as Howard Pattee, David Searls, and others. But it has seen a modest revival recently, as groups of scientists have successfully demonstrated tools able to interpret proteins and genes and even create new ones. Decades from now, language-based models could yield a vast repository of new medical treatments and lead to an era when we design new proteins as readily as we write down sentences.
"With human language, there are two articulations, called the first and second articulations," explains Sungchul Ji, an associate professor in the Department of Pharmacology and Toxicology at Rutgers University's Ernest Mario School of Pharmacy. That is, sentences are made of words (the first articulation), which in turn are made up of letters (the second articulation). "Similar things happen in cell language," Ji says. "Cells seem to make molecular sentences out of a set of molecules arranged in space and time and molecules, of course, out of atoms."
Just as it's often possible to still understand a sentence after words have been deleted or switched around, Ji says, protein molecules often can continue to catalyze the same chemical reactions inside an organism when they get rearranged. Likewise, changing the order of the atoms in a molecule can alter the molecule's function, in the same way that changing the spelling of words in a sentence can alter their meaning.
Of course, you can't push the analogy between spoken and molecular languages too far, Ji and others working in the field will hasten to add. For all its nuances, human language is essentially linear. In contrast, proteins' components can influence any number of other components in the same chain depending on how the chain is folded. That is, while the relationship between words is linear, the relationship among a protein's components can be multidimensional. Also, once words get written down into a sentence, they become static. That's hardly the case with protein molecules, which routinely reconfigure themselves whenever they encounter a stimulus—such as heat—or other bodies in a cell.
Even so, the language paradigm has been found to work not just with proteins but with smaller biological constructs as well. For example, Isidore Rigoutsos, manager of the Bioinformatics and Pattern Discovery Group at IBM's T.J. Watson Research Center, has applied the paradigm in his research on antimicrobial peptides. These are small chains of 20 to 30 amino acids that form part of the immune systems of all organisms, protecting them from harmful bacteria.
Antimicrobial peptides "can attack the target without attacking the host. And they are very quick in their response," says Rigoutsos, who teamed up on the project with MIT chemical engineering professor Gregory Stephanopoulos and former PhD student Kyle Jensen.
That deadly ability works something like this. The peptide chains possess a positive charge at both ends. The positive charge draws them toward a bacterium's membrane, which naturally exhibits a pattern of negative charges across its surface. When the peptide comes in contact with the bacterium, it "effectively rips the membrane apart," thus killing the cell, Rigoutsos explains.
Little wonder some researchers believe antimicrobial peptides could become a prized alternative to antibiotic drugs in the fight against drug-resistant staph and other killer bacteria.
In theory, you could design a peptide that's particularly adept at killing staph. However, the traditional methods for finding or building peptides involve altering known peptides or testing random samples in the lab. Both methods would prove expensive and time consuming. Peptides contain 20 or more amino acids, which can result in some 10²⁶ possible sequences.
In search of a shortcut, Rigoutsos and his colleagues looked at common sequences—called the grammar—contained in known antimicrobial peptides.
Using these known antimicrobial peptides as a training set, they generated random examples of chains containing 20 or so amino acids to see whether they might produce antimicrobial peptides. Some random chains came close to existing antimicrobial peptides. For the purposes of the experiment, the team disregarded these candidates.
However, the algorithm created 40 previously unknown peptide sequences. "These were brand-new amino acid sequences, although there is the possibility that they may yet be discovered in nature," Rigoutsos says. His team synthesized these in the lab. Of the 40 synthesized peptides that they tested, 18 demonstrated an ability to attack bacteria.
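The team's approach can be pictured as a two-step process: mine a "grammar" of short sequence patterns common to known antimicrobial peptides, then generate candidate sequences and keep only those that obey the grammar. The following sketch is purely illustrative; the motif length, thresholds, and scoring are invented assumptions, not IBM's actual pattern-discovery algorithm.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def extract_motifs(known_peptides, k=3):
    """Collect every length-k subsequence shared by at least two known peptides."""
    counts = {}
    for pep in known_peptides:
        for i in range(len(pep) - k + 1):
            motif = pep[i:i + k]
            counts[motif] = counts.get(motif, 0) + 1
    return {m for m, c in counts.items() if c >= 2}

def generate_candidates(motifs, n=1000, length=20, min_hits=2, seed=0):
    """Propose random sequences and keep those matching the mined 'grammar'."""
    rng = random.Random(seed)
    hits = []
    for _ in range(n):
        seq = "".join(rng.choice(AMINO_ACIDS) for _ in range(length))
        if sum(1 for m in motifs if m in seq) >= min_hits:
            hits.append(seq)
    return hits
```

In the actual work, candidates that merely echoed known peptides were discarded, leaving only genuinely novel sequences for synthesis and lab testing.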
Better still, the peptides Rigoutsos and his team created were able to attack some of the deadliest known bacteria, such as Bacillus anthracis (anthrax) and Staphylococcus aureus, the frequent culprit behind drug-resistant hospital-borne infections.
Meanwhile, working at Harvard Medical School, computer scientist and cell biologist Aneil Mallavarapu and his colleague Jeremy Gunawardena, director of the school's Virtual Cell Program, have focused on kinases, proteins that regulate vital activities in a cell. Kinases do this by transporting phosphates—which act as chemical messages—to other cell proteins.
Owing to kinases' complexity, trying to understand how they work would be nigh impossible via traditional means. As Mallavarapu explains: "It's not clear looking at the ball-and-stick diagrams that biologists tend to draw, what the behavior is going to be. Not only do these very complex interactions happen laterally, there are also feedback loops. And it's very difficult for humans to reason about feedback loops."
Accounting for all the interactions of kinases can quickly lead to big, complex mathematical systems involving thousands of variables. "If you try to write these things in the language of ordinary mathematics, you have this tremendous ball of spaghetti," he says. "If you want to change your model, you have to rewrite everything from scratch."
As a workaround, Mallavarapu wrote a computer language called Little b that's based on the time-proven AI language Lisp. "In Lisp, code is represented as data," he says. "This means you can write code which generates code, a technique which makes it easy to write high-level domain-specific languages (like Little b) inside Lisp."
Mallavarapu's language extends Lisp in a way that allows protein molecules to be represented as discrete modules. But significantly, the modules themselves incorporate a plethora of data describing their characteristics. So, the language enables researchers to understand how groups of kinases function on the basis of their structure and how altering that structure would affect their behavior. "We represent molecules as graphs and build molecular complexes by combining graphs using pattern-matching operations which represent biochemical reactions," Mallavarapu says.
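The core idea of representing molecules as modules with rich internal state, and reactions as pattern-matching rewrites on them, can be sketched in a few lines. This toy is not Little b (which is a Lisp-based language), and the protein and site names below are invented for illustration.

```python
# Toy version of the idea: molecules as small graphs of named sites,
# reactions as pattern-matching rules that rewrite matching sites.

def make_protein(name, sites):
    """A molecule as a tiny graph: named site nodes, each carrying state."""
    return {
        "name": name,
        "sites": {s: {"phosphorylated": False, "by": None} for s in sites},
    }

def phosphorylate(kinase, substrate, site):
    """A 'reaction' rule: if the site pattern matches (unmodified), rewrite it."""
    s = substrate["sites"].get(site)
    if s is not None and not s["phosphorylated"]:
        s["phosphorylated"] = True
        s["by"] = kinase["name"]  # record which kinase delivered the phosphate
        return True
    return False  # pattern did not match; no reaction fires
```

Because each rule only tests a local pattern, adding a new interaction means adding a new rule rather than rewriting the whole model, which is the "ball of spaghetti" problem Mallavarapu describes.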
Jean Peccoud, an associate professor at Virginia Tech's Bioinformatics Institute, is using a language-based approach that he hopes will one day help researchers find cheaper or more sophisticated ways of producing proteins. Perhaps, he says, "engineers may be interested in specifying a target behavior and exploring the design space to find one design implementing it."
To make that happen, he and his colleagues have devised context-free grammars (CFGs), segments of DNA that—thanks to his methodology—can be analyzed or constructed as readily as developers work with computer code. "Right now, the grammars that we have could be compared to the second generation of programming languages," Peccoud says. "We could imagine developing a description language that could be compiled for different target organisms like C can be compiled for different compilers."
Figure 1. A template of genetic switches from the GenoCAD Web site (www.genocad.org). From this template, users can quickly generate different switches by selecting genetic parts listed under the symbols D, B, and E. The site can generate different switches in seconds without any risk of operator errors, whereas it would take much longer using traditional error-prone bioinformatics tools.
Peccoud believes CFGs can be stored in vast libraries that scientists looking to fabricate new drugs can access as readily as we do Google word searches. He and his colleagues have set up the GenoCAD Web site (www.genocad.org) to illustrate their research. Scientists visiting GenoCAD can assemble sequences or use public-domain templates others have built as a starting point. The scientists can then validate and test the construct against a designated set of grammatical rules. Alternately, a researcher can upload a sequence assembled elsewhere to see whether it's valid, in the same way software engineers upload code for testing. "The technology to synthesize a genome is now available," Peccoud explains. "But people are still representing genomes as a series of bases. We would like to redesign the yeast genome using a CFG. This could lead to a number of applications in metabolic engineering and biofuel production."
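Validating a construct against grammatical rules works like parsing in any context-free language: the categories of the parts must be derivable from the grammar's start symbol. A minimal sketch follows; the grammar here (a cassette as promoter, one or more open reading frames, then a terminator) is an invented simplification, not GenoCAD's actual rule set.

```python
# Each rule maps a category to one or more productions (lists of symbols).
# Symbols absent from the grammar are terminals: concrete part categories.
GRAMMAR = {
    "cassette": [["promoter", "orf_list", "terminator"]],
    "orf_list": [["orf", "orf_list"], ["orf"]],
}

def valid(symbols, target="cassette"):
    """True if the sequence of part categories derives from `target`."""
    def derive(goal, toks):
        # Return every possible remainder after deriving `goal` from `toks`.
        if goal not in GRAMMAR:              # terminal: must match next token
            return [toks[1:]] if toks and toks[0] == goal else []
        remainders = []
        for production in GRAMMAR[goal]:
            states = [toks]
            for sym in production:           # thread remainders through the rule
                states = [r for s in states for r in derive(sym, s)]
            remainders.extend(states)
        return remainders
    return any(rest == [] for rest in derive(target, list(symbols)))
```

A designed construct such as `["promoter", "orf", "orf", "terminator"]` parses, while one missing its promoter is rejected—exactly the kind of automatic check that replaces error-prone manual inspection of raw base sequences.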
GenoCAD, which is still in beta, might not represent desktop genetic design, but it offers a tantalizing vision of how language-based models might be used in the future. "The ultimate vision," says Peccoud, "would be to have a unified linguistic description of synthetic DNA sequences that would span different levels of organization within the genomes from the bases to high-level patterns." Call it Peccoud's Unabridged Dictionary to the Language of Life.
Every day in thousands of cities across the planet, millions of cars creep along in traffic, wasting fuel, spewing carbon, and adding to urban stress levels. By some estimates, daily traffic accounts for a third of humanity's energy consumption.
Common sense would tell you that simply timing traffic lights better could do a world of good. And AI researchers have been trying to do just that for decades, armed with tools such as fuzzy logic, evolutionary algorithms, and reinforcement learning. Yet the problem, as Dirk Helbing, professor of sociology at the Swiss Federal Institute of Technology (ETH) Zürich, points out, is that even supercomputers can't handle the number of variables involved with optimizing traffic lights—even in small cities.
But a growing fraternity of mostly young AI researchers are taking a fresh look at improving urban traffic flow, using autonomous agents and other distributed methods to get around the processing problem. "Compared to object recognition, for example, the field of AI-based traffic control is indeed quite small," admits Marco Wiering, assistant professor of artificial intelligence at the University of Groningen in the Netherlands. "Although, I see more and more papers on this topic being written."
Carlos Gershenson, a postdoctoral fellow at the New England Complex Systems Institute and Vrije Universiteit Brussel, began thinking about traffic light systems four years ago. While seated in a cab in the capital city of his native Mexico, Gershenson watched as the lights ahead progressively changed from red to green. Traffic systems designers call this the "green wave," a timed sequence meant to synchronize traffic flows with light changes, letting cars proceed at a steady speed. Gershenson realized that the green wave before him was poorly timed and was actually making the congestion worse.
Sometime later he hit upon an interesting notion: "Most engineering tries to optimize a solution to a problem," he says. But optimizing traffic flows is especially tough because the variables change constantly. "Even if you have perfect information showing the position of all the cars, if you try to predict where those cars will be in one minute you can't. Because a pedestrian jaywalks, or the car ahead or behind brakes," Gershenson says. The difficulties increase all the more if you attempt to tie several traffic lights together, let alone a city's entire traffic light system.
Given the enormous number of potential, unexpected wild cards, optimizing how traffic lights turn from red to green simply wouldn't work, he realized. "By the time you get an optimal solution, the situation has changed," he says. A better way to make traffic lights work more efficiently would be to allow each light to adapt to traffic patterns on its own, Gershenson thought. So he devised a rule-based AI program that operated on an elegantly simple premise. "The street with more cars gets more green-light time. This tends to make cars travel in platoons. So platoons travel from light to light without having to stop."
The system would be scalable, he says, because the individual traffic lights aren't linked to each other. The only hardware requirement would be some kind of sensor at each intersection to measure the number of oncoming vehicles.
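The self-organizing rule can be captured in a few lines: each red direction accumulates "pressure" from its waiting and approaching cars, and the light switches once that pressure crosses a threshold, so busier streets earn longer greens. This sketch simplifies Gershenson's scheme; the threshold value and the pressure bookkeeping are assumptions for illustration.

```python
def update_light(state, ns_cars, ew_cars, threshold=10):
    """
    One tick of a single, independent traffic light.
    state:   dict with "green" ("NS" or "EW") and accumulated "pressure"
    ns_cars: cars waiting/approaching on the north-south street this tick
    ew_cars: same for the east-west street
    """
    # Cars held at red add pressure every tick they keep waiting.
    state["pressure"] += ew_cars if state["green"] == "NS" else ns_cars
    if state["pressure"] >= threshold:
        # Enough demand has built up on the red street: switch.
        state["green"] = "EW" if state["green"] == "NS" else "NS"
        state["pressure"] = 0
    return state["green"]
```

Because each intersection consults only its own sensor and never talks to its neighbors, adding a thousand more lights adds no coordination cost—the platoons that emerge do the coordinating.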
So far, Gershenson, working with his student colleague Seung Bae Cools, has only tested his idea via simulation. But when he compared his virtual results with actual traffic data from the streets of Brussels, he found that setting up his system with just 10 traffic lights would cut waiting time by 50 percent, which amounts to 25 percent of the average driver's total travel time. What's more, fuel savings could equal half a million euros per year. Little wonder the Belgian region of Flanders is one of several localities taking a look at Gershenson's idea.
At Utrecht University, Wiering is working on a system in which drivers communicate their destination to a computer linked to the city's traffic light grid. Using a reinforcement-learning algorithm, the computer times the traffic lights in a way that enables each car to reach its destination at the fastest possible rate. The system—which, like Gershenson's, has yet to be tested on the street—first estimates the time a car will take to reach its stated destination if it encounters no red lights and how long the trip would take if the car encounters a red light at each intersection. Then, the system adjusts the city's lights to minimize each car's estimated travel time.
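The two estimates that anchor Wiering's system—best case with no red lights, worst case with a red at every intersection—bound each car's travel time, and the controller then favors whichever green setting saves the most aggregate time. The sketch below only illustrates those bounds and a greedy tie-breaker; the real system uses reinforcement learning, and every number here is an invented assumption.

```python
def travel_time_bounds(distance_m, speed_ms, lights_on_route, wait_per_red_s):
    """Bound a car's travel time between zero red lights and all red lights."""
    best = distance_m / speed_ms                      # hits every green
    worst = best + lights_on_route * wait_per_red_s   # stops at every light
    return best, worst

def choose_green(cars_per_direction, expected_wait_s):
    """Greedy stand-in for the learned policy: favor the direction whose
    waiting cars collectively stand to lose the most time."""
    return max(cars_per_direction,
               key=lambda d: cars_per_direction[d] * expected_wait_s[d])
```

A learned policy would refine these estimates from experience rather than fixed per-light waits, but the objective—minimizing each car's estimated travel time toward its stated destination—is the same.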
Compelling drivers to communicate their destination might have seemed far-fetched until recently. In-car navigation devices have become commonplace, and some already alert drivers about congested areas. Meanwhile, some cities have devised novel ways to limit congestion. London, for example, became the first major city to charge drivers wishing to enter its central regions, a trend other traffic-plagued, revenue-starved urban areas might emulate.
Years from now, autonomous control systems could replace navigation systems. Cities might even require drivers to surrender control of their vehicles when entering the downtown core. Anticipating that day, Kurt Dresner, a PhD candidate at the University of Texas (UT), together with Peter Stone, an associate professor at UT's computer science department, devised a system in which autonomously driven vehicles reserve space at upcoming intersections. This lets cars take up all the available space within an intersection, as the system assigns them specific positions (referred to as tiles) in it, along with time slots.
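The reservation mechanism amounts to space-time bookkeeping: a car asks for the exact tiles it will occupy at exact time slots, and the intersection grants the request only if none of those tile–slot pairs is already taken. The class below is a bare sketch under that assumption; the real system's tile geometry, timing granularity, and retry protocol are more involved.

```python
class IntersectionManager:
    """Grants cars exclusive (tile, time_slot) reservations in an intersection."""

    def __init__(self):
        self.reserved = set()   # (tile, time_slot) pairs already granted

    def request(self, path):
        """path: list of (tile, time_slot) pairs the car would occupy.
        All-or-nothing: grant only if every pair is free."""
        if any(cell in self.reserved for cell in path):
            return False        # conflict: car must slow down and retry later
        self.reserved.update(path)
        return True
```

Two cars crossing on perpendicular paths rarely need the same tile at the same instant, so both requests succeed and neither car stops—which is why simulated intersections can run without lights at all.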
Under this plan, red lights and stop signs would disappear. Instead, the autonomous cars could proceed nonstop in four directions at once, at speeds no human driver could match. The researchers' simulations (www.cs.utexas.edu/~kdresner/aim/oldsim) might seem frightening as you watch virtual cars whiz by each other. But statistics paint an even more frightening picture of urban congestion today. Besides the deadly accidents our society tolerates, the average time each of us spends in traffic amounts to 46 hours annually, based on Texas A&M University research. Put another way, that's the equivalent of thousands of lifetimes.