Issue No. 06 - November/December (2001 vol. 16)
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/5254.972064
Biology has rapidly become a data-rich, information-hungry science because of recent massive data generation technologies. Our biological colleagues are designing more clever and informative experiments because of recent advances in molecular science. These experiments and data hold the key to the deepest secrets of biology and medicine, but we cannot fully analyze this data due to the wealth and complexity of the information available. The result is a great need for intelligent systems in biology.
There are many opportunities for intelligent systems to help produce knowledge in biology and medicine. Intelligent systems probably helped design the last drug your doctor prescribed, and they were probably involved in some aspect of the last medical care you received. Intelligent computational analysis of the human genome will drive medicine for at least the next half-century.
Even as you read this, intelligent systems are working on gene expression data to help understand genetic regulation and ultimately the regulated control of all life processes including cancer, regeneration, and aging. Knowledge bases of metabolic pathways and other biological networks make inferences in systems biology that, for example, let a pharmaceutical program target a pathogen pathway that does not exist in humans, resulting in fewer side effects for patients. Intelligent analysis of biological sequences today produces the most accurate picture of evolution ever achieved. Knowledge-based empirical approaches currently are the most successful method known for general protein structure prediction, a problem that has been called the Holy Grail of molecular biology. Intelligent literature-access systems exploit a knowledge flow exceeding half a million biomedical articles per year. Machine learning systems exploit heterogeneous online databases whose exponential growth mimics Moore's law.
So why is this happening now? The answer depends on whether the question is philosophical or practical. Philosophically, it is the inevitable result of the great sweep of intellectual history. Practically, it is because biology is undergoing a data explosion of unprecedented magnitude.
When you look at the intellectual history of the previous century (how strange it seems to term it thus, even now), inevitably you notice that the first half was dominated by chemistry, physics, and mathematics. Quantum mechanics, relativity, and Gödel's incompleteness proof literally changed the mental world in which we live. The second half of the century, however, was dominated by biology and the computing sciences. The genetic code, recombinant organisms, the World Wide Web as an integrated entity, and an intelligent system defeating the world chess champion defined the times. Thus, computational biology sits squarely at the center of the two dominant intellectual forces of the last half-century. Within that historical necessity, the prominent role of intelligent systems is forced on them by the remarkable complexity of the underlying domain.
Biology has become an object of great computational interest because recent technological advances have enabled massive data generation in many critical areas. Both the quantity and diversity of available data are growing rapidly. Figure 1 shows the growth in molecular structures housed in the Protein Data Bank, 1 a repository for 3D biological structure data. Figure 2 shows the growth in DNA sequences housed in GenBank, 2 a repository for 1D nucleotide sequence data. Other major international biological databases are also experiencing rapid growth. Many different high-throughput data generation technologies have come online, providing large amounts of data in diverse areas: combinatorial chemistry for drug discovery, high-throughput screening for bio-assays, two-hybrid protocols for protein interactions, gene expression arrays for monitoring the protein expression of a whole cell, and so on. Add the fact that biomedical research literature contains about 11 million citations and is growing by roughly half a million papers a year, and the amount of data and information to process is staggering. It is exceeded only by the benefits promised in the knowledge we will extract from it.
We can view computer science as a collection of solutions in search of a problem, and the study of life now provides rich problems associated with rich information. The prominent role of intelligent systems arises because, as we all know from personal experience, "Sometimes life just gets complicated!" Intelligent systems are well suited to the complicated domain of biology and medicine. They are robust in the face of inherent complexity and able to extract weak trends and regularities from data. They provide models for complex processes, cope with uncertainty and ambiguity, hold the potential to bring content-based retrieval to the biomedical research literature, possess the ontological depth needed to integrate diverse heterogeneous databases, and in general aid in the effort to handle semantic complexity with grace.
In This Issue
This special issue would have been impossible without the gratifying outpouring of support from the research community involved in intelligent systems for biology. Some 200 scientists have helped produce it: we received 52 manuscripts totaling 163 authors and subjected them to a total of 177 blind reviews by 37 volunteer referees, none of whom was me or an author on any reviewed article. Due to the tremendous volume of high-quality manuscripts, a second special issue in the series is planned for March/April 2002. At all levels, this has been very much a community effort.
The research community behind this special issue is served by a vibrant and growing specialized professional society, the International Society for Computational Biology (ISCB), as well as by larger traditional societies such as the IEEE Computer Society, the ACM, AAAI, AAAS, FASEB, the Protein Society, and the Society for Mathematical Biology (I have joined them all, and suggest that you do, too). The ISCB (www.iscb.org) is an excellent contact point for intelligent system practitioners interested in biology.
The current president of the ISCB is Russ Altman, an early champion of intelligent systems in biology 3 and a leading figure in modern bioinformatics. Altman's article opens the "Perspectives" section with an insightful survey titled "Challenges for Intelligent Systems in Biology." This section closes by emphasizing the international character of the field with "The Impact of European Bioinformatics," by Alfonso Valencia, and "The Asia-Pacific Regional Perspective on Bioinformatics," by Satoru Miyano and Shoba Ranganathan. These all provide different views of the field by leading experts.
The articles that follow showcase high points from some of the most interesting and exciting research in the field today. Still, the potential role of intelligent systems is so broad—and the opportunities so great—that this small volume only presents the tip of the iceberg of today's intelligent systems in biology.
"Automatic Pattern Embedding in Protein Structure Models" describes how structural knowledge gained from protein crystals can help predict the structure and function of novel protein sequences. Protein structure prediction from sequence is a central problem of molecular biology. Protein function follows directly from structure and determines the protein's role in biomedical systems. Knowledge-based empirical approaches currently yield the best predictors. The approach here relies on patterns learned from previously seen data and combined according to a Bayesian formulation, which is a familiar architecture in intelligent systems.
"Improving Objectivity and Scalability in Protein Crystallization" brings together robots, machine vision, image analysis, case-based reasoning, and knowledge discovery in a clever and elegant system that targets the rate-limiting step in knowledge acquisition at the atomic level. Almost all of our atom-level knowledge about protein structure and its relation to function comes from x-ray diffraction through protein crystals. It is the necessary first step—getting the protein to crystallize—that most impedes this process. Crystallization often proves difficult or impossible for complex reasons that are poorly understood. This gravely limits our molecular structure knowledge. A system that could learn to produce quick, reliable high-quality protein crystals would revolutionize structural molecular biology.
"Geno2pheno: Interpreting Genotypic HIV Drug Resistance Tests" proposes a machine learning technique for predicting drug resistance in HIV therapy. HIV (through AIDS) is the fourth largest cause of death and the largest cause of productive years of life lost in the developed world and is devastating many developing regions. The article illustrates one of the many medical care settings now touched by intelligent systems. Indeed, the medical domain and medical informatics are long-standing and familiar success stories for AI, and this article continues that fine tradition.
"Toward More Intelligent Annotation Tools: A Prototype" addresses one of the most important problems in bioinformatics: how to extract high-quality information-level summary knowledge from the exponentially growing international scientific databases. This is a rich opportunity for intelligent systems. The article describes how to produce concise descriptions from a protein ID in SwissProt 4 (a repository for 1D protein sequence data). It exploits the database entry annotations that SwissProt already records and so might scale well as SwissProt continues to grow.
"A Knowledge Base for Integrated Biological Systems" uses knowledge representation methods to situate the individual protein functions into an integrated system with multiple interacting players. One ultimate goal of systems biology is to start with knowledge of the complete genome sequence, and thus all proteins encoded by that genome, and proceed automatically to a reconstruction of the biological systems that are implied. The knowledge representation here describes the protein partners and the numerous complex relationships they exhibit. The schema employ concepts long familiar to intelligent systems: classes, associations, hierarchies, an algebraic modelling language, and classification.
"Using Combinatory Categorial Grammar to Extract Biomedical Information" describes research at the intersection of bioinformatics and natural language processing. The archived biomedical literature is a treasure trove of interesting pieces of information, but its huge volume and rapid growth make it increasingly difficult to locate the information most relevant to a particular problem at hand. This article reviews recent approaches to biomedical information extraction, and presents an implemented system that uses a full-fledged natural language grammar.
"Diagnosis Systems in Medicine with Reusable Knowledge Components" looks at medicine from the general viewpoint of knowledge representation and medical informatics, two areas whose fruitful interaction has enriched both AI and medicine. Reusability in knowledge is desirable for much the same reasons it is attractive in software engineering: efficiency, reliability, and economies of scale. If the article's concepts behind reusability of knowledge components scale well and extend to other areas (as hopefully they will), they will help accelerate development of new diagnosis systems in many areas.
Two people behind the scenes deserve extra special thanks: Margaret Wyvill kept track of all the manuscripts and reviews, and Mario Espinoza was the Webmaster for the review Web site. Special thanks to Nigel Shadbolt for his vision in initiating this special issue, and also to the IEEE Intelligent Systems editorial staff for help and encouragement. Thanks to the PDB and NCBI/GenBank staff for Figures 1 and 2. My deepest debt goes to the volunteer referees, who are responsible for the high scientific quality of these pages and have my shining gratitude and devoted thanks. The following honored colleagues are all hereby named associate guest editors of this special issue: Barb Bryant, Philipp Bucher, Tim Ting Chen, David M. Cooper, Rich Cooper, John Corradi, Terence Critchlow, Steve Culp, Dan Davison, Tom Defay, Francisco M. De La Vega, Valentina Di Francesco, Paolo Frasconi, Frederique Galisson, Richard Goldstein, Harvey Greenberg, Debraj GuhaThakurta, Reece Hart, Dennis Kibler, Mark Lacy, Franz Lang, Gerald Loeffler, Satoru Miyano, Uwe Ohler, Shoba Ranganathan, Isidore Rigoutsos, Paolo Romano, Burkhard Rost, Andrey Rzhetsky, Hershel Safer, Herbert Sauro, Steffen Schulze-Kremer, Vijaya Tirunagaru, Herbert Treutlein, Iosif Vaisman, David Wild, and Tau-Mu Yi.
Richard H. Lathrop is vice-chair of undergraduate education in the Information and Computer Science Department at the University of California, Irvine. In addition to a PhD in artificial intelligence, he holds degrees in electrical engineering, computer science, and mathematics. His research interests include applying intelligent systems and advanced computation to problems in molecular biology, especially protein structure prediction, protein-DNA interactions and genetic regulation, rational drug design and discovery, bio-nanotechnology, and other molecular structure/function relationships. Contact him at ICS Dept. #3425, UCI, Irvine, CA, 92697-3425. Email firstname.lastname@example.org; www.ics.uci.edu/~rickl.