Fusing computing and biology expertise, bioinformatics software provides a powerful tool for organizing and mining the vast amounts of data genetics researchers are accumulating.
Biology has increasingly become a data-driven science. The emergence of high-throughput data acquisition technologies for investigating biological phenomena has been an important factor in this process. For example, the recent completion of the human, rice, and flowering plant Arabidopsis thaliana genome maps signifies a major milestone in data acquisition.
The development of novel algorithms and databases to catalog, organize, harness, and mine the increasing amount of data such research efforts generate has also been important. The potential for scientists to infer significant biological knowledge computationally from a desktop is both appealing and real.
In this issue, we gather perspectives, articles, and reports to showcase the emergence of bioinformatics software as a discipline in its own right. The "Molecular Biology for Computer Scientists" sidebar provides some pertinent background information.
Bioinformatics Software Systems Themes
As life scientists and computational scientists interact to create useful bioinformatics software systems, several themes or lessons recur. We identify seven themes:
• the nature of biological data;
• data storage, analysis, and retrieval;
• computational modeling and simulation;
• biologically meaningful information integration;
• data mining;
• image processing and visualization; and
• closing the loop.
Each of this issue's cover features touches on one or several of these themes.
The nature of biological data
The life sciences literature presents biological results based on carefully collected, analyzed, and vetted data that researchers can use with high confidence. Everyday bioinformatics, however, deals with raw data collected from recently completed experiments in the form of images, charts, or numbers and with sequence data collected from a wide variety of online databases. We should regard such data with some skepticism.
Any significant collection of raw experimental data includes experimental errors—systematic or random. Obtaining statistically meaningful data requires careful experimental design and replication of results. On the other hand, experiments are expensive in terms of professional labor, reagents, equipment, and time. As a consequence, biological data is always incomplete.
For science to progress, we must combine inductive reasoning based on existing biological information with new experimental results. Indeed, some biological data is inherently unknowable, such as the genomes of most extinct species. Bioinformatics software system developers must always be aware that some uncertainty exists in any results the system generates. Thus, characterizing or quantifying the uncertainty is worth consideration.
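One simple way to quantify the uncertainty in a small set of replicate measurements is a percentile bootstrap confidence interval. The sketch below is illustrative only: the replicate values, the function name `bootstrap_ci`, and the choice of the mean as the statistic are assumptions, not something prescribed by any article in this issue.

```python
import random
import statistics


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    means = []
    for _ in range(n_resamples):
        # Resample the replicates with replacement and record the mean.
        resample = [values[rng.randrange(n)] for _ in range(n)]
        means.append(statistics.fmean(resample))
    means.sort()
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high


# Hypothetical replicate measurements of one quantity (arbitrary units).
replicates = [2.1, 1.8, 2.4, 2.0, 2.6, 1.9]
low, high = bootstrap_ci(replicates)
print(f"mean = {statistics.fmean(replicates):.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

With only a handful of replicates the interval is wide, which is the kind of uncertainty a system can report alongside a point estimate.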
Data storage, analysis, and retrieval
The high-volume, data-driven nature of modern experimental biology has led to the creation of many databases that contain genomes, protein sequences, gene expression data, and other data types. Researchers often key their retrieval of information from such databases primarily on one characteristic, such as the nucleotide or amino acid sequence, organism, gene annotation, or protein name.
Answering queries often involves some form of data analysis, such as statistical significance, clustering, or sequence homology search. The Basic Local Alignment Search Tool (BLAST) is typically the first bioinformatics tool a biologist uses when examining a new DNA or protein sequence. BLAST compares the new sequence to all sequences in the database to find the most similar known sequences.
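To make the idea of similarity search concrete, the sketch below scores a query against every database sequence with Smith-Waterman local alignment and ranks the hits. This is not how BLAST itself works: BLAST uses faster heuristics and reports statistical significance, whereas this exhaustive dynamic program only illustrates what "most similar" can mean. The scoring parameters, the function names, and the toy database are assumptions.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            score[i][j] = max(0, diag, up, left)  # 0 lets an alignment restart
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


def rank_database(query, database):
    """Score `query` against each named sequence; highest score first."""
    hits = [(smith_waterman(query, seq)[0], name) for name, seq in database.items()]
    return sorted(hits, reverse=True)


# Hypothetical toy database of named DNA sequences.
database = {"seq_x": "TTACGTTT", "seq_y": "GGGG"}
ranking = rank_database("ACGT", database)
print(ranking)
```

The quadratic cost per comparison is the reason production tools trade exactness for speed on large databases.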
Computational modeling and simulation
Alongside experimental data generation, computational simulation plays a central role in understanding many biological processes. For example, researchers can study processes such as cell division by modeling reaction networks as a set of simultaneous differential equations. 1
They can then use tools from numerical and scientific computing to address questions such as "At what rate does this enzyme catalyze cell division?" The 1 March 2002 issue of Science showcases major developments in systems biology, many of which rely on simulation.
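As a minimal sketch of treating a reaction network as simultaneous differential equations, the code below integrates a generic enzyme scheme, E + S <-> ES -> E + P, with a fourth-order Runge-Kutta step. The scheme, the rate constants, and the initial concentrations are all assumptions chosen for illustration. They do not model cell division or any system from the cited work.

```python
def derivatives(state, k_on, k_off, k_cat):
    """Mass-action rates for E + S <-> ES -> E + P."""
    e, s, es, p = state
    bind = k_on * e * s      # E + S -> ES
    unbind = k_off * es      # ES -> E + S
    convert = k_cat * es     # ES -> E + P
    return (
        -bind + unbind + convert,  # d[E]/dt
        -bind + unbind,            # d[S]/dt
        bind - unbind - convert,   # d[ES]/dt
        convert,                   # d[P]/dt
    )


def rk4_step(state, dt, *rates):
    """Advance the state by one classical Runge-Kutta step of size dt."""
    def shifted(base, slope, h):
        return tuple(x + h * dx for x, dx in zip(base, slope))

    k1 = derivatives(state, *rates)
    k2 = derivatives(shifted(state, k1, dt / 2), *rates)
    k3 = derivatives(shifted(state, k2, dt / 2), *rates)
    k4 = derivatives(shifted(state, k3, dt), *rates)
    return tuple(
        x + dt / 6 * (a + 2 * b + 2 * c + d)
        for x, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def simulate(state, t_end, dt, k_on=1.0, k_off=0.5, k_cat=0.3):
    """Return the list of states from t = 0 to t_end (assumed rate constants)."""
    trajectory = [state]
    for _ in range(int(round(t_end / dt))):
        state = rk4_step(state, dt, k_on, k_off, k_cat)
        trajectory.append(state)
    return trajectory


# Hypothetical initial concentrations: (E, S, ES, P).
trajectory = simulate((1.0, 10.0, 0.0, 0.0), t_end=50.0, dt=0.01)
e, s, es, p = trajectory[-1]
print(f"product formed by t = 50: {p:.3f}")
```

Total enzyme (E + ES) and total substrate material (S + ES + P) should stay constant throughout the run, which gives a quick sanity check on the integration.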
Biologically meaningful information integration
A tremendous quantity of valuable heterogeneous biological information can be found online, and its continued rapid growth is all but certain. Examples of information sources include genomic sequences, gene sequences, expressed sequence tags, protein sequences, microarray experiment images and raw data sets, 2D protein gels, protein domains, and the literature of genetics, biochemistry, and molecular biology.
Researchers cannot answer the cutting-edge questions in biology with information from just two or three of these sources. A global database that integrates all these sources for all purposes is a pipe dream, given that we cannot predict in advance the myriad needs of biologists for accessing the information. However, resources dedicated to restricted domains—for example, BioCarta ( http://www.biocarta.com/), an encyclopedia of regulatory pathways, and CyanoBase ( http://www.kazusa.or.jp/cyano), a Web resource for cyanobacterial research—have recently become available. Thus, bioinformatics software systems increasingly face the task of integrating information from diverse sources. Further, their developers must address the challenge of obtaining biologically meaningful and useful results from those many sources.
Data mining
Bioinformatics systems benefit from the use of data mining strategies to locate interesting and pertinent relationships within massive collections of information. For example, data mining methods can ascertain and summarize the set of genes responding to a certain level of stress in an organism. Researchers can use graphical models such as Bayesian networks and relational algorithms such as inductive logic programming to mine such gene sets and model a gene expression network. Even a cursory glance through the literature in journals such as Bioinformatics reveals the persistent role of data mining in experimental biology. Integrating data mining within the context of experimental investigations is central to bioinformatics software.
Image processing and visualization
Many results in experimental biology first appear in image form—a photo of an organism, cells, gels, or microarray scans. As the quantity of these results accelerates, automatic extraction of features and meaning from experimental images becomes critical.
At the other end of the data pipeline, naive 2D or 3D visualizations alone are inadequate for exploring bioinformatic data. Biologists need a visual environment that facilitates exploring high-dimensional data dependent on many parameters.
Closing the loop
Bioinformatics software must support the strongly iterative and interactive nature of biology research. Biologists typically revise and redesign experiments based on results from previous experiments. Providing feedback to earlier stages of an experiment based on downstream data—for example, to reorganize a microarray layout or alter dye concentrations—is central to improving the efficiency of biological investigation.
The best bioinformatics software systems address the problems in bioinformatics by combining
• insight into the computational complexities of those problems, and
• sophisticated knowledge of biology and current experimental technologies.
In this spirit, Mihai Pop, Steven L. Salzberg, and Martin Shumway present a tantalizing view of the development of sequence assembly systems for entire genomes. These systems must recognize the strengths and drawbacks of experimental technology and the challenging nature of real genomes, such as the existence of long tandem repeats.
In "Genome Sequence Assembly: Algorithmic Issues and Practical Considerations," the authors describe the techniques brought to bear on the computational issues of sequence assembly, including the theory of computation as it applies to NP-hardness, graph theory, and combinatorial algorithms; quality-assessment statistics; and heuristics that support tradeoffs among genome coverage, the number of misassemblies, and the number of contigs, where a contig is a group of overlapping regions of a genome. Sequence assembly is typical of most bioinformatics problems in that the correct answer is perhaps either unknowable or only obtainable by paying the high cost of manual intervention.
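To give a feel for the kind of heuristic involved, the following sketch merges reads greedily by longest suffix-prefix overlap until no overlaps remain, leaving one or more contigs. It is a textbook-style illustration and not the method of the article's authors: it assumes error-free reads from one strand, and the minimum overlap length, function names, and reads are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def drop_contained(seqs):
    """Remove duplicates and sequences wholly contained in another sequence."""
    unique = list(dict.fromkeys(seqs))
    return [r for r in unique if not any(r != o and r in o for o in unique)]


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair with the longest overlap; return the contigs."""
    contigs = drop_contained(reads)
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # no remaining overlaps: these are the contigs
        merged = contigs[best_i] + contigs[best_j][best_k:]
        rest = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs = drop_contained(rest + [merged])


# Hypothetical error-free reads sampled from a short sequence.
reads = ["ATGGCGT", "GCGTACG", "TACGTTA"]
contigs = greedy_assemble(reads)
print(contigs)
```

Because the greedy choice always takes the longest overlap, repeated regions can lead it to join reads that do not belong together. That failure mode is one reason practical assemblers weigh coverage, misassemblies, and contig count against one another.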
The Assembling the Tree of Life (ATOL) project challenges biologists and computer scientists to use sequence and other biological information to determine the evolutionary relationships among existing species. ATOL further challenges researchers to represent these relationships in a tremendous evolutionary, or phylogenetic, tree. In "Toward New Software for Computational Phylogenetics," Bernard M.E. Moret, Li-San Wang, and Tandy Warnow present significant insights into the nature of ATOL's computational challenges and describe their progress in addressing those challenges.
Algorithms for reconstructing phylogenetic trees from sequences must scale to very large sequence sets. Moret and colleagues have thus developed disk-covering algorithms as a means for accomplishing scaling. High-performance algorithm engineering uses algorithmic and implementation savvy to produce highly efficient applications for challenging computational problems. The authors' Genome Rearrangement Analysis using Parsimony and other Phylogenetic Algorithms (GRAPPA) is a system that provides an excellent example of high-performance engineering and can teach valuable lessons to bioinformaticians.
In "BioSig: An Imaging Bioinformatic System for Studying Phenomics," Bahram Parvin and colleagues describe a system for archiving and interpreting microscopic images of small groups of cells drawn from mice and treated with ionizing radiation. Part of interpreting such an image involves segregating organelles, such as the nuclei, from the remainder of the image. Even smaller features, such as the chromatin—the chromosomes and associated proteins present in the nucleus when the cell is not reproducing—should also be identified as distinct from noise. The BioSig system for cell phenomics—the visual characteristics of a control or treated cell—can ultimately lead to predicting the effects of ionizing radiation in other mouse—and human—organs.
Meeting long-term bioinformatics goals requires reconciling information collected from studying biological phenomena at multiple scales and using multiple modes of investigation. For example, researchers can study biological processes at the DNA, mRNA, protein, enzyme, pathway, reaction network, or physiology levels. Each level gives a different view of the underlying mechanism, but together they help establish the basis for answering biological questions.
In "A Random Walk Down the Genomes: DNA Evolution in Valis," Bud Mishra and colleagues present the Valis system, which prototypes bioinformatics applications, and describe its use for studying cellular events in relation to DNA sequence evolution. This article, which comes closest to our closing-the-loop theme, also describes a sophisticated modeling of sequence evolution. The authors cover the software design in detail and also describe their system's modeling and computational capabilities. Valis-like systems have the potential to model large-scale genomic processes, a Grand Challenge for bioinformatics.
BioSig and Valis constitute integrated problem-solving environments 2 for bioinformatics applications. These software systems provide all the computational facilities necessary for solving a target class of problems.
From performing functional studies on genes or gene families in isolation, research has progressed to studying all the genes in a given organism simultaneously. Microarray bioinformatics has aided in this massive parallelization of experimental biology.
In "Interactively Exploring Hierarchical Clustering Results," Jinwook Seo and Ben Shneiderman present an interactive visualization system for investigating data from microarray experiments. The authors introduce the basics of microarray technology for nonspecialists and describe how to achieve user-driven interactive data exploration in relation to a hierarchical clustering algorithm. Interactive exploration and visualization will become increasingly important as the dimensionality and diversity of the underlying data increase.
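For readers unfamiliar with hierarchical clustering, the sketch below builds an average-linkage merge history over a few expression profiles and then cuts it at a chosen distance threshold, which is the kind of parameter a user might adjust interactively. It does not reproduce the authors' system: the gene names, expression values, distance measure, and function names are assumptions.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def average_linkage(c1, c2, profiles):
    """Mean pairwise distance between members of two clusters."""
    total = sum(euclidean(profiles[a], profiles[b]) for a in c1 for b in c2)
    return total / (len(c1) * len(c2))


def agglomerate(profiles):
    """Return the merge history as [(distance, cluster_a, cluster_b), ...]."""
    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of current clusters.
        d, i, j = min(
            (average_linkage(clusters[i], clusters[j], profiles), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        merges.append((d, clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


def cut(profiles, merges, threshold):
    """Clusters obtained by applying only the merges below `threshold`."""
    clusters = {(name,) for name in profiles}
    for d, a, b in merges:
        if d >= threshold:
            break  # average-linkage merge distances never decrease
        clusters -= {a, b}
        clusters.add(a + b)
    return sorted(sorted(c) for c in clusters)


# Hypothetical expression profiles (one value per experimental condition).
profiles = {
    "geneA": [1.0, 2.0, 3.0],
    "geneB": [1.1, 2.1, 2.9],
    "geneC": [5.0, 0.5, 0.2],
    "geneD": [5.2, 0.4, 0.1],
}
merges = agglomerate(profiles)
print(cut(profiles, merges, threshold=1.0))
```

Raising or lowering `threshold` moves between a few coarse clusters and many fine ones without recomputing the merge history, which is what makes this structure convenient for interactive exploration.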
In addition to these theme articles, this issue of Computer also features two perspective articles on bioinformatics. In "Computers Are from Mars, Organisms Are from Venus," Junhyong Kim examines the new relationships between biology and computer science. Kim identifies benefits that each field can draw from the other, as well as the challenges in interdisciplinary research. The essay is an excellent starting point for new researchers, and it provides the necessary background to appreciate the five theme articles.
In "The Blueprint for Life?" Dror G. Feitelson and Millet Treinin challenge the assertion that DNA encodes everything needed to understand life. The authors examine how information is interpreted, transported, and communicated in biological systems, and conclude that there is more about these processes than is encoded in DNA.
We received 20 submissions for this special issue, which we forwarded to reviewers from both biology and computer science backgrounds. We thank these reviewers for their timely responses and the authors for responding quickly to the reviewers' comments and suggestions.
Because this issue does not aim to provide comprehensive coverage of bioinformatics software, it does not address some important biological problems such as pathway modeling. 3 Likewise, it excludes several novel computational techniques, especially from data mining. The content does, however, reveal a glimpse into the richness of bioinformatics software, providing a snapshot of its continuing evolution.
Lenwood S. Heath is an associate professor of computer science at Virginia Tech. His research interests include algorithms, theoretical computer science, graph theory, bioinformatics, computational biology, and symbolic computation. Heath received a PhD in computer science from the University of North Carolina at Chapel Hill. He is a member of the IEEE, the ACM, SIGACT, and SIAM. Contact him at firstname.lastname@example.org.
Naren Ramakrishnan is an assistant professor of computer science at Virginia Tech. His research interests include problem-solving environments, mining scientific data, and personalization. Ramakrishnan received a PhD in computer sciences from Purdue University. He is a member of the IEEE Computer Society, the ACM, and the AAAI. Contact him at email@example.com.