Issue No. 03 - May/June (1999 vol. 1)
Over the past 10 to 15 years, numerous technical innovations have revolutionized molecular biology. These include the development of high-throughput DNA amplification techniques, improved DNA sequencing methods, more efficient procedures for determining protein structures, and new technologies for monitoring cellular processes at a global level. These breakthroughs have led to an explosion in the variety and quantity of data routinely generated in molecular biology research.
Previous data-storage and analysis methods, often no more sophisticated than spreadsheets or visual inspection, do not scale well to the new volume of data. Furthermore, most laboratory procedures have a non-zero failure rate, and the biological processes being monitored are inherently noisy. Developing analytical tools robust enough to deal with the problems of incorrect, noisy, or missing data has become a high priority. Biologists are taking increasing advantage of computer and computational science to help meet these analytic and data-management needs.
For this special issue, we have included discussions of several representative areas of current research in molecular biology. Computation has already impacted all of these fields; however, many opportunities remain in which computational scientists can make significant contributions. These articles also illustrate some of the many ways that computational biologists grapple with problems of increasing scale and of handling noisy or imperfect data.
Software engineering can significantly contribute to the design of tools that let biomedical researchers access and control data. Biology laboratories, especially those with limited software-development resources, often have difficulty rapidly customizing commercial database packages to handle complex biological data. Lincoln Stein and Jean Thierry-Mieg introduce AceDB (a Caenorhabditis elegans database), a system specifically designed to address this problem. In addition to describing the AceDB system's architecture, data model, query languages, and API, their article gives the reader a flavor of the rapid academic-to-production software evolution process increasingly common to bioinformatics.
The effects of greater data availability are perhaps most visible in the area of genomics, a term that broadly refers to the determination and analysis of various organisms' DNA sequences. Knowing the human genome sequence, and that of model organisms such as the mouse, will dramatically impact our understanding of fields ranging from medical diagnosis, drug development, and disease treatment to evolution and anthropology. Gene Myers presents an overview of the problem of determining an entire organism's DNA sequence. The algorithmic and computational challenges he describes are exacerbated by the complex, repetitive nature of genomic DNA, by inexact laboratory techniques, and by the sheer scale of current projects.
Even without the completed human genome sequence in hand, we can learn a great deal about genetic causes of disease by studying family histories. Mark Daly describes methods for tracking down genes associated with complex diseases by analyzing the disease histories of large families. The computational cost of these analyses tends to grow exponentially with either the chosen method's statistical power or the size of the family being studied. Although his article describes state-of-the-art software that addresses this problem, it also stresses the limitations of current methods, highlighting a problem ripe for future analytical breakthroughs.
Proteins are often described as the building blocks of life and cells as the basic sites of protein activity. Naturally, there is great interest in deriving a better understanding of protein behavior at the cellular level. Charles DeLisi and Sandor Vajda introduce a variety of computational challenges in cell biology. They describe analogies between cell-signaling pathways and complex networks of electronic circuitry, survey current work modeling various cellular processes, and introduce important open problems in predicting cellular activity and analyzing biological circuitry.
We have endeavored to present a range of problems that fall under the headings of computational biology or bioinformatics. We were unable, however, to address many additional areas in computational biology that offer important and exciting opportunities for computational scientists. Such areas include predicting protein structure and folding dynamics, reconstructing evolutionary trees from genomic data, interpreting patterns of gene expression in cells, and numerous applications of machine-learning and data-mining techniques to extract biological results from massive data sets.
We hope this special issue will inspire readers to further explore and participate in this challenging area of computational science.
Jill P. Mesirov is associate director of the Whitehead/MIT Center for Genome Research, where she is responsible for the informatics, computational science, and research computing programs. She is also an adjunct professor of computer science at Boston University. She spent many years working on high-performance computing and developing parallel algorithms relevant to problems that arise in science, engineering, and business applications. Her current research interest is in computational biology and bioinformatics. She holds an AB from the University of Pennsylvania and an MA and PhD from Brandeis University, all in mathematics. She is a member of the American Mathematical Society, American Association for the Advancement of Science, Association for Computing Machinery, SIGACT, IEEE Computer Society, and Society for Industrial and Applied Mathematics. Contact her at the Whitehead/MIT Center for Genome Research, One Kendall Square, Bldg. 300, Cambridge, MA 02139-1561; firstname.lastname@example.org.
Donna K. Slonim is a research scientist at the Whitehead/MIT Center for Genome Research. Her current research focuses on developing computational methods for analyzing and interpreting gene expression data. Other interests include the application of computational learning techniques and algorithm design to other problems in biology and medicine (understanding gene regulation, predicting drug metabolism rates, directing medical diagnosis and treatment, and so on), and more general problems of machine learning with noisy data. She received her BS from Yale, her MS from the University of California at Berkeley, and her PhD from MIT, all in computer science. She is a member of the American Association for the Advancement of Science. Contact her at the Whitehead/MIT Center for Genome Research, One Kendall Square, Bldg. 300, Cambridge, MA 02139-1561; email@example.com.