ScalaBLAST: A Scalable Implementation of BLAST for High-Performance Data-Intensive Bioinformatics Analysis
August 2006 (vol. 17 no. 8)
pp. 740-749
Christopher Oehmen
Jarek Nieplocha, IEEE Computer Society

Abstract: Genes in an organism's DNA (genome) have embedded in them information about proteins, which are the molecules that do most of a cell's work. A typical bacterial genome contains on the order of 5,000 genes. Mammalian genomes can contain tens of thousands of genes. For each genome sequenced, the challenge is to identify the protein components (proteome) being actively used under a given set of conditions. Fundamentally, sequence alignment is a sequence matching problem focused on unlocking protein information embedded in the genetic code, making it possible to assemble a "tree of life" by comparing new sequences against all sequences from known organisms. However, the memory footprint of sequence data is growing more rapidly than per-node core memory. Despite years of research and development, high-performance sequence alignment applications either do not scale well, cannot accommodate very large databases in core, or require special hardware. We have developed a high-performance sequence alignment application, ScalaBLAST, which accommodates very large databases and scales linearly to as many as thousands of processors on both distributed memory and shared memory architectures. This represents a substantial improvement over the current state of the art in high-performance sequence alignment in both scaling and portability. ScalaBLAST relies on a collection of techniques (distributing the target database over available memory, multilevel parallelism to exploit concurrency, parallel I/O, and latency hiding through data prefetching) to achieve high performance and scalability. This demonstrated approach of database sharing combined with effective task scheduling should have broad-ranging applications to other informatics-driven sciences.
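
To make the techniques named in the abstract more concrete, the following is a minimal single-machine sketch in Python, not ScalaBLAST's actual implementation. It assumes a target database split into in-memory shards (standing in for memory distributed across nodes), a worker pool that processes independent queries concurrently, and a one-slot prefetch that requests the next shard while the current one is being scored. All function names (`partition`, `fetch_shard`, `toy_score`, `search`, `run_queries`) are hypothetical, and the shared k-mer count is only a placeholder for a real alignment score.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(database, n_nodes):
    """Spread target sequences round-robin over n_nodes in-memory shards."""
    return [database[i::n_nodes] for i in range(n_nodes)]


def kmers(seq, k=3):
    """Return the set of length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def toy_score(query, target, k=3):
    """Placeholder score: number of shared k-mers (this is NOT the BLAST score)."""
    return len(kmers(query, k) & kmers(target, k))


def fetch_shard(shards, idx):
    """Assumed stand-in for fetching a shard held in another node's memory."""
    return list(shards[idx])


def search(query, shards):
    """Score one query against every shard, prefetching shard i+1 during shard i."""
    best = (-1, None)
    with ThreadPoolExecutor(max_workers=1) as prefetcher:
        pending = prefetcher.submit(fetch_shard, shards, 0)
        for i in range(len(shards)):
            current = pending.result()  # wait for the shard requested earlier
            if i + 1 < len(shards):
                # Request the next shard before scoring, so transfer and compute overlap.
                pending = prefetcher.submit(fetch_shard, shards, i + 1)
            for target in current:
                best = max(best, (toy_score(query, target), target))
    return best


def run_queries(queries, shards, n_workers=2):
    """Process independent queries concurrently; each sees the whole sharded database."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(lambda q: search(q, shards), queries))
```

In this sketch every query can reach every shard, so the database never has to fit in a single worker's memory, and the prefetch thread only hides latency when fetching is slower than a local copy, as it would be across a network.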

Index Terms:
High-performance sequence alignment, BLAST, Global Arrays.
Citation:
Christopher Oehmen, Jarek Nieplocha, "ScalaBLAST: A Scalable Implementation of BLAST for High-Performance Data-Intensive Bioinformatics Analysis," IEEE Transactions on Parallel and Distributed Systems, vol. 17, no. 8, pp. 740-749, Aug. 2006, doi:10.1109/TPDS.2006.112