High-Performance Direct Pairwise Comparison of Large Genomic Sequences
August 2006 (vol. 17 no. 8)
pp. 764-772

Abstract—Many applications in Comparative Genomics lend themselves to implementations that take advantage of common high-performance features in modern microprocessors. However, the common suggestion that a data-parallel, multithreaded, or high-throughput implementation is possible often ignores the complexity of actually creating such software. In this paper, we present two parallel algorithms for a classic comparative genomics algorithm, the dot plot. First, we describe a data-parallel algorithm that achieves speedups of up to 14.4x over the sequential version for large genomic comparisons. Then, we use the new algorithm as the base for a coarse-grained parallel version, suitable for multiprocessor and cluster environments, that scales linearly with the number of processors. These speedups introduce the opportunity to perform full pairwise comparisons on entire genomes on a much larger scale than previously possible. We also present the experimental, model-driven approach used to develop the algorithm that allowed us to carefully study and evaluate implementation options and to fully understand the parameters affecting its performance.
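For readers unfamiliar with the dot plot, the following is a minimal Python sketch of the general technique, not the authors' implementation. It marks every pair of positions where fixed-length windows of the two sequences agree in at least a given number of characters, then shows how the rows of the plot can be split into independent blocks, which is the kind of decomposition a coarse-grained parallel version could distribute across processors. The function names, the `window` and `threshold` parameters, and the row-block partitioning are assumptions made for illustration only.

```python
def dot_plot(seq_a, seq_b, window=3, threshold=3):
    """Sequential dot plot (illustrative sketch).

    Returns the set of (i, j) such that the window starting at i in
    seq_a and the window starting at j in seq_b agree in at least
    `threshold` of their `window` positions.
    """
    hits = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            score = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            if score >= threshold:
                hits.add((i, j))
    return hits


def dot_plot_blocked(seq_a, seq_b, window=3, threshold=3, n_blocks=2):
    """Same result as dot_plot, computed one row block at a time.

    Each block depends only on its own slice of seq_a (plus an overlap
    of window - 1 characters), so blocks could be handed to separate
    workers. Here they simply run one after another.
    """
    n_rows = len(seq_a) - window + 1
    step = max(1, -(-n_rows // n_blocks))  # ceiling division, at least 1
    hits = set()
    for start in range(0, n_rows, step):
        stop = min(start + step, n_rows)
        # Extend the slice so the last window of the block is complete.
        chunk = seq_a[start:stop + window - 1]
        for i, j in dot_plot(chunk, seq_b, window, threshold):
            hits.add((start + i, j))  # shift back to global row index
    return hits


if __name__ == "__main__":
    print(sorted(dot_plot("GATTACA", "ATTAC")))
    print(sorted(dot_plot_blocked("GATTACA", "ATTAC", n_blocks=2)))
```

Both calls report the same diagonal run of matches. The sequential version costs time proportional to the product of the two sequence lengths, which is why data-parallel and multiprocessor versions matter for genome-scale inputs.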

Index Terms:
Dot plot, data-parallel, pairwise comparison, sequence alignment, vector processor, Altivec, high-performance computing, comparative genomics, performance measures.
Christopher Mueller, Mehmet M. Dalkilic, Andrew Lumsdaine, "High-Performance Direct Pairwise Comparison of Large Genomic Sequences," IEEE Transactions on Parallel and Distributed Systems, vol. 17, no. 8, pp. 764-772, Aug. 2006, doi:10.1109/TPDS.2006.104