IEEE/ACM Transactions on Computational Biology and Bioinformatics

Issue No. 03 - July-September (2010 vol. 7)

pp: 495-510

Rezaul Alam Chowdhury, The University of Texas at Austin, Austin

Hai-Son Le, Carnegie Mellon University, Pittsburgh

Vijaya Ramachandran, The University of Texas at Austin, Austin

DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TCBB.2008.94

ABSTRACT

We present efficient cache-oblivious algorithms for some well-studied string problems in bioinformatics including the longest common subsequence, global pairwise sequence alignment and three-way sequence alignment (or median), both with affine gap costs, and RNA secondary structure prediction with simple pseudoknots. For each of these problems, we present cache-oblivious algorithms that match the best-known time complexity, match or improve the best-known space complexity, and improve significantly over the cache-efficiency of earlier algorithms. We present experimental results which show that our cache-oblivious algorithms run faster than software and implementations based on previous best algorithms for these problems.

INDEX TERMS

Sequence alignment, median, RNA secondary structure prediction, dynamic programming, cache-efficient, cache-oblivious.

CITATION

Rezaul Alam Chowdhury, Hai-Son Le, Vijaya Ramachandran, "Cache-Oblivious Dynamic Programming for Bioinformatics",

*IEEE/ACM Transactions on Computational Biology and Bioinformatics*, vol. 7, no. 3, pp. 495-510, July-September 2010, doi:10.1109/TCBB.2008.94