loading...
 This Article 
   
 Share 
   
 Bibliographic References 
   
 Add to: 
 
Digg
Furl
Spurl
Blink
Simpy
Google
Del.icio.us
Y!MyWeb
 
 Search 
   
On Subset Seeds for Protein Alignment
July-September 2009 (vol. 6 no. 3)
pp. 483-494
Mikhail Roytberg, Institute of Mathematical Problems in Biology, Pushchino, Moscow
Anna Gambin, Warsaw University, Poland
Laurent Noé, LIFL/CNRS/INRIA, France
Slawomir Lasota, Warsaw University, Poland
Eugenia Furletova, Institute of Mathematical Problems in Biology, Pushchino, Moscow
Ewa Szczurek, Max Planck Institute for Molecular Genetics, Berlin
Gregory Kucherov, LIFL/CNRS/INRIA, France
We apply the concept of subset seeds proposed in [1] to similarity search in protein sequences. The main question studied is the design of efficient seed alphabets to construct seeds with optimal sensitivity/selectivity trade-offs. We propose several different design methods and use them to construct several alphabets. We then perform a comparative analysis of seeds built over those alphabets and compare them with the standard Blastp seeding method [2], [3], as well as with the family of vector seeds proposed in [4]. While the formalism of subset seeds is less expressive (but less costly to implement) than the cumulative principle used in Blastp and vector seeds, our seeds show a similar or even better performance than Blastp on Bernoulli models of proteins compatible with the common BLOSUM62 matrix. Finally, we perform a large-scale benchmarking of our seeds against several main databases of protein alignments. Here again, the results show a comparable or better performance of our seeds versus Blastp.

[1] G. Kucherov, L. Noé, and M. Roytberg, “A Unifying Framework for Seed Sensitivity and Its Application to Subset Seeds,” J.Bioinformatics and Computational Biology, vol. 4, no. 2, pp. 553-570, Apr. 2006 (preliminary version in WABI '05).
[2] S. Altschul, W. Gish, W. Miller, E. Myers, and D. Lipman, “Basic Local Alignment Search Tool,” J. Molecular Biology, vol. 215, pp.403-410, 1990.
[3] S. Altschul, T. Madden, A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D. Lipman, “Gapped BLAST and PSI-BLAST: A New Generation of Protein Database Search Programs,” Nucleic Acids Research, vol. 25, no. 17, pp. 3389-3402, 1997.
[4] D. Brown, “Optimizing Multiple Seed for Protein Homology Search,” IEEE/ACM Trans. Computational Biology and Bioinformatics, vol. 2, no. 1, pp. 29-38, Jan. 2005 (early version appeared in WABI '04).
[5] B. Ma, J. Tromp, and M. Li, “PatternHunter: Faster and More Sensitive Homology Search,” Bioinformatics, vol. 18, no. 3, pp. 440-445, 2002.
[6] W.J. Kent, “BLAT—The BLAST-Like Alignment Tool,” Genome Research, vol. 12, pp. 656-664, 2002.
[7] M. Li, B. Ma, D. Kisman, and J. Tromp, “PatternHunter II: Highly Sensitive and Fast Homology Search,” J. Bioinformatics and Computational Biology, vol. 2, no. 3, pp. 417-439, 2004 (earlier version in GIW '03).
[8] L. Noé and G. Kucherov, “YASS: Enhancing the Sensitivity of DNA Similarity Search,” Nucleic Acid Research, vol. 33, pp. W540-W543, 2005.
[9] D. Mak, Y. Gelfand, and G. Benson, “Indel Seeds for Homology Search,” Bioinformatics, vol. 22, no. 14, pp. e341-e349, 2006.
[10] M. Csürös and B. Ma, “Rapid Homology Search with Neighbor Seeds,” Algorithmica, vol. 48, no. 2, pp. 187-202, June 2007.
[11] B. Brejova, D. Brown, and T. Vinar, “Vector Seeds: An Extension to Spaced Seeds,” J. Computer and System Sciences, vol. 70, no. 3, pp.364-380, 2005.
[12] Y. Sun and J. Buhler, “Designing Multiple Simultaneous Seeds for DNA Similarity Search,” Proc. Eighth Ann. Int'l Conf. Computational Molecular Biology (RECOMB '04), Mar. 2004.
[13] G. Kucherov, L. Noé, and M. Roytberg, “Multi-Seed Lossless Filtration,” Proc. 15th Ann. Combinatorial Pattern Matching Symp. (CPM '04), pp. 297-310, July 2004.
[14] I.-H. Yang, S.-H. Wang, Y.-H. Chen, P.-H. Huang, L. Ye, X. Huang, and K.-M. Chao, “Efficient Methods for Generating Optimal Single and Multiple Spaced Seeds,” Proc. IEEE Fourth Symp. Bioinformatics and Bioeng. (BIBE '04), pp. 411-416, 2004.
[15] J. Xu, D. Brown, M. Li, and B. Ma, “Optimizing Multiple Spaced Seeds for Homology Search,” Proc. 15th Symp. Combinatorial Pattern Matching, pp. 47-58, July 2004.
[16] D. Kisman, M. Li, B. Ma, and L. Wang, “tPatternHunter: Gapped, Fast and Sensitive Translated Homology Search,” Bioinformatics, vol. 21, no. 4, pp. 542-544, 2005.
[17] P. Peterlongo, L. Noé, D. Lavenier, G. Georges, J. Jacques, G. Kucherov, and M. Giraud, “Protein Similarity Search with Subset Seeds on a Dedicated Reconfigurable Hardware,” Proc. Second Workshop Parallel Computational Biology, 2007.
[18] V.H. Nguyen and D. Lavenier, “Speeding Up Subset Seed Algorithm for Intensive Protein Sequence Comparison,” Proc. Sixth IEEE Int'l Conf. Research, Innovation & Vision for the Future (RIVF '08), pp. 57-63, 2008.
[19] L. Noé and G. Kucherov, “Improved Hit Criteria for DNA Local Alignment,” BMC Bioinformatics, vol. 5, no. 149, Oct. 2004.
[20] L. Zhou, J. Stanton, and L. Florea, “Universal Seeds for cDNA-to-Genome Comparison,” BMC Bioinformatics, vol. 9, no. 36, 2008.
[21] B. Ma and H. Yao, “Seed Optimization is No Easier Than Optimal Golomb Ruler Design,” Proc. Sixth Asia Pacific Bioinformatics Conf. (APBC '08), pp. 133-144, Jan. 2008.
[22] U. Keich, M. Li, B. Ma, and J. Tromp, “On Spaced Seeds for Similarity Search,” Discrete Applied Math., vol. 138, no. 3, pp. 253-263, 2004 (preliminary version in 2002).
[23] T. Li, K. Fan, J. Wang, and W. Wang, “Reduction of Protein Sequence Complexity by Residue Grouping,” J. Protein Eng., vol. 16, pp. 323-330, 2003.
[24] L. Murphy, A. Wallqvist, and R. Levy, “Simplified Amino Acid Alphabets for Protein Fold Recognition and Implications for Folding,” J. Protein Eng., vol. 13, pp. 149-152, 2000.
[25] S. Cheng and Y.-F. Xu, “Constrained Independence System and Triangulations of Planar Point Sets,” Proc. Computing and Combinatorics, pp. 41-50, 1995.
[26] S. Henikoff and J. Henikoff, “Amino Acid Substitution Matrices from Protein Blocks,” Proc. Nat'l Academy of Sciences USA, vol. 89, pp. 10915-10919, 1992.
[27] S. Henikoff and J. Henikoff, “Automated Assembly of Protein Blocks for Database Searching,” Nucleic Acids Research, vol. 19, no. 23, pp. 6565-6572, 1991.
[28] J. Buhler, U. Keich, and Y. Sun, “Designing Seeds for Similarity Search in Genomic DNA,” Proc. Seventh Ann. Int'l Conf. Computational Molecular Biology (RECOMB '03), pp. 67-75, Apr. 2003.
[29] L. Ilie and S. Ilie, “Long Spaced Seeds for Finding Similarities Between Biological Sequences,” Proc. Second Int'l Conf. Bioinformatics & Computational Biology (BIOCOMP '07), pp. 3-8, 2007.
[30] A. Bahr, J. Thompson, J. Thierry, and O. Poch, “BAliBASE (Benchmark Alignment dataBASE): Enhancements for Repeats, Transmembrane Sequences and Circular Permutations,” Nucleic Acids Research, vol. 29, no. 1, pp. 323-326, 2001.
[31] A. Stebbings and K. Mizuguchi, “HOMSTRAD: Recent Developments of the Homologous Protein Structure Alignment Database,” Nucleic Acids Research, vol. 32, pp. D203-D207, 2004.
[32] A.R. Subramanian, J. Weyer-Menkhoff, M. Kaufmann, and B. Morgenstern, “DIALIGN-T: An Improved Algorithm for Segment-Based Multiple Sequence Alignment,” BMC Bioinformatics, vol. 6, no. 66, 2005.
[33] G. Raghava, S. Searle, P. Audley, J. Barber, and G. Barton, “OXBench: A Benchmark for Evaluation of Protein Multiple Sequence Alignment Accuracy,” BMC Bioinformatics, vol. 4, no. 47, 2003.
[34] R. Finn, J. Mistry, B. Schuster-Bckler, S. Griffiths-Jones, V. Hollich, T. Lassmann, S. Moxon, M. Marshall, A. Khanna, R. Durbin, S. Eddy, E. Sonnhammer, and A. Bateman, “PFAM: Clans, Web Tools and Services,” Nucleic Acids Research, vol. 34, pp. D247-D251, 2006.
[35] R.C. Edgar, “MUSCLE: Multiple Sequence Alignment with High Accuracy and High Throughput,” Nucleic Acids Research, vol. 32, no. 5, pp. 1792-1797, 2004.
[36] I. Letunic, R. Copley, B. Pils, S. Pinkert, J. Schultz, and P. Bork, “SMART 5: Domains in the Context of Genomes and Networks,” Nucleic Acids Research, vol. 34, no. 1, pp. D257-D260, 2006.
[37] R. Nunez Miguel, J. Shi, and K. Mizuguchi, “Protein Fold Recognition and Comparative Modeling using HOMSTRAD, JOY and FUGUE,” Protein Structure Prediction: Bioinformatic Approach, pp. 143-169, Int'l University Line Publishers, 2001.
[38] S. Shiryev, J. Papadopoulos, A. Schäffer, and R. Agarwala, “Improved BLAST Searches Using Longer Words for Protein Seeding,” Bioinformatics, vol. 23, no. 21, pp. 2949-2951, 2007.
[39] P. Peterlongo, L. Noé, D. Lavenier, N.V.H. , G. Kucherov, and M. Giraud, “Optimal Neighborhood Indexing for Protein Similarity Search,” BMC Bioinformatics, vol. 9, no. 534, 2008.

Index Terms:
Protein sequences, protein databases, local alignment, similarity search, seeds, subset seeds, multiple seeds, seed alphabet, sensitivity, selectivity.
Citation:
Mikhail Roytberg, Anna Gambin, Laurent Noé, Slawomir Lasota, Eugenia Furletova, Ewa Szczurek, Gregory Kucherov, "On Subset Seeds for Protein Alignment," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 6, no. 3, pp. 483-494, July-Sept. 2009, doi:10.1109/TCBB.2009.4
Usage of this product signifies your acceptance of the Terms of Use.