Issue No.05 - Sept.-Oct. (2013 vol.10)
pp: 1098-1112
K. S. M. Tozammel Hossain, Dept. of Comput. Sci., Virginia Tech, Blacksburg, VA, USA
Debprakash Patnaik, Amazon Inc., Seattle, WA, USA
Srivatsan Laxman, Microsoft Res., Bangalore, India
Prateek Jain, Microsoft Res., Bangalore, India
Chris Bailey-Kellogg, Dept. of Comput. Sci., Dartmouth Coll., Hanover, NH, USA
Naren Ramakrishnan, Dept. of Comput. Sci., Virginia Tech, Blacksburg, VA, USA
ABSTRACT
We present alignment refinement by mining coupled residues (ARMiCoRe), a novel approach to a classical bioinformatics problem, viz., multiple sequence alignment (MSA) of gene and protein sequences. Aligning multiple biological sequences is a key step in elucidating evolutionary relationships, annotating newly sequenced segments, and understanding the relationship between biological sequences and functions. Classical MSA algorithms are designed primarily to capture conservation in sequences, whereas couplings, or correlated mutations, are well known as an additional important aspect of sequence evolution. (Two sequence positions are coupled when mutations in one are accompanied by compensatory mutations in another.) As a result, better exposition of couplings is sometimes one of the reasons practitioners hand-tweak MSAs. ARMiCoRe introduces a distinctly pattern-mining approach to improving MSAs: using frequent episode mining as a foundational basis, we define the notion of a coupled pattern and demonstrate how the discovery and tiling of coupled patterns using a max-flow approach can yield MSAs that are better than conservation-based alignments. Although we were motivated to improve MSAs for the sake of better exposing couplings, we demonstrate that our MSAs are also improvements in terms of traditional metrics of assessment. We demonstrate the effectiveness of ARMiCoRe on a large collection of data sets.
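To make the notion of coupled positions concrete, the following minimal sketch scores how strongly two columns of an alignment co-vary using mutual information. This is a generic illustration and not the ARMiCoRe method, which the abstract describes as based on frequent episode mining and max-flow tiling. The function names and the toy alignment `toy_msa` are assumptions made up for this example.

```python
from collections import Counter
from math import log2


def column(msa, j):
    """Return the residues found at column j of every aligned sequence."""
    return [seq[j] for seq in msa]


def mutual_information(msa, i, j):
    """Mutual information (in bits) between columns i and j of an alignment.

    A high value means the residue at column i is informative about the
    residue at column j, which is one simple signal of coupling.
    """
    n = len(msa)
    col_i, col_j = column(msa, i), column(msa, j)
    count_i = Counter(col_i)
    count_j = Counter(col_j)
    count_ij = Counter(zip(col_i, col_j))

    mi = 0.0
    for (a, b), n_ab in count_ij.items():
        p_ab = n_ab / n
        p_a = count_i[a] / n
        p_b = count_j[b] / n
        mi += p_ab * log2(p_ab / (p_a * p_b))
    return mi


# Hypothetical toy alignment: column 0 is fully conserved, while
# columns 1 and 3 mutate together (K with E, R with D).
toy_msa = [
    "AKLE",
    "AKLE",
    "ARLD",
    "ARLD",
]

coupled_score = mutual_information(toy_msa, 1, 3)
conserved_score = mutual_information(toy_msa, 0, 1)
```

In this toy case the co-varying pair of columns scores higher than any pair that involves the fully conserved column, which carries no information about its partner. A conservation-only view of the alignment would not capture that distinction between the two kinds of columns.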
INDEX TERMS
Sequential analysis, Hidden Markov models, Classification algorithms, Amino acids, Bioinformatics, Proteins, max-flow problems, Multiple sequence alignment, coupled residues, pattern set mining, coupled patterns
CITATION
K. S. M. Tozammel Hossain, Debprakash Patnaik, Srivatsan Laxman, Prateek Jain, Chris Bailey-Kellogg, Naren Ramakrishnan, "Improved Multiple Sequence Alignments Using Coupled Pattern Mining," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 10, no. 5, pp. 1098-1112, Sept.-Oct. 2013, doi:10.1109/TCBB.2013.36