Multiple Sequence Assembly from Reads Alignable to a Common Reference Genome
September/October 2011 (vol. 8 no. 5)
pp. 1283-1295
Qian Peng, University of California, San Diego, La Jolla
Andrew D. Smith, University of Southern California, Los Angeles
We describe a set of computational problems motivated by certain analysis tasks in genome resequencing. These are assembly problems for which multiple distinct sequences must be assembled, but where the relative positions of reads to be assembled are already known. This information is obtained from a common reference genome and is characteristic of resequencing experiments. The simplest variant of the problem aims at determining a minimum set of superstrings such that each sequenced read matches at least one superstring. We give an algorithm with time complexity O(N), where N is the sum of the lengths of reads, substantially improving on previous algorithms for solving the same problem. We also examine the problem of finding the smallest number of reads to remove such that the remaining reads are consistent with k superstrings. By exploiting a surprising relationship with the minimum cost flow problem, we show that this problem can be solved in polynomial time when nested reads are excluded. If nested reads are permitted, this problem of removing the minimum number of reads becomes NP-hard. We show that permitting mismatches between reads and their nearest superstrings generally renders these problems NP-hard.
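To make the simplest variant concrete, the following minimal sketch checks whether a proposed grouping of reads is consistent with one superstring per group. It is not the O(N) algorithm from the paper, which the abstract does not spell out. It only illustrates the problem statement, under the assumption that each read is given as a (start position on the reference, sequence) pair and that a read "matches" a superstring when it agrees with it at every aligned position. The names Read, merge_reads, and is_valid_cover are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Assumed representation: (start position on the reference, read sequence).
Read = Tuple[int, str]


def merge_reads(reads: List[Read]) -> Optional[Dict[int, str]]:
    """Merge reads into one partial superstring, keyed by reference position.

    Returns None if two reads disagree at some shared position, meaning the
    reads cannot all match a single superstring.
    """
    merged: Dict[int, str] = {}
    for start, seq in reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            # setdefault stores the base if the position is new; otherwise it
            # returns the base already recorded there, which must agree.
            if merged.setdefault(pos, base) != base:
                return None
    return merged


def is_valid_cover(groups: List[List[Read]]) -> bool:
    """True if every group of reads is consistent with one superstring."""
    return all(merge_reads(group) is not None for group in groups)


r1: Read = (0, "ACGT")
r2: Read = (2, "GTAA")  # agrees with r1 on positions 2-3
r3: Read = (2, "GAAA")  # disagrees with r1 at position 3

print(is_valid_cover([[r1, r2], [r3]]))  # two superstrings suffice here
print(is_valid_cover([[r1, r2, r3]]))    # one superstring cannot fit all three
```

The check runs in time linear in the total read length, but it only verifies a given grouping; finding a grouping with the minimum number of superstrings is the problem the paper addresses.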

Index Terms:
Combinatorics, sequence assembly, haplotyping, chain and antichain, superstring.
Citation:
Qian Peng, Andrew D. Smith, "Multiple Sequence Assembly from Reads Alignable to a Common Reference Genome," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 8, no. 5, pp. 1283-1295, Sept.-Oct. 2011, doi:10.1109/TCBB.2010.107