Issue No.02 - April-June (2008 vol.5)
pp: 313-318
ABSTRACT
Emerging microarray technologies allow affordable typing of very long genome sequences. A key challenge in analyzing such a huge amount of data is scalable and accurate computational inference of haplotypes (i.e., splitting each genotype into a pair of corresponding haplotypes). In this paper, we first phase genotypes consisting of only two SNPs using genotype frequencies adjusted to the random mating model and then extend the phasing of two-SNP genotypes to the phasing of complete genotypes using maximum spanning trees. The runtime of the proposed 2SNP algorithm is $O(nm(n + \log m))$, where $n$ and $m$ are the numbers of genotypes and SNPs, respectively, and it can handle genotypes spanning entire chromosomes in a matter of hours. On datasets across 23 chromosomal regions from HapMap [11], 2SNP is several orders of magnitude faster than GERBIL and PHASE while matching them in quality, as measured by the number of correctly phased genotypes and by single-site and switching errors. For example, the 2SNP software phases an entire chromosome ($10^5$ SNPs from HapMap) for 30 individuals in 2 hours with an average switching error of 7.7%. We have also enhanced the 2SNP algorithm to phase family trio data and compared it with four other well-known phasing methods on simulated data from [15]. 2SNP is much faster than all of them while losing in quality only to PHASE. The 2SNP software is publicly available at http://alla.cs.gsu.edu/~software/2SNP.
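The two-stage idea in the abstract (phase SNP pairs first, then join the pairwise decisions along a maximum spanning tree) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The genotype encoding (0 and 1 for homozygous sites, 2 for heterozygous), the function names, the pseudocounts, and the log-odds edge weight are all assumptions made for illustration, and the naive tree construction does not attempt to reach the stated runtime bound.

```python
from math import log

def pair_counts(genotypes, i, j):
    """Count 2-SNP haplotypes at sites i, j from unambiguous genotypes."""
    # Pseudocount of 1 per haplotype (an assumption) keeps the ratios finite.
    counts = {(0, 0): 1.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 1.0}
    for g in genotypes:
        a, b = g[i], g[j]
        if a == 2 and b == 2:
            continue  # doubly heterozygous: phase is ambiguous, skip
        alleles_a = (0, 1) if a == 2 else (a, a)
        alleles_b = (0, 1) if b == 2 else (b, b)
        # With at most one heterozygous site the two haplotypes are determined.
        counts[(alleles_a[0], alleles_b[0])] += 1
        counts[(alleles_a[1], alleles_b[1])] += 1
    return counts

def pair_phase(genotypes, i, j):
    """Return (is_cis, confidence) for a doubly heterozygous pair of sites."""
    c = pair_counts(genotypes, i, j)
    cis = c[(0, 0)] * c[(1, 1)]    # phasing 00/11
    trans = c[(0, 1)] * c[(1, 0)]  # phasing 01/10
    return cis >= trans, abs(log(cis / trans))

def phase_genotype(g, genotypes):
    """Phase one genotype along a maximum spanning tree of its het sites."""
    het = [k for k, a in enumerate(g) if a == 2]
    h1 = [a if a != 2 else None for a in g]
    if het:
        h1[het[0]] = 0  # arbitrary orientation of the first het site
        in_tree = {het[0]}
        # Prim's algorithm, always taking the most confident crossing edge.
        while len(in_tree) < len(het):
            best = None
            for u in in_tree:
                for v in het:
                    if v in in_tree:
                        continue
                    is_cis, w = pair_phase(genotypes, u, v)
                    if best is None or w > best[0]:
                        best = (w, u, v, is_cis)
            _, u, v, is_cis = best
            h1[v] = h1[u] if is_cis else 1 - h1[u]
            in_tree.add(v)
    # The second haplotype is the complement of the first at het sites.
    h2 = [a if a != 2 else 1 - x for a, x in zip(g, h1)]
    return h1, h2

population = [[0, 0, 0], [1, 1, 1], [0, 0, 0], [1, 1, 1], [2, 2, 2]]
haplotype_a, haplotype_b = phase_genotype([2, 2, 2], population)
```

In this sketch, the most confident pairwise decisions are applied first and each heterozygous site is oriented exactly once, so conflicting low-confidence pair decisions never enter the result.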
INDEX TERMS
SNP, genotype, haplotype, phasing, algorithm
CITATION
Dumitru Brinza, Alexander Zelikovsky, "2SNP: Scalable Phasing Method for Trios and Unrelated Individuals", IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol.5, no. 2, pp. 313-318, April-June 2008, doi:10.1109/TCBB.2007.1068
REFERENCES
[1] Affymetrix, http://www.affymetrix.com/productsarrays /, 2005.
[2] D. Brinza and A. Zelikovsky, “2SNP: Scalable Phasing Based on 2-SNP Haplotypes,” Bioinformatics, vol. 22, no. 3, pp. 371-374, 2006.
[3] D. Brinza and A. Zelikovsky, “Phasing of 2-SNP Genotypes Based on Non-Random Mating Model,” Proc. Int'l Workshop Bioinformatics Research and Applications, pp. 767-774, 2006.
[4] A. Clark, “Inference of Haplotypes from PCR-Amplified Samples of Diploid Populations,” Molecular Biology and Evolution, vol. 7, pp. 111-122, 1990.
[5] M. Daly, J. Rioux, S. Schaffner, T. Hudson, and E. Lander, “High Resolution Haplotype Structure in the Human Genome,” Nature Genetics, vol. 29, pp. 229-232, 2001.
[6] G. Gabriel, S. Schaffner, H. Nguyen, J. Moore, J. Roy, B. Blumenstiel, J. Higgins et al., “The Structure of Haplotype Blocks in the Human Genome,” Science, vol. 296, pp. 2225-2229, 2002.
[7] D. Gusfield, “Haplotype Inference by Pure Parsimony,” Proc. Symp. Combinatorial Pattern Matching, pp. 144-155, 2003.
[8] E. Halperin and E. Eskin, “Haplotype Reconstruction from Genotype Data Using Imperfect Phylogeny,” Bioinformatics, vol. 20, pp. 1842-1849, 2004.
[9] R. Hudson, “Gene Genealogies and the Coalescent Process,” Oxford Survey of Evolutionary Biology, vol. 7, pp. 1-44, 1990.
[10] J. Hull, K. Rowlands, E. Lockhart, M. Sharland, C. Moore, N. Hanchard, and D.P. Kwiatkowski, “Haplotype Mapping of the Bronchiolitis Susceptibility Locus Near IL8,” Am. J. Human Genetics, vol. 114, pp. 272-279, 2004.
[11] Int'l HapMap Consortium, “The Int'l HapMap Project,” Nature, vol. 426, pp. 789-796, 2003.
[12] G. Kimmel and R. Shamir, “GERBIL: Genotype Resolution and Block Identification Using Likelihood,” Proc. Nat'l Academy of Sciences, vol. 102, pp. 158-162, 2005.
[13] L. Kruglyak and D.A. Nickerson, “Variation Is the Spice of Life,” Nature Genetics, vol. 27, pp. 234-236, 2001.
[14] S. Lin, A. Chakravarti, and D. Cutler, “Haplotype and Missing Data Inference in Nuclear Families,” Genome Research, vol. 14, pp. 1624-1632, 2004.
[15] J. Marchini, D. Cutler, N. Patterson, M. Stephens, E. Eskin, E. Halperin, S. Lin, Z.S. Qin, H.M. Munro, G.R. Abecasis, P. Donnelly, and Int'l HapMap Consortium, “A Comparison of Phasing Algorithms for Trios and Unrelated Individuals,” Am. J. Human Genetics, vol. 78, pp. 437-450, 2006.
[16] T. Niu, Z. Qin, X. Xu, and J.S. Liu, “Bayesian Haplotype Inference for Multiple Linked Single-Nucleotide Polymorphisms,” Am. J. Human Genetics, vol. 70, pp. 157-169, 2002.
[17] T. Niu, “Algorithms for Inferring Haplotypes,” Genetic Epidemiology, vol. 27, no. 4, pp. 334-347, 2004.
[18] Phasing Algorithm Benchmark Datasets, http://www.hapmap.orghttp://www.stats.ox.ac.uk/ marchiniphaseoff.html, July 2006.
[19] S. Schaffner, C. Foo, S. Gabriel, D. Reich, M. Daly, and D. Altshuler, “Calibrating a Coalescent Simulation of Human Genome Sequence Variation,” Genome Research, vol. 15, pp. 1576-1583, 2005.
[20] M. Stephens, N. Smith, and P. Donnelly, “A New Statistical Method for Haplotype Reconstruction from Population Data,” Am. J. Human Genetics, vol. 68, pp. 978-989, 2001.
[21] M. Stephens and P. Donnelly, “A Comparison of Bayesian Methods for Haplotype Reconstruction from Population Genotype Data,” Am. J. Human Genetics, vol. 73, pp. 1162-1169, 2003.