Issue No.01 - January-February (2011 vol.8)
pp: 36-44
Dario Gasbarra, University of Helsinki, Helsinki
Sangita Kulathinal, University of Helsinki, Helsinki, and Indic Society for Education and Development, Nashik
Matti Pirinen, University of Helsinki, Helsinki
Mikko J. Sillanpää, University of Helsinki, Helsinki
We assume that allele frequency data have been extracted from several large DNA pools, each containing genetic material from up to hundreds of sampled individuals. Our goal is to estimate the haplotype frequencies among the sampled individuals by combining the pooled allele frequency data with prior knowledge about the set of possible haplotypes. Such prior information can be obtained, for example, from a database such as HapMap. We present a Bayesian haplotyping method for pooled DNA based on a continuous approximation of the multinomial distribution. The proposed method is applicable when the sizes of the DNA pools and/or the number of considered loci exceed the limits of several earlier methods. In the example analyses, the proposed model clearly outperforms a deterministic greedy algorithm on real data from the HapMap database. With a small number of loci, the performance of the proposed method is similar to that of an EM algorithm that uses a multinormal approximation for the pooled allele frequencies but does not use prior information about the haplotypes. The method has been implemented in MATLAB, and the code is available upon request from the authors.
DNA pools, haplotype frequency estimation, HapMap database, multinomial distribution.
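To make the pooled-data setting concrete, the following is a minimal Python sketch, not the authors' MATLAB implementation. It assumes biallelic loci coded 0/1, a known candidate haplotype set, and multinomial sampling of haplotype copies into a pool. It computes the mean and covariance of the pooled allele counts, which are the quantities a multinormal approximation would match; the paper's own continuous approximation of the multinomial may take a different form. The function names and the toy haplotype set are hypothetical.

```python
import random


def pooled_allele_moments(haplotypes, freqs, n_copies):
    """Mean and covariance of pooled allele counts under multinomial sampling.

    haplotypes: list of K tuples of 0/1 alleles over L loci (assumed known set).
    freqs:      K haplotype frequencies summing to 1.
    n_copies:   haplotype copies in the pool (2 per diploid individual).
    """
    K, L = len(haplotypes), len(haplotypes[0])
    # Marginal frequency of allele 1 at each locus: p_l = sum_k theta_k * h_kl
    p = [sum(freqs[k] * haplotypes[k][l] for k in range(K)) for l in range(L)]
    # Joint frequency of allele 1 at loci l and m on the same haplotype
    joint = [
        [sum(freqs[k] * haplotypes[k][l] * haplotypes[k][m] for k in range(K))
         for m in range(L)]
        for l in range(L)
    ]
    mean = [n_copies * p_l for p_l in p]
    cov = [
        [n_copies * (joint[l][m] - p[l] * p[m]) for m in range(L)]
        for l in range(L)
    ]
    return mean, cov


def simulate_pool_counts(haplotypes, freqs, n_copies, rng):
    """Draw one pool and return the count of allele 1 at each locus."""
    drawn = rng.choices(range(len(haplotypes)), weights=freqs, k=n_copies)
    L = len(haplotypes[0])
    return [sum(haplotypes[k][l] for k in drawn) for l in range(L)]


# Toy example (made-up values): 3 candidate haplotypes over 2 loci,
# pool of 100 diploid individuals, i.e. 200 haplotype copies.
toy_haplotypes = [(0, 0), (0, 1), (1, 1)]
toy_freqs = [0.5, 0.25, 0.25]
toy_mean, toy_cov = pooled_allele_moments(toy_haplotypes, toy_freqs, 200)
toy_counts = simulate_pool_counts(toy_haplotypes, toy_freqs, 200, random.Random(0))
```

Only the per-locus counts are observed from a pool, so the haplotype frequencies enter the data solely through these moments. That is why restricting the candidate haplotype set, for example with database information, can help identify them.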
Dario Gasbarra, Sangita Kulathinal, Matti Pirinen, Mikko J. Sillanpää, "Estimating Haplotype Frequencies by Combining Data from Large DNA Pools with Database Information", IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 8, no. 1, pp. 36-44, January-February 2011, doi:10.1109/TCBB.2009.71