[Front cover]
July-September 2007 (vol. 4 no. 3)
pp. c1
It has been shown that electropherograms of DNA sequences can be modeled with hidden Markov models (HMMs). Basecalling, the procedure that determines the sequence of bases from a given electropherogram, can then be performed using the Viterbi algorithm. A training step is required prior to basecalling in order to estimate the HMM parameters. In this paper, we propose a Bayesian approach that employs the Markov chain Monte Carlo (MCMC) method to perform basecalling. Such an approach not only allows one to naturally encode prior biological knowledge into the basecalling algorithm, but also exploits both the training data and the basecalling data in estimating the HMM parameters, leading to more accurate estimates. Using the recently sequenced genome of the organism Legionella pneumophila, we show that the MCMC basecaller outperforms the state-of-the-art basecalling algorithm in terms of total errors while requiring much less training than other proposed statistical basecallers.
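
As a rough illustration of the Viterbi decoding step mentioned above, the following is a minimal sketch in Python and not the authors' basecaller. It assumes a toy model that does not come from the paper: one hidden state per base, observations already reduced to the dominant dye channel at each peak, uniform start and transition probabilities, and a hypothetical emission accuracy `p_correct`. The function names `viterbi` and `toy_model` are illustrative only.

```python
import math

BASES = "ACGT"

def toy_model(p_correct=0.85):
    """Build log-probability tables for a toy 4-state HMM (assumed values)."""
    p_wrong = (1.0 - p_correct) / 3.0
    log_start = {b: math.log(0.25) for b in BASES}
    log_trans = {a: {b: math.log(0.25) for b in BASES} for a in BASES}
    log_emit = {
        a: {b: math.log(p_correct if a == b else p_wrong) for b in BASES}
        for a in BASES
    }
    return log_start, log_trans, log_emit

def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return the most probable hidden-state path for the observations."""
    if not observations:
        return []
    # best[t][s]: log-probability of the best path that ends in state s at step t
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = []  # back[t][s]: best predecessor of state s at step t + 1
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state to recover the full path.
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path

if __name__ == "__main__":
    start, trans, emit = toy_model()
    print("".join(viterbi(list("GATTACA"), BASES, start, trans, emit)))
```

With uniform transitions the decoded path simply follows the strongest channel at each position. An HMM of real electropherograms would use continuous emissions and parameters estimated from data, which is the estimation problem the abstract's Bayesian MCMC approach addresses.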

Index Terms:
Viterbi decoding, Bayes methods, biological techniques, biology computing, DNA, fluorescence, genetics, hidden Markov models, microorganisms, molecular biophysics, Monte Carlo methods, Bayesian basecalling algorithm, Legionella pneumophila, sequenced genome, prior biological knowledge encoding, Markov chain Monte Carlo method, Bayesian approach, Viterbi algorithm, electropherograms, DNA sequence analysis
Citation:
"[Front cover]," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 4, no. 3, pp. c1, July-Sept. 2007, doi:10.1109/tcbb.2007.1027