Maximum-Scoring Segment Sets
October-December 2004 (vol. 1 no. 4)
pp. 139-150

Abstract—We examine the problem of finding maximum-scoring sets of disjoint segments in a sequence of scores. The problem arises in DNA and protein segmentation and in postprocessing of sequence alignments. Our key result states a simple recursive relationship between maximum-scoring segment sets. The statement leads to fast algorithms for finding such segment sets. We apply our methods to the identification of noncoding RNA genes in thermophiles.
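To make the problem statement concrete, here is a minimal sketch that finds a maximum-scoring set of at most k disjoint segments with a textbook O(nk) dynamic program. It is not the recursive relationship or the algorithm of the article, which the abstract does not spell out. The function name `max_scoring_segments`, the half-open `(start, end)` segment convention, and the choice to bound the number of segments by `k` are assumptions made for illustration.

```python
def max_scoring_segments(scores, k):
    """Return (total, segments) for a best set of at most k disjoint segments.

    Segments are half-open index pairs (start, end). This is a plain
    dynamic program over prefixes, shown only as an illustration.
    """
    neg_inf = float("-inf")
    # best[j]: best (total, segments) with at most j segments in the prefix so far.
    best = [(0, ())] * (k + 1)
    # ending[j]: same, but the last segment must end exactly at the current position.
    ending = [(neg_inf, ())] * (k + 1)

    def pick(a, b):
        # Keep the higher total; on a tie prefer the first argument.
        return a if a[0] >= b[0] else b

    for i, s in enumerate(scores):
        new_best = [best[0]]
        new_ending = [(neg_inf, ())]
        for j in range(1, k + 1):
            # Option 1: extend the segment that ended at the previous position.
            ext_total, ext_segs = ending[j]
            if ext_total > neg_inf:
                start = ext_segs[-1][0]
                extend = (ext_total + s, ext_segs[:-1] + ((start, i + 1),))
            else:
                extend = (neg_inf, ())
            # Option 2: open a new segment at i after a best (j - 1)-segment prefix.
            prev_total, prev_segs = best[j - 1]
            fresh = (prev_total + s, prev_segs + ((i, i + 1),))

            cur = pick(extend, fresh)
            new_ending.append(cur)
            # Either skip position i or use a segment ending at i.
            new_best.append(pick(best[j], cur))
        best, ending = new_best, new_ending

    return best[k]


if __name__ == "__main__":
    total, segments = max_scoring_segments([2, -3, 4, -1, 2, -5, 3], 2)
    print(total, segments)
```

With the scores 2, -3, 4, -1, 2, -5, 3 and k = 2, the sketch selects the segments covering positions 2 to 4 and position 6, for a total of 8. Carrying the segment tuples along keeps the code short at the cost of extra copying; a traceback table would be the usual choice for long sequences.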

Index Terms:
Segmentation, change point estimation, noncoding RNA, thermophiles.
Citation:
Miklós Csűrös, "Maximum-Scoring Segment Sets," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 1, no. 4, pp. 139-150, Oct.-Dec. 2004, doi:10.1109/TCBB.2004.43