SCS: Signal, Context, and Structure Features for Genome-Wide Human Promoter Recognition
July-September 2010 (vol. 7 no. 3)
pp. 550-562
Jia Zeng, Soochow University, Suzhou
Xiao-Yu Zhao, Hong Kong Polytechnic University, Hong Kong
Xiao-Qin Cao, City University of Hong Kong, Hong Kong
Hong Yan, City University of Hong Kong, Hong Kong and The University of Sydney, Sydney
This paper integrates signal, context, and structure features for genome-wide human promoter recognition, which is important for improving genome annotation and analyzing transcriptional regulation without experimental support from ESTs, cDNAs, or mRNAs. First, CpG islands are salient biological signals associated with approximately 50 percent of mammalian promoters. Second, the genomic context of promoters, captured by n-mers (sequences n bases long) and their statistics estimated from training samples, may have biological significance. Third, sequence-dependent DNA flexibility originates from the 3D structure of DNA and plays an important role in guiding transcription factors to their target sites in promoters. Using decision trees, we combine these signal, context, and structure features to build a hierarchical promoter recognition system called SCS. Experimental results on controlled data sets and the entire human genome demonstrate that SCS achieves significantly higher sensitivity and specificity than other state-of-the-art methods. The SCS promoter recognition system is available online as supplemental material for academic use and can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TCBB.2008.95.
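
As a rough illustration of the first two feature types named in the abstract, the sketch below computes two statistics commonly used to characterize CpG islands (GC fraction and the observed/expected CpG ratio) and counts n-mers in a sequence window. It is not the SCS implementation: the function names `cpg_features` and `nmer_counts` are hypothetical, and the abstract does not state which CpG criteria, window sizes, or values of n the authors use.

```python
from collections import Counter


def cpg_features(seq):
    """Return (GC fraction, observed/expected CpG ratio) for a DNA string.

    Illustrative only: the paper's actual CpG island criteria are not
    given in the abstract.
    """
    s = seq.upper()
    n = len(s)
    if n == 0:
        return 0.0, 0.0
    c = s.count("C")
    g = s.count("G")
    cpg = s.count("CG")  # "CG" cannot overlap itself, so count() is exact
    gc_fraction = (c + g) / n
    # Observed CpG count divided by the count expected from base composition.
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp


def nmer_counts(seq, n):
    """Count every overlapping n-mer (substring of length n) in seq."""
    s = seq.upper()
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))
```

In a setup like the one the abstract describes, statistics of this kind would be estimated from training promoters and non-promoters and then passed, together with structural features, to a classifier such as a decision tree.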

Index Terms:
Promoter recognition, feature extraction, classifier combination, genome analysis.
Citation:
Jia Zeng, Xiao-Yu Zhao, Xiao-Qin Cao, Hong Yan, "SCS: Signal, Context, and Structure Features for Genome-Wide Human Promoter Recognition," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 7, no. 3, pp. 550-562, July-Sept. 2010, doi:10.1109/TCBB.2008.95