Issue No.01 - January-March (2010 vol.7)
pp: 37-49
Petri Pehkonen, University of Kuopio, Kuopio
Garry Wong, University of Kuopio, Kuopio
Petri Törönen, University of Helsinki, Kuopio
Segmentation aims to separate homogeneous areas from sequential data and plays a central role in data mining. Its applications range from finance to molecular biology, where bioinformatics tasks such as genome data analysis are active application fields. In this paper, we present a novel application of segmentation: locating genomic regions with coexpressed genes. We aim at automated discovery of such regions without requiring user-given parameters. To perform the segmentation within a reasonable time, we use heuristics. Most heuristic segmentation algorithms require a decision on the number of segments. This decision is usually made with asymptotic model selection methods such as the Bayesian information criterion. Such methods rest on simplifying assumptions, which can limit their use. In this paper, we propose a Bayesian model selection criterion for choosing the most appropriate result from a heuristic segmentation. Our Bayesian model combines a simple prior over segmentation solutions with different numbers of segments and a modified Dirichlet prior for modeling multinomial data. Using various artificial data sets in our benchmark system, we show that our model selection criterion has the best overall performance. Applying our method to yeast cell-cycle gene expression data reveals potentially active and passive regions of the genome.
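The general idea in the abstract, scoring heuristic segmentations with a Bayesian criterion so that the number of segments need not be given by the user, can be illustrated with a minimal sketch. It is not the authors' model: it assumes a standard symmetric Dirichlet prior (the paper's "modified" Dirichlet prior is not specified here), a uniform prior over the number of segments and over boundary placements, and a greedy top-down split heuristic. All function names and the `alpha` and `max_segments` parameters are hypothetical, and the input is assumed to be a sequence of categorical labels such as expression-cluster labels of genes in chromosomal order.

```python
from collections import Counter
from math import comb, lgamma, log


def segment_log_ml(symbols, n_categories, alpha=1.0):
    """Log marginal likelihood of one segment under a symmetric
    Dirichlet(alpha) prior on its multinomial parameters (assumed prior)."""
    counts = Counter(symbols)
    a_sum = alpha * n_categories
    score = lgamma(a_sum) - lgamma(a_sum + len(symbols))
    # Categories absent from the segment contribute zero to the sum.
    for c in counts.values():
        score += lgamma(alpha + c) - lgamma(alpha)
    return score


def log_posterior(seq, bounds, n_categories, max_segments, alpha=1.0):
    """Unnormalized log posterior of a segmentation given by its boundaries."""
    edges = [0] + sorted(bounds) + [len(seq)]
    k = len(edges) - 1
    # Assumed prior: uniform over k in 1..max_segments, then uniform over
    # the C(n-1, k-1) ways of placing k-1 boundaries.
    log_prior = -log(max_segments) - log(comb(len(seq) - 1, k - 1))
    log_lik = sum(
        segment_log_ml(seq[a:b], n_categories, alpha)
        for a, b in zip(edges, edges[1:])
    )
    return log_prior + log_lik


def greedy_segmentation(seq, n_categories, max_segments=5, alpha=1.0):
    """Greedy top-down heuristic: add one boundary at a time and keep the
    segmentation (over all segment counts tried) with the best posterior."""
    bounds = []
    best_bounds = []
    best_score = log_posterior(seq, bounds, n_categories, max_segments, alpha)
    while len(bounds) + 1 < max_segments:
        candidates = [p for p in range(1, len(seq)) if p not in bounds]
        if not candidates:
            break
        score, pos = max(
            (log_posterior(seq, bounds + [p], n_categories, max_segments, alpha), p)
            for p in candidates
        )
        bounds = sorted(bounds + [pos])
        if score > best_score:
            best_score, best_bounds = score, list(bounds)
    return best_bounds, best_score


if __name__ == "__main__":
    labels = "A" * 20 + "B" * 20  # toy label sequence, not real data
    print(greedy_segmentation(labels, n_categories=2))
```

In this sketch the prior over boundary placements penalizes segmentations with many segments, so a split is kept only when the gain in marginal likelihood outweighs that penalty. This plays the role that an asymptotic criterion such as BIC would otherwise play.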
Biology and genetics, clustering, classification, association rules, segmentation.
Petri Pehkonen, Garry Wong, Petri Törönen, "Heuristic Bayesian Segmentation for Discovery of Coexpressed Genes within Genomic Regions", IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol.7, no. 1, pp. 37-49, January-March 2010, doi:10.1109/TCBB.2008.56