Combining Sequence and Time Series Expression Data to Learn Transcriptional Modules
July-September 2005 (vol. 2 no. 3)
pp. 194-202
Our goal is to cluster genes into transcriptional modules: sets of genes whose similarity in expression is explained by common regulatory mechanisms at the transcriptional level. We want to learn modules from both time series gene expression data and genome-wide motif data, which are now readily available for organisms such as S. cerevisiae as a result of prior computational studies or experimental results. We present a generative probabilistic model for combining regulatory sequence and time series expression data to cluster genes into coherent transcriptional modules. Starting with a set of motifs representing known or putative regulatory elements (transcription factor binding sites) and the counts of occurrences of these motifs in each gene's promoter region, together with a time series expression profile for each gene, the learning algorithm uses expectation maximization to learn module assignments based on both types of data. We also present a technique, based on the Jensen-Shannon entropy contributions of motifs in the learned model, for associating the most significant motifs with each module. Thus, the algorithm gives a global approach for associating sets of regulatory elements with "modules" of genes that have similar time series expression profiles. The model for expression data exploits our prior belief of smooth dependence on time by using statistical splines, and it is suitable for typical time course data sets with relatively few experiments. Moreover, the model is sufficiently interpretable that we can understand how both sequence data and expression data contribute to the cluster assignments, and how to interpolate between the two data sources. We present experimental results on the yeast cell cycle to validate our method and find that our combined expression and motif clustering algorithm discovers modules with both coherent expression and similar motif patterns, including binding motifs associated with known cell cycle transcription factors.
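
The abstract does not spell out how the Jensen-Shannon contributions are computed, so the following is only a minimal sketch of one plausible reading: the standard Jensen-Shannon divergence between a module's motif distribution and a background distribution, split into one non-negative term per motif, with motifs ranked by their term. The function name `js_contributions`, the motif labels, the choice of a background distribution as the comparison, and all numbers are hypothetical and are not taken from the paper.

```python
import numpy as np

def js_contributions(p, q):
    """Per-motif terms of the Jensen-Shannon divergence between p and q, in bits.

    The terms sum to JS(p, q). Inputs are non-negative weights and are
    normalized here to sum to 1.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)  # mixture distribution

    def half_kl_terms(a, b):
        # Elementwise 0.5 * a * log2(a / b), using the convention 0 * log 0 = 0.
        out = np.zeros_like(a)
        nz = a > 0
        out[nz] = 0.5 * a[nz] * np.log2(a[nz] / b[nz])
        return out

    return half_kl_terms(p, m) + half_kl_terms(q, m)

# Hypothetical motif frequencies (illustrative values only).
motifs = ["motif_A", "motif_B", "motif_C", "motif_D"]
module = [0.55, 0.25, 0.10, 0.10]      # assumed learned motif distribution for one module
background = [0.10, 0.10, 0.40, 0.40]  # assumed distribution over all genes

contrib = js_contributions(module, background)
# Motifs sorted from largest to smallest contribution.
ranked = sorted(zip(motifs, contrib), key=lambda pair: -pair[1])
for name, value in ranked:
    print(f"{name}: {value:.4f} bits")
```

Because each term is symmetric in the two distributions, a motif can rank highly either by being enriched or by being depleted in the module; a practical variant might keep only motifs whose module frequency exceeds the background frequency.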

Index Terms: Gene regulation, clustering, heterogeneous data.
Anshul Kundaje, Manuel Middendorf, Feng Gao, Chris Wiggins, Christina Leslie, "Combining Sequence and Time Series Expression Data to Learn Transcriptional Modules," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 2, no. 3, pp. 194-202, July-Sept. 2005, doi:10.1109/TCBB.2005.34