Analyzing Gene Expression Time-Courses
July-September 2005 (vol. 2 no. 3)
pp. 179-193
Measuring gene expression over time can provide important insights into basic cellular processes. Identifying groups of genes with similar expression time-courses is a crucial first step in the analysis. Because biologically relevant groups frequently overlap, as genes can have several distinct roles in those cellular processes, this is a difficult problem for classical clustering methods. We use a mixture model to circumvent this fundamental problem, with hidden Markov models (HMMs) as effective and flexible components. We show that the ensuing estimation problem can be addressed with additional labeled data (partially supervised learning of mixtures) through a modification of the Expectation-Maximization (EM) algorithm. Good starting points for the mixture estimation are obtained through a modification of Bayesian model merging, which allows us to learn a collection of initial HMMs. We infer groups from mixtures with a simple information-theoretic decoding heuristic, which quantifies the level of ambiguity in group assignment. We demonstrate the effectiveness of this approach using high-quality annotation data. As the HMMs we propose capture asynchronous behavior by design, the groups we find are also asynchronous. Synchronous subgroups are obtained from a novel algorithm based on Viterbi paths. We show the suitability of our HMM mixture approach on biological and simulated data and through favorable comparison with previous approaches. Software implementing the method is freely available under the GPL from http://ghmm.org/gql.
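
The abstract does not spell out the decoding heuristic, so the following is only a minimal sketch of one plausible reading: a gene is assigned to its most probable mixture component when the Shannon entropy of its posterior distribution over components is low, and left unassigned when the entropy signals ambiguity. The function names, the example posteriors, and the threshold `max_entropy` are all hypothetical and are not taken from the paper or from the GHMM/GQL software.

```python
import math


def posterior_entropy(posterior):
    """Shannon entropy (in bits) of one gene's posterior over mixture components."""
    return -sum(p * math.log2(p) for p in posterior if p > 0.0)


def decode_groups(posteriors, max_entropy=0.5):
    """Map each gene to its most probable component index, or None if ambiguous.

    posteriors:  dict of gene name -> list of component posteriors (summing to 1).
    max_entropy: hypothetical ambiguity threshold in bits (an assumption).
    """
    assignments = {}
    for gene, posterior in posteriors.items():
        if posterior_entropy(posterior) <= max_entropy:
            # Low ambiguity: pick the component with the largest posterior.
            assignments[gene] = max(range(len(posterior)), key=posterior.__getitem__)
        else:
            # High ambiguity: the gene plausibly belongs to several groups.
            assignments[gene] = None
    return assignments


# Illustrative posteriors (made up for this sketch, not data from the paper).
posteriors = {
    "geneA": [0.97, 0.02, 0.01],  # concentrated on one component
    "geneB": [0.50, 0.45, 0.05],  # split between two components
}
result = decode_groups(posteriors)
```

Under these assumptions, a gene whose posterior is concentrated on one component is assigned to it, while a gene split between two components is left unassigned, which reflects the overlapping-group situation the abstract describes.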

Index Terms:
Mixture modeling, hidden Markov models, partially supervised learning, gene expression, time-course analysis.
Citation:
Alexander Schliep, Ivan G. Costa, Christine Steinhoff, Alexander Schönhuth, "Analyzing Gene Expression Time-Courses," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 2, no. 3, pp. 179-193, July-Sept. 2005, doi:10.1109/TCBB.2005.31