Associative Clustering for Exploring Dependencies between Functional Genomics Data Sets
July-September 2005 (vol. 2 no. 3)
pp. 203-216
High-throughput genomic measurements, interpreted as co-occurring data samples from multiple sources, open up a fresh problem for machine learning: what do the different data sets have in common, that is, what kind of statistical dependencies exist between the paired samples from the different sets? We introduce a clustering algorithm for exploring these dependencies. Samples within each data set are grouped such that the dependencies between groups of different sets capture as much of the pairwise dependencies between the samples as possible. We formalize this problem in a novel probabilistic way, as optimization of a Bayes factor. The method is applied to reveal commonalities and exceptions in gene expression between organisms and to suggest regulatory interactions in the form of dependencies between gene expression profiles and regulator binding patterns.
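
As a rough illustration of the objective described above, the following sketch scores a pair of clusterings by a Bayes factor computed on their contingency table. It compares a model in which the cell probabilities are free (dependent margins) against a model in which they factor into row and column probabilities (independent margins). The symmetric Dirichlet priors, the hyperparameter values, and the function names `contingency_table` and `log_bayes_factor` are assumptions made for this example and are not taken from the article; the article's own formulation and its optimization of the cluster assignments may differ.

```python
from math import lgamma


def contingency_table(labels_x, labels_y, n_x, n_y):
    """Count how often cluster i of the first data set co-occurs with
    cluster j of the second data set over the paired samples."""
    table = [[0] * n_y for _ in range(n_x)]
    for i, j in zip(labels_x, labels_y):
        table[i][j] += 1
    return table


def _log_dirichlet_multinomial(counts, alpha):
    """Log marginal likelihood of counts under a symmetric Dirichlet(alpha)
    prior, omitting the multinomial coefficient (it cancels in the ratio)."""
    k = len(counts)
    total = sum(counts)
    value = lgamma(k * alpha) - lgamma(total + k * alpha)
    for c in counts:
        value += lgamma(c + alpha) - lgamma(alpha)
    return value


def log_bayes_factor(table, alpha_cell=1.0, alpha_row=1.0, alpha_col=1.0):
    """Log Bayes factor: dependent margins versus independent margins.

    Positive values favor dependence between the two clusterings.
    The prior strengths are illustrative assumptions.
    """
    cells = [c for row in table for c in row]
    rows = [sum(row) for row in table]
    cols = [sum(col) for col in zip(*table)]
    log_dependent = _log_dirichlet_multinomial(cells, alpha_cell)
    log_independent = (_log_dirichlet_multinomial(rows, alpha_row)
                       + _log_dirichlet_multinomial(cols, alpha_col))
    return log_dependent - log_independent


if __name__ == "__main__":
    # Hypothetical cluster labels for six paired samples.
    x_labels = [0, 0, 0, 1, 1, 1]
    y_labels = [1, 1, 1, 0, 0, 0]
    table = contingency_table(x_labels, y_labels, 2, 2)
    print(table, log_bayes_factor(table))
```

In a clustering procedure of the kind the abstract describes, a score like this would be recomputed as the cluster assignments in each data set change, with assignments chosen to make the score as large as possible.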

Index Terms: Biology and genetics, clustering, contingency table analysis, machine learning, multivariate statistics.
Samuel Kaski, Janne Nikkilä, Janne Sinkkonen, Leo Lahti, Juha E.A. Knuuttila, Christophe Roos, "Associative Clustering for Exploring Dependencies between Functional Genomics Data Sets," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 2, no. 3, pp. 203-216, July-Sept. 2005, doi:10.1109/TCBB.2005.32