Data-Fusion in Clustering Microarray Data: Balancing Discovery and Interpretability
January-March 2010 (vol. 7 no. 1)
pp. 50-63
While clustering genes remains one of the most popular exploratory tools for expression data, it often results in highly variable and biologically uninformative clusters. This paper explores a data-fusion approach to clustering microarray data. Our method, which combines expression data and Gene Ontology (GO)-derived information, is applied to a real data set to perform genome-wide clustering. A set of novel tools is proposed to validate the clustering results and pick a fair value of the infusion coefficient. These tools measure stability, biological relevance, and distance from the expression-only clustering solution. Our results indicate that data-fusion clustering leads to more stable, biologically relevant clusters that are still representative of the experimental data.
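The abstract does not spell out how the two information sources are combined, so the following is only a minimal sketch under one common assumption: that the fused dissimilarity is a convex combination of an expression-based dissimilarity matrix and a GO-based one, weighted by the infusion coefficient. The function name `fuse_dissimilarities`, the parameter `lam`, and the toy matrices are hypothetical and are not taken from the paper.

```python
from typing import List

Matrix = List[List[float]]


def fuse_dissimilarities(d_expr: Matrix, d_go: Matrix, lam: float) -> Matrix:
    """Return (1 - lam) * d_expr + lam * d_go, element by element.

    Assumption (not stated in the abstract): the infusion coefficient
    `lam` lies in [0, 1]; lam = 0 gives the expression-only solution and
    lam = 1 gives the GO-only solution.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    same_shape = len(d_expr) == len(d_go) and all(
        len(row_e) == len(row_g) for row_e, row_g in zip(d_expr, d_go)
    )
    if not same_shape:
        raise ValueError("both matrices must have the same shape")
    return [
        [(1.0 - lam) * e + lam * g for e, g in zip(row_e, row_g)]
        for row_e, row_g in zip(d_expr, d_go)
    ]


# Toy dissimilarities for three genes (made-up numbers, for illustration only).
D_EXPR = [[0.0, 0.2, 0.9],
          [0.2, 0.0, 0.8],
          [0.9, 0.8, 0.0]]
D_GO = [[0.0, 0.6, 0.1],
        [0.6, 0.0, 0.7],
        [0.1, 0.7, 0.0]]

# Equal weight on both sources; the result would be passed to any
# dissimilarity-based clustering routine.
fused = fuse_dissimilarities(D_EXPR, D_GO, 0.5)
```

Under this assumption, sweeping `lam` from 0 to 1 and clustering each fused matrix is one way to trace the trade-off the abstract describes between stability, biological relevance, and distance from the expression-only solution.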

Index Terms:
Clustering expression data, Gene Ontology, genomic data fusion, semantic similarity, cluster stability, knowledge-based validation.
Citation:
Rafal Kustra, Adam Zagdański, "Data-Fusion in Clustering Microarray Data: Balancing Discovery and Interpretability," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 7, no. 1, pp. 50-63, Jan.-March 2010, doi:10.1109/TCBB.2007.70267