Issue No.01 - January-March (2010 vol.7)
pp: 50-63
ABSTRACT
While clustering genes remains one of the most popular exploratory tools for expression data, it often results in highly variable and biologically uninformative clusters. This paper explores a data-fusion approach to clustering microarray data. Our method, which combines expression data and Gene Ontology (GO)-derived information, is applied to a real data set to perform genome-wide clustering. A set of novel tools is proposed to validate the clustering results and to pick a fair value of the infusion coefficient. These tools measure stability, biological relevance, and distance from the expression-only clustering solution. Our results indicate that data-fusion clustering leads to more stable, biologically relevant clusters that are still representative of the experimental data.
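The abstract does not state how the two information sources are combined. As a rough illustration only, one simple way to fuse them is a convex combination of an expression-based dissimilarity matrix and a GO-derived one, with a coefficient playing the role of the infusion coefficient. The function name `fuse_dissimilarities`, the parameter `lam`, and the toy matrices below are hypothetical and are not taken from the paper.

```python
def fuse_dissimilarities(d_expr, d_go, lam):
    """Blend two square dissimilarity matrices given as lists of lists.

    lam = 0 keeps only the expression-based distances; lam = 1 keeps only
    the GO-derived distances. The convex-combination form is an assumption
    made for illustration, not a formula taken from the abstract.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    n = len(d_expr)
    same_shape = len(d_go) == n and all(
        len(row_e) == n and len(row_g) == n
        for row_e, row_g in zip(d_expr, d_go)
    )
    if not same_shape:
        raise ValueError("both matrices must be square and the same size")
    # Entry-wise weighted average of the two dissimilarities.
    return [
        [(1.0 - lam) * d_expr[i][j] + lam * d_go[i][j] for j in range(n)]
        for i in range(n)
    ]


# Toy distances for three hypothetical genes (made-up numbers).
d_expr = [[0.0, 0.2, 0.9], [0.2, 0.0, 0.8], [0.9, 0.8, 0.0]]
d_go = [[0.0, 0.6, 0.1], [0.6, 0.0, 0.7], [0.1, 0.7, 0.0]]
fused = fuse_dissimilarities(d_expr, d_go, 0.5)
```

Under this assumed form, the fused matrix could be passed to any dissimilarity-based clustering procedure, and sweeping `lam` from 0 to 1 moves the solution from the expression-only clustering toward one driven entirely by GO annotations.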
INDEX TERMS
Clustering expression data, Gene Ontology, genomic data fusion, semantic similarity, cluster stability, knowledge-based validation.
CITATION
Rafal Kustra, "Data-Fusion in Clustering Microarray Data: Balancing Discovery and Interpretability", IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol.7, no. 1, pp. 50-63, January-March 2010, doi:10.1109/TCBB.2007.70267