Empirical Evidence of the Applicability of Functional Clustering through Gene Expression Classification
May-June 2012 (vol. 9 no. 3)
pp. 788-798
J. Klema, Dept. of Cybern., Czech Tech. Univ. in Prague, Prague, Czech Republic
M. Krejnik, Dept. of Cybern., Czech Tech. Univ. in Prague, Prague, Czech Republic
The availability of a great range of prior biological knowledge about the roles and functions of genes and gene-gene interactions allows us to simplify the analysis of gene expression data to make it more robust, compact, and interpretable. Here, we objectively analyze the applicability of functional clustering for the identification of groups of functionally related genes. The analysis is performed in terms of gene expression classification and uses predictive accuracy as an unbiased performance measure. Features of biological samples that originally corresponded to genes are replaced by features that correspond to the centroids of the gene clusters and are then used for classifier learning. Using 10 benchmark data sets, we demonstrate that functional clustering significantly outperforms random clustering without biological relevance. We also show that functional clustering performs comparably to gene expression clustering, which groups genes according to the similarity of their expression profiles. Finally, the suitability of functional clustering as a feature extraction technique is evaluated and discussed.
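The feature replacement described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes an expression matrix with samples in rows and genes in columns, takes a cluster centroid to be the mean expression of the cluster's genes, and builds a random baseline by permuting the gene-to-cluster labels (which keeps cluster sizes fixed). The function names `centroid_features` and `random_clustering` are hypothetical.

```python
import numpy as np


def centroid_features(X, labels):
    """Replace gene-level features with cluster-centroid features.

    X      : array of shape (n_samples, n_genes), expression values.
    labels : array of shape (n_genes,), cluster id assigned to each gene.
    Returns an array of shape (n_samples, n_clusters); column k is the
    mean expression of the genes in the k-th cluster (ids in sorted order).
    Assumption: the centroid is the arithmetic mean of the member genes.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    cluster_ids = np.unique(labels)
    return np.column_stack(
        [X[:, labels == c].mean(axis=1) for c in cluster_ids]
    )


def random_clustering(labels, seed=0):
    """Baseline without biological relevance (assumed construction):
    shuffle the gene-to-cluster assignment, preserving cluster sizes."""
    rng = np.random.default_rng(seed)
    return rng.permutation(np.asarray(labels))


# Toy data: 2 samples, 4 genes, 2 clusters of 2 genes each.
X = np.array([[1.0, 3.0, 10.0, 20.0],
              [2.0, 4.0, 30.0, 50.0]])
functional_labels = np.array([0, 0, 1, 1])

F_functional = centroid_features(X, functional_labels)
F_random = centroid_features(X, random_clustering(functional_labels))
```

Either feature matrix can then be passed to any classifier in place of the original gene-level matrix, so that predictive accuracy under the two clusterings can be compared on equal terms.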

Index Terms:
learning (artificial intelligence), bioinformatics, data analysis, feature extraction, genetics, biological sample, empirical evidence, functional clustering, gene expression classification, biological knowledge, gene-gene interactions, gene expression data, functionally related gene, gene clusters, classifier learning, benchmark data sets, gene expression clustering, feature extraction technique, clustering algorithms, algorithm design and analysis, gene expression, partitioning algorithms, classification, biological prior knowledge, gene set analysis, clustering
J. Klema, M. Krejnik, "Empirical Evidence of the Applicability of Functional Clustering through Gene Expression Classification," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 9, no. 3, pp. 788-798, May-June 2012, doi:10.1109/TCBB.2012.23