Issue No.04 - July-Aug. (2013 vol.10)
pp: 845-857
Pablo A. Jaskowiak , Inst. of Math. & Comput. Sci., Univ. of Sao Paulo, Sao Carlos, Brazil
Ricardo J. G. B. Campello , Inst. of Math. & Comput. Sci., Univ. of Sao Paulo, Sao Carlos, Brazil
Ivan G. Costa , Center of Inf., Fed. Univ. of Pernambuco, Recife, Brazil
ABSTRACT
Cluster analysis is usually the first step adopted to unveil information from gene expression microarray data. Besides selecting a clustering algorithm, choosing an appropriate proximity measure (similarity or distance) is of great importance to achieve satisfactory clustering results. Nevertheless, to date there are no comprehensive guidelines on how to choose proximity measures for clustering microarray data. Pearson correlation is the most widely used proximity measure, whereas the characteristics of other measures remain unexplored. In this paper, we investigate the choice of proximity measures for the clustering of microarray data by evaluating the performance of 16 proximity measures on 52 data sets from time-course and cancer experiments. Our results indicate that measures rarely employed in the gene expression literature can provide better results than commonly employed ones, such as Pearson, Spearman, and Euclidean distance. Given that different measures stood out in the time-course and cancer data evaluations, the choice of measure should be specific to each scenario. To evaluate measures on time-course data, we preprocessed and compiled 17 data sets from the microarray literature into a benchmark and developed a new methodology, called Intrinsic Biological Separation Ability (IBSA). Both can be employed in future research to assess the effectiveness of new measures for gene time-course data.
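As a rough illustration of why the choice of proximity measure matters, the sketch below computes the three commonly employed measures named in the abstract (Pearson, Spearman, and Euclidean distance) on toy expression profiles. The profiles and all function names are invented for this example; this is not the paper's code, and it does not implement IBSA or the other measures the study evaluates.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(v):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def euclidean(x, y):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical expression profiles over five time points (invented data).
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_b = [10.0, 20.0, 30.0, 40.0, 50.0]  # same shape as gene_a, 10x the scale
gene_c = [1.0, 2.0, 3.0, 4.0, 50.0]      # same ordering, one extreme value

for name, other in (("gene_b", gene_b), ("gene_c", gene_c)):
    print(name,
          "pearson=%.3f" % pearson(gene_a, other),
          "spearman=%.3f" % spearman(gene_a, other),
          "euclidean=%.3f" % euclidean(gene_a, other))
```

On this toy data, the two correlations treat `gene_a` and `gene_b` as essentially identical while the Euclidean distance between them is large, and the single extreme value in `gene_c` lowers the Pearson correlation but leaves the Spearman correlation unchanged. Disagreements of this kind are why a measure may suit one scenario better than another.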
INDEX TERMS
Correlation, Equations, Cancer, Gene expression, Clustering algorithms, Time complexity, time course, Proximity measure, distance, similarity, correlation coefficient, clustering, gene expression, cancer
CITATION
Pablo A. Jaskowiak, Ricardo J. G. B. Campello, Ivan G. Costa, "Proximity Measures for Clustering Gene Expression Microarray Data: A Validation Methodology and a Comparative Analysis", IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol.10, no. 4, pp. 845-857, July-Aug. 2013, doi:10.1109/TCBB.2013.9