Issue No.02 - April-June (2009 vol.6)
pp: 353-367
Topon Kumar Paul, Toshiba Corporation, Kanagawa
Hitoshi Iba, The University of Tokyo, Japan
To better understand different types of cancer and to find possible disease biomarkers, many researchers have recently been analyzing gene expression data with various machine learning techniques. However, because the number of training samples is very small compared with the number of genes, and because the classes are imbalanced, most of these methods suffer from overfitting. In this paper, we present a majority voting genetic programming classifier (MVGPC) for the classification of microarray data. Instead of a single rule or a single set of rules, we evolve multiple rules with genetic programming (GP) and then apply those rules to test samples, determining their labels by majority voting. In experiments on four public cancer data sets, including multiclass data sets, the test accuracies of MVGPC were better than those of other methods, including AdaBoost with GP. Moreover, some of the genes that occur most frequently in the classification rules are known to be associated with the types of cancer studied in this paper.
Classifier design and evaluation, data mining, feature extraction, evolutionary computing and genetic algorithm, gene expression, majority voting.
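The majority voting step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the rule functions, gene indices, thresholds, class names, and tie-breaking behavior below are all hypothetical, chosen only to show how several independently obtained rules might be combined into one predicted label.

```python
from collections import Counter
from typing import Callable, List, Sequence

# A "rule" here is any function mapping a gene expression vector to a class
# label. In MVGPC the rules are evolved with GP; these are hand-written
# placeholders (hypothetical, for illustration only).
Rule = Callable[[Sequence[float]], str]


def majority_vote(rules: List[Rule], sample: Sequence[float]) -> str:
    """Apply every rule to the sample and return the most frequent label.

    Assumption: ties are broken in favor of the label that was voted first,
    which is how Counter.most_common orders equal counts.
    """
    votes = Counter(rule(sample) for rule in rules)
    return votes.most_common(1)[0][0]


# Three hypothetical rules over expression values of genes 0, 1, and 2.
rules: List[Rule] = [
    lambda x: "tumor" if x[0] - x[2] > 0 else "normal",
    lambda x: "tumor" if x[1] > 1.5 else "normal",
    lambda x: "tumor" if x[0] * x[1] > x[2] else "normal",
]

sample = [2.0, 1.0, 0.5]
label = majority_vote(rules, sample)  # two of three rules vote "tumor"
```

With an odd number of rules and two classes, a tie cannot occur; for multiclass data some explicit tie-breaking policy would be needed, and the one used here is only an assumption.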
Topon Kumar Paul, Hitoshi Iba, "Prediction of Cancer Class with Majority Voting Genetic Programming Classifier Using Gene Expression Data", IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol.6, no. 2, pp. 353-367, April-June 2009, doi:10.1109/TCBB.2007.70245