Cancer Classification from Gene Expression Data by NPPC Ensemble
May/June 2011 (vol. 8 no. 3)
pp. 659-671
Santanu Ghorai, MCKV Institute of Engineering, Howrah and Indian Institute of Technology, Kharagpur
Anirban Mukherjee, Indian Institute of Technology, Kharagpur
Sanghamitra Sengupta, University of Calcutta, Kolkata
Pranab K. Dutta, Indian Institute of Technology, Kharagpur
The most important application of microarrays in gene expression analysis is to classify unknown tissue samples according to their gene expression levels with the help of known sample expression levels. In this paper, we present a nonparallel plane proximal classifier (NPPC) ensemble that ensures higher classification accuracy of test samples in a computer-aided diagnosis (CAD) framework than that of a single NPPC model. For each data set, only a few genes are selected by using a mutual information criterion. Then a genetic algorithm-based simultaneous feature and model selection scheme is used to train a number of NPPC expert models in multiple subspaces by maximizing cross-validation accuracy. The members of the ensemble are selected by the performance of the trained models on a validation set. Besides the usual majority voting method, we have introduced a minimum average proximity-based decision combiner for the NPPC ensemble. The effectiveness of the NPPC ensemble and the proposed new approach of combining decisions for cancer diagnosis are studied and compared with a support vector machine (SVM) classifier in a similar framework. Experimental results on cancer data sets show that the NPPC ensemble offers testing accuracy comparable to that of the SVM ensemble, with reduced training time on average.
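
The two decision combiners mentioned above can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes each ensemble member stores a feature subset and two already-trained planes (one per class), and it reads "minimum average proximity" as choosing the class whose plane is nearest on average across members. The dictionary layout and all function names are hypothetical.

```python
import numpy as np

def plane_distances(X, w, b):
    """Perpendicular distance of each row of X from the plane x.w + b = 0."""
    return np.abs(X @ w + b) / np.linalg.norm(w)

def member_distances(X, member):
    """Distances of each sample to one member's two class planes, shape (n, 2).

    `member` is a hypothetical dict: {"features": [...], "planes": [(w1, b1), (w2, b2)]}.
    """
    Xs = X[:, member["features"]]          # project onto the member's gene subspace
    (w1, b1), (w2, b2) = member["planes"]
    return np.column_stack([plane_distances(Xs, w1, b1),
                            plane_distances(Xs, w2, b2)])

def majority_vote(X, members):
    """Each member votes for the class of its nearer plane; ties go to class 0."""
    votes = np.stack([member_distances(X, m).argmin(axis=1) for m in members])
    return (votes.sum(axis=0) * 2 > len(members)).astype(int)

def min_average_proximity(X, members):
    """Assumed reading: average the distances over members, pick the nearer class."""
    avg = np.mean([member_distances(X, m) for m in members], axis=0)
    return avg.argmin(axis=1)

if __name__ == "__main__":
    # Toy ensemble: member i looks only at gene i; class-0 plane at x = 0, class-1 plane at x = 2.
    planes = [(np.array([1.0]), 0.0), (np.array([1.0]), -2.0)]
    members = [{"features": [i], "planes": planes} for i in range(3)]
    X = np.array([[0.9, 0.9, 2.0],
                  [0.1, 0.2, 0.3]])
    print(majority_vote(X, members))          # labels from voting
    print(min_average_proximity(X, members))  # labels from averaged proximity
```

In the toy data, the first sample shows how the two rules can disagree: two members lean weakly toward class 0 while the third sits exactly on its class-1 plane, so the vote and the averaged distance point to different classes.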

Index Terms:
Cancer classification, classifier ensemble, combination of multiple classifiers, microarray data analysis, proximal classifier.
Santanu Ghorai, Anirban Mukherjee, Sanghamitra Sengupta, Pranab K. Dutta, "Cancer Classification from Gene Expression Data by NPPC Ensemble," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 8, no. 3, pp. 659-671, May-June 2011, doi:10.1109/TCBB.2010.36