Issue No.05 - Sept.-Oct. (2012 vol.9)
pp: 1422-1431
Herbert Pang , Sch. of Med., Dept. of Biostat. & Bioinf., Duke Univ., Durham, NC, USA
Stephen L. George , Sch. of Med., Dept. of Biostat. & Bioinf., Duke Univ., Durham, NC, USA
Ken Hui , Sch. of Med., Dept. of Internal Med., Yale Univ., New Haven, CT, USA
Tiejun Tong , Dept. of Math., Hong Kong Baptist Univ., Kowloon Tong, China
ABSTRACT
Although many feature selection methods have been developed for classification, methods are still needed to identify prognostic genes in high-dimensional data with censored survival outcomes. Traditional gene selection methods for classification problems have several drawbacks. First, most of them are single-gene based. Second, in many of them the selection procedure is not embedded in the learning algorithm itself. Random forests have been found to perform well in high-dimensional settings with survival outcomes, and they provide an embedded measure of variable importance, which makes them a natural candidate for gene selection in such data. In this paper, we develop a novel method based on random forests to identify a set of prognostic genes. We compare our method with several machine learning methods and various node split criteria using several real data sets. Our method performed well in both simulations and real data analysis. We also show the advantages of our approach over single-gene-based approaches. Our method incorporates multivariate correlations in microarray data for survival outcomes, and so allows better use of the information available in such data.
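The general idea of iterative feature elimination driven by an importance ranking can be sketched as follows. This is only an illustrative outline, not the authors' implementation: the abstract does not state the elimination fraction, the stopping rule, or the importance measure, so `drop_fraction`, `min_features`, and all function names here are hypothetical. To stay self-contained, the sketch uses a simple univariate concordance score as a stand-in for the random forest variable importance that the paper relies on; in practice `importance_fn` would be replaced by an importance computed from a survival forest fitted on the current gene set.

```python
from typing import Callable, Dict, List, Sequence

Row = Sequence[float]
ImportanceFn = Callable[[List[Row], List[float], List[int], List[int]], Dict[int, float]]


def concordance(scores: List[float], times: List[float], events: List[int]) -> float:
    """Harrell-style concordance for right-censored data.

    A pair (i, j) is usable when subject i has an observed event (events[i] == 1)
    and a shorter time than subject j. The pair is concordant when the subject
    with the shorter time has the higher score; tied scores count as 0.5.
    """
    agree, usable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] == 1 and times[i] < times[j]:
                usable += 1
                if scores[i] > scores[j]:
                    agree += 1.0
                elif scores[i] == scores[j]:
                    agree += 0.5
    return agree / usable if usable else 0.5


def toy_importance(X: List[Row], times: List[float], events: List[int],
                   features: List[int]) -> Dict[int, float]:
    """ASSUMPTION: stand-in importance, |C - 0.5| per feature.

    This is a single-gene score used only so the sketch runs without extra
    libraries. It is NOT the random forest importance used in the paper.
    """
    return {f: abs(concordance([row[f] for row in X], times, events) - 0.5)
            for f in features}


def iterative_elimination(X: List[Row], times: List[float], events: List[int],
                          importance_fn: ImportanceFn,
                          drop_fraction: float = 0.5,
                          min_features: int = 1) -> List[List[int]]:
    """Repeatedly rank the surviving features and drop the least important ones.

    Returns the feature set kept at each round, starting with all features.
    min_features is assumed to be at least 1.
    """
    features = list(range(len(X[0])))
    history = [list(features)]
    while len(features) > min_features:
        imp = importance_fn(X, times, events, features)
        ranked = sorted(features, key=lambda f: imp[f], reverse=True)
        n_keep = max(min_features, int(len(ranked) * (1 - drop_fraction)))
        n_keep = min(n_keep, len(ranked) - 1)  # always drop at least one feature
        features = sorted(ranked[:n_keep])
        history.append(list(features))
    return history


# Toy data: 6 subjects, 4 "genes". Gene 0 decreases as survival time increases.
times = [2, 4, 6, 8, 10, 12]
events = [1, 1, 0, 1, 1, 0]          # 1 = event observed, 0 = censored
X = [
    [6, 1, 1, 2],
    [5, 3, 1, 1],
    [4, 2, 1, 3],
    [3, 3, 1, 1],
    [2, 1, 1, 3],
    [1, 2, 1, 2],
]

history = iterative_elimination(X, times, events, toy_importance)
for round_no, kept in enumerate(history):
    print(f"round {round_no}: kept features {kept}")
```

Passing the importance function as a parameter keeps the elimination loop independent of how importance is measured, so a forest-based, multivariate importance can be swapped in without changing the loop.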
INDEX TERMS
pattern classification, biology computing, feature extraction, genetics, genomics, iterative methods, lab-on-a-chip, learning (artificial intelligence), microarray data, gene selection, iterative feature elimination random forests, feature selection methods, censored survival outcomes, classification problems, single-gene based classification, high-dimensional data settings, machine learning methods, node split criteria, cancer, random processes, survival, iterative feature elimination, microarrays, random forest
CITATION
Herbert Pang, Stephen L. George, Ken Hui, Tiejun Tong, "Gene Selection Using Iterative Feature Elimination Random Forests for Survival Outcomes", IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol.9, no. 5, pp. 1422-1431, Sept.-Oct. 2012, doi:10.1109/TCBB.2012.63