Development of Two-Stage SVM-RFE Gene Selection Strategy for Microarray Expression Data Analysis
July-September 2007 (vol. 4 no. 3)
pp. 365-381
Extracting a subset of informative genes from microarray expression data is a critical data preparation step in cancer classification and other biological function analyses. Among the many algorithms developed for this task, the Support Vector Machine - Recursive Feature Elimination (SVM-RFE) algorithm is one of the best gene feature selection algorithms. SVM-RFE rests on the assumption that a smaller "filter-out" factor, which eliminates fewer gene features in each recursion, should lead to extraction of a better gene subset. Our simulations have shown that this assumption is not always correct: SVM-RFE is highly sensitive to the "filter-out" factor, which makes it an unstable algorithm. To select a set of key gene features for reliable prediction of cancer types or subtypes and for other applications, a new two-stage SVM-RFE algorithm has been developed. The first stage is designed to eliminate most of the irrelevant, redundant, and noisy genes while keeping information loss small. The second stage then performs a fine selection of the final gene subset. The two-stage SVM-RFE overcomes the instability problem of SVM-RFE and thereby achieves better algorithm utility. In our analysis of three publicly available microarray expression datasets, the two-stage SVM-RFE was significantly more accurate and more reliable than SVM-RFE and three correlation-based methods. Furthermore, the two-stage SVM-RFE is computationally efficient: its time complexity is $O(d \log_2 d)$, where $d$ is the size of the original gene set.
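
The role of the "filter-out" factor can be illustrated with a small sketch of the elimination schedule alone, without any SVM training. It is not the authors' implementation. It assumes that the factor is the fraction of surviving genes dropped per recursion, and that the second stage removes one gene at a time. The function names and the parameters `filter_out`, `stage1_size`, and `target` are hypothetical and do not come from the paper.

```python
def elimination_schedule(d, filter_out, target=1):
    """Return the number of genes surviving after each recursion.

    Assumption: `filter_out` is the fraction of the currently surviving
    genes dropped per recursion; at least one gene is always dropped.
    """
    if not 0 < filter_out < 1:
        raise ValueError("filter_out must be in (0, 1)")
    counts = [d]
    remaining = d
    while remaining > target:
        drop = max(1, int(remaining * filter_out))
        remaining = max(target, remaining - drop)
        counts.append(remaining)
    return counts


def two_stage_schedule(d, coarse_filter_out, stage1_size, target=1):
    """Coarse elimination down to `stage1_size`, then one gene per recursion.

    Assumption: this coarse-then-fine split is only an illustration of the
    two-stage idea, not the published procedure.
    """
    stage1 = elimination_schedule(d, coarse_filter_out, target=stage1_size)
    last = stage1[-1]
    stage2 = list(range(last - 1, target - 1, -1))
    return stage1 + stage2


if __name__ == "__main__":
    print(elimination_schedule(16, 0.5))      # halve the gene set each recursion
    print(two_stage_schedule(16, 0.5, 4))     # coarse to 4 genes, then fine
```

With `filter_out = 0.5` the gene set is halved in every recursion, so reaching a single gene from $d$ genes takes about $\log_2 d$ recursions. A smaller factor lengthens the schedule, and each extra recursion costs one more SVM training.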

Index Terms:
Bioinformatics, Microarray Gene Expression Data Analysis, Cancer Classification, Support Vector Machines, Gene Selection, Feature Selection, Recursive Feature Elimination
Yuchun Tang, Yan-Qing Zhang, Zhen Huang, "Development of Two-Stage SVM-RFE Gene Selection Strategy for Microarray Expression Data Analysis," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 4, no. 3, pp. 365-381, July-Sept. 2007, doi:10.1109/TCBB.2007.70224