Stable Gene Selection from Microarray Data via Sample Weighting
January/February 2012 (vol. 9 no. 1)
pp. 262-272
Lei Yu, Dept. of Comput. Sci., State Univ. of New York, Binghamton, NY, USA
Yue Han, Dept. of Comput. Sci., State Univ. of New York, Binghamton, NY, USA
M. E. Berens, Cancer & Cell Biol. Div., Translational Genomics Res. Inst., Phoenix, AZ, USA
Feature selection from gene expression microarray data is widely used to select candidate genes in cancer studies. Besides the predictive ability of the selected genes, an important criterion for evaluating a selection method is the stability of its output. Experts naturally place more confidence in a method that selects similar sets of genes when the samples vary. However, a common problem with existing feature selection methods for gene expression data is that the genes selected by the same method often vary significantly under sample variations. In this work, we propose a general sample weighting framework to improve the stability of feature selection methods under sample variations. The framework first weights each sample in a given training set according to its influence on the estimation of feature relevance, and then passes the weighted training set to a feature selection method. We also develop an efficient margin-based sample weighting algorithm under this framework. Experiments on a set of microarray data sets show that the proposed algorithm significantly improves the stability of representative feature selection algorithms such as SVM-RFE and ReliefF without sacrificing their classification performance. Moreover, the proposed algorithm yields more stable gene signatures than the state-of-the-art ensemble method, particularly for small signature sizes.
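The two-step framework described above (weight the samples, then run feature selection on the weighted training set) can be illustrated with a minimal sketch. The abstract does not give the weighting rule, so everything here is an assumption made for illustration: the per-feature hypothesis margin (distance to the nearest miss minus distance to the nearest hit), the rule that down-weights samples whose margin vector lies far from the others, and the toy weighted filter that stands in for a real selector such as SVM-RFE or ReliefF. All function names are hypothetical.

```python
import math


def nearest(x, candidates):
    """Return the candidate closest to x in Manhattan distance."""
    return min(candidates, key=lambda c: sum(abs(a - b) for a, b in zip(x, c)))


def margin_vectors(X, y):
    """Per-sample, per-feature margin: |x - nearest miss| - |x - nearest hit|.

    Assumes every class has at least two samples, so a hit always exists.
    """
    vectors = []
    for i, (x, label) in enumerate(zip(X, y)):
        hits = [X[j] for j in range(len(X)) if j != i and y[j] == label]
        misses = [X[j] for j in range(len(X)) if y[j] != label]
        hit, miss = nearest(x, hits), nearest(x, misses)
        vectors.append([abs(a - m) - abs(a - h) for a, h, m in zip(x, hit, miss)])
    return vectors


def sample_weights(X, y):
    """Assumed rule: samples whose margin vector is far from the rest get less weight."""
    M = margin_vectors(X, y)
    n = len(M)
    avg_dist = [
        sum(math.dist(M[i], M[j]) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]
    raw = [1.0 / (1.0 + d) for d in avg_dist]
    total = sum(raw)
    return [n * r / total for r in raw]  # normalised so the weights average to 1


def weighted_relevance(X, y, w):
    """Toy weighted filter: |difference of weighted class means| per feature."""
    labels = sorted(set(y))
    if len(labels) != 2:
        raise ValueError("this sketch handles two classes only")
    scores = []
    for f in range(len(X[0])):
        means = []
        for c in labels:
            idx = [i for i in range(len(X)) if y[i] == c]
            means.append(sum(w[i] * X[i][f] for i in idx) / sum(w[i] for i in idx))
        scores.append(abs(means[0] - means[1]))
    return scores


def top_k(scores, k):
    """Indices of the k highest-scoring features."""
    return sorted(range(len(scores)), key=lambda f: -scores[f])[:k]


# Tiny made-up data set: 6 samples, 3 "genes"; only gene 0 separates the classes.
X = [
    [1.0, 5.0, 3.0],
    [1.2, 5.3, 3.1],
    [0.9, 5.4, 2.9],
    [3.0, 5.1, 3.0],
    [3.1, 5.2, 3.2],
    [2.9, 5.4, 2.8],
]
y = [0, 0, 0, 1, 1, 1]

w = sample_weights(X, y)               # step 1: weight the samples
scores = weighted_relevance(X, y, w)   # step 2: weighted feature selection
print("sample weights:", [round(v, 3) for v in w])
print("selected gene indices:", top_k(scores, 1))
```

Keeping the weighting step separate from the selector mirrors the framework's structure: any selection method that accepts per-sample weights could replace `weighted_relevance` without changing `sample_weights`.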

Index Terms:
support vector machines, arrays, biology computing, feature extraction, genetics, ReliefF algorithm, gene expression microarray data, gene expression microarray, gene selection, feature relevance estimation, weighted training set, feature selection method, margin-based sample weighting algorithm, SVM-RFE algorithm, training, stability analysis, gene expression, cancer, bioinformatics, Monte Carlo methods, feature selection, stability, classification
Lei Yu, Yue Han, M. E. Berens, "Stable Gene Selection from Microarray Data via Sample Weighting," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 9, no. 1, pp. 262-272, Jan.-Feb. 2012, doi:10.1109/TCBB.2011.47