A Survey on Filter Techniques for Feature Selection in Gene Expression Microarray Analysis
July-Aug. 2012 (vol. 9 no. 4)
pp. 1106-1119
V. de Schaetzen, Univ. Libre de Bruxelles, Brussels, Belgium
C. Molter, Univ. Libre de Bruxelles, Brussels, Belgium
A. Coletta, Univ. Libre de Bruxelles, Brussels, Belgium
D. Steenhoff, Dept. of Comput. Sci., Vrije Univ. Brussel, Brussels, Belgium
S. Meganck, Dept. of Comput. Sci., Vrije Univ. Brussel, Brussels, Belgium
J. Taminau, Dept. of Comput. Sci., Vrije Univ. Brussel, Brussels, Belgium
C. Lazar, Dept. of Comput. Sci., Vrije Univ. Brussel, Brussels, Belgium
R. Duque, Univ. Libre de Bruxelles, Brussels, Belgium
H. Bersini, IRIDIA, Univ. Libre de Bruxelles, Bruxelles, Belgium
A. Nowe, Dept. of Comput. Sci., Vrije Univ. Brussel, Brussels, Belgium
A plenitude of feature selection (FS) methods is available in the literature, most of them arising from the need to analyze data of very high dimension, usually hundreds or thousands of variables. Such data sets are now available in various application areas such as combinatorial chemistry, text mining, multivariate imaging, and bioinformatics. As a generally accepted rule, these methods are grouped into filters, wrappers, and embedded methods. More recently, a new group of methods has been added to the general framework of FS: ensemble techniques. This survey focuses on filter feature selection methods for informative feature discovery in gene expression microarray (GEM) analysis, which is also known as differentially expressed gene (DEG) discovery, gene prioritization, or biomarker discovery. We present them in a unified framework, using standardized notation in order to reveal their technical details and to highlight their common characteristics as well as their particularities.

Index Terms:
information filters, arrays, bioinformatics, genetics, standardized notations, gene expression microarray analysis, combinatorial chemistry, text mining, multivariate imaging, filter feature selection methods, GEM analysis, differentially expressed gene discovery, gene prioritization, biomarker discovery, Gene expression, Taxonomy, Measurement, Search methods, Computational biology, gene expression data, Feature selection, gene ranking, scoring functions, statistical methods
Citation:
V. de Schaetzen, C. Molter, A. Coletta, D. Steenhoff, S. Meganck, J. Taminau, C. Lazar, R. Duque, H. Bersini, A. Nowe, "A Survey on Filter Techniques for Feature Selection in Gene Expression Microarray Analysis," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 9, no. 4, pp. 1106-1119, July-Aug. 2012, doi:10.1109/TCBB.2012.33