Issue No. 1 - January-February 2011 (vol. 8)
pp. 266-272
K.Z. Mao, Nanyang Technological University, Singapore
The Mahalanobis class separability measure provides an effective evaluation of the discriminative power of a feature subset and is widely used in feature selection. However, the measure is computationally intensive, or even prohibitive, when applied to gene expression data. In this study, a recursive approach to evaluating the Mahalanobis measure is proposed with the goal of reducing computational overhead. Instead of evaluating the Mahalanobis measure directly in high-dimensional space, the recursive approach evaluates it through successive evaluations in 2D space. Because of its recursive nature, the approach is extremely efficient when combined with a forward search procedure. In addition, gene subsets selected by the Mahalanobis measure tend to overfit the training data and generalize unsatisfactorily to unseen test data, owing to the small sample sizes typical of gene expression problems. To alleviate overfitting, a regularized recursive Mahalanobis measure is proposed in this study, and guidelines on determining the regularization parameters are provided. Experimental studies on five gene expression problems show that the regularized recursive Mahalanobis measure substantially outperforms both the nonregularized Mahalanobis measures and the benchmark recursive feature elimination (RFE) algorithm on all five problems.
Gene selection, recursive Mahalanobis measure, regularized Mahalanobis measure.
K.Z. Mao, "Recursive Mahalanobis Separability Measure for Gene Subset Selection", IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 8, no. 1, pp. 266-272, January-February 2011, doi:10.1109/TCBB.2010.43
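The abstract's core quantity is the Mahalanobis separability of a feature subset, with Tikhonov-style regularization of the pooled covariance to combat small-sample overfitting, maximized by a forward search. The sketch below illustrates only those ideas; it does not reproduce the paper's recursive 2D evaluation, and the function names, the default regularization value, and the greedy search details are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def mahalanobis_separability(X1, X2, reg=0.0):
    """J = (m1 - m2)^T (S + reg*I)^{-1} (m1 - m2),
    where S is the pooled within-class covariance.
    reg > 0 gives the regularized (Tikhonov-style) variant."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    n1, n2 = len(X1), len(X2)
    # Pooled covariance; atleast_2d handles the single-feature case.
    S = np.atleast_2d(
        ((n1 - 1) * np.cov(X1, rowvar=False, ddof=1)
         + (n2 - 1) * np.cov(X2, rowvar=False, ddof=1)) / (n1 + n2 - 2)
    )
    d = m1 - m2
    S_reg = S + reg * np.eye(S.shape[0])
    return float(d @ np.linalg.solve(S_reg, d))

def forward_select(X1, X2, k, reg=1e-2):
    """Greedy forward search: at each step add the gene that most
    increases the regularized measure. Note this naive version
    re-solves a linear system per candidate, which is exactly the
    cost the paper's recursive evaluation is designed to avoid."""
    selected = []
    remaining = list(range(X1.shape[1]))
    for _ in range(k):
        best_j, best_score = None, -np.inf
        for j in remaining:
            idx = selected + [j]
            score = mahalanobis_separability(X1[:, idx], X2[:, idx], reg)
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
        remaining.remove(best_j)
    return selected
```

As a usage illustration, on synthetic two-class data in which only one feature carries a mean shift, the forward search should pick that feature first; the regularization parameter trades bias for variance, which is where the paper's guidelines for choosing it come in.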
[1] D. Singh, G.P. Febbo, K. Ross, D.G. Jackson, J. Manola, C. Ladd, P. Tamayo, A.A. Renshaw, A.V. D'Amico, J.P. Richie, E.S. Lander, M. Loda, P.W. Kantoff, T.R. Golub, and W.R. Sellers, "Gene Expression Correlates of Clinical Prostate Cancer Behavior," Cancer Cell, vol. 1, no. 2, pp. 203-209, 2002.
[2] M. Xiong, X. Fang, and J. Zhao, "Biomarker Identification by Feature Wrappers," Genome Research, vol. 11, no. 11, pp. 1878-1887, 2001.
[3] I. Inza, B. Sierra, R. Blanco, and P. Larranaga, "Gene Selection by Sequential Search Wrapper Approaches in Microarray Cancer Class Prediction," J. Intelligent and Fuzzy Systems, vol. 12, no. 1, pp. 25-33, 2002.
[4] C.H. Ooi and P. Tan, "Genetic Algorithms Applied to Multi-Class Prediction for the Analysis of Gene Expression Data," Bioinformatics, vol. 19, no. 1, pp. 37-44, 2003.
[5] U.M. Braga-Neto and E.R. Dougherty, "Is Cross-Validation Valid for Small-Sample Microarray Classification?," Bioinformatics, vol. 20, no. 3, pp. 374-380, 2004.
[6] X. Zhou and K.Z. Mao, "The Ties Problem Resulting from Counting-Based Error Estimators and Its Impact on Gene Selection Algorithms," Bioinformatics, vol. 22, no. 20, pp. 2507-2515, 2006.
[7] Y. Li, C. Campbell, and M. Tipping, "Bayesian Automatic Relevance Determination Algorithms for Classifying Gene Expression Data," Bioinformatics, vol. 18, no. 10, pp. 1332-1339, 2002.
[8] K.E. Lee, N. Sha, E.R. Dougherty, M. Vannucci, and B.K. Mallick, "Gene Selection: A Bayesian Variable Selection Approach," Bioinformatics, vol. 19, no. 1, pp. 90-97, 2003.
[9] K. Bae and B.K. Mallick, "Gene Selection Using a Two-Level Hierarchical Bayesian Model," Bioinformatics, vol. 20, no. 18, pp. 3423-3430, 2005.
[10] K.Y. Yeung, R.E. Bumgarner, and A.E. Raftery, "Bayesian Model Averaging: Development of an Improved Multi-Class, Gene Selection and Classification Tool for Microarray Data," Bioinformatics, vol. 21, no. 10, pp. 2394-2402, 2005.
[11] N. Sha, M.G. Tadesse, and M. Vannucci, "Bayesian Variable Selection for the Analysis of Microarray Data with Censored Outcomes," Bioinformatics, vol. 22, no. 18, pp. 2262-2268, 2006.
[12] S.K. Shevade and S.S. Keerthi, "A Simple and Efficient Algorithm for Gene Selection Using Sparse Logistic Regression," Bioinformatics, vol. 19, no. 17, pp. 2246-2253, 2003.
[13] J. Gui and H. Li, "Penalized Cox Regression Analysis in the High-Dimensional and Low-Sample Size Settings, with Applications to Microarray Gene Expression Data," Bioinformatics, vol. 21, no. 13, pp. 3001-3008, 2005.
[14] L. Fan and Y. Yang, "Analysis of Recursive Gene Selection Approaches from Microarray Data," Bioinformatics, vol. 21, no. 19, pp. 3741-3747, 2005.
[15] Z. Guan and H. Zhao, "A Semiparametric Approach for Marker Gene Selection Based on Gene Expression Data," Bioinformatics, vol. 21, no. 4, pp. 529-536, 2005.
[16] G.C. Cawley and N.L.C. Talbot, "Gene Selection in Cancer Classification Using Sparse Logistic Regression with Bayesian Regularization," Bioinformatics, vol. 22, no. 19, pp. 2348-2355, 2006.
[17] R. Shen, D. Ghosh, A. Chinnaiyan, and Z. Meng, "Eigengene-Based Linear Discriminant Model for Tumor Classification Using Gene Expression Microarray Data," Bioinformatics, vol. 22, no. 21, pp. 2635-2642, 2006.
[18] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene Selection for Cancer Classification Using Support Vector Machines," Machine Learning, vol. 46, nos. 1-3, pp. 389-422, 2002.
[19] A. Rakotomamonjy, "Variable Selection Using SVM-Based Criteria," J. Machine Learning Research, vol. 3, pp. 1357-1370, 2003.
[20] X. Zhou and K.Z. Mao, "LS Bound Based Gene Selection for DNA Microarray Data," Bioinformatics, vol. 21, no. 8, pp. 1559-1564, 2005.
[21] H.H. Zhang, J. Ahn, X. Lin, and C. Park, "Gene Selection Using Support Vector Machines with Non-Convex Penalty," Bioinformatics, vol. 22, no. 1, pp. 88-95, 2006.
[22] Y. Tang, Y.-Q. Tang, and Z. Huang, "Development of Two-Stage SVM-RFE Gene Selection Strategy for Microarray Expression Data Analysis," IEEE/ACM Trans. Computational Biology and Bioinformatics, vol. 4, no. 3, pp. 365-381, July-Sept. 2007.
[23] U. Braga-Neto and E.R. Dougherty, "Bolstered Error Estimation," Pattern Recognition, vol. 37, no. 6, pp. 1267-1281, 2004.
[24] D.V. Nguyen and D.M. Rocke, "Tumor Classification by Partial Least Squares Using Microarray Gene Expression Data," Bioinformatics, vol. 18, no. 1, pp. 39-50, 2002.
[25] C. Ding and H. Peng, "Minimum Redundancy Feature Selection for Gene Expression Data," Proc. IEEE CS Bioinformatics Conf. (CSB '03), pp. 523-529, Aug. 2003.
[26] T. Golub, D. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. Mesirov, H. Coller, M. Loh, J. Downing, M. Caligiuri, C. Bloomfield, and E. Lander, "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring," Science, vol. 286, no. 5439, pp. 531-537, 1999.
[27] S. Dudoit, J. Fridlyand, and T.P. Speed, "Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data," J. Am. Statistical Assoc., vol. 97, pp. 77-87, 2002.
[28] C. Furlanello, M. Serafini, S. Merler, and G. Jurman, "Entropy-Based Gene Ranking without Selection Bias for the Predictive Classification of Microarray Data," BMC Bioinformatics, vol. 4, article no. 54, 2003.
[29] X. Liu, A. Krishnan, and A. Mondry, "Entropy-Based Gene Selection for Cancer Classification Using Microarray Data," BMC Bioinformatics, vol. 6, article no. 76, 2005.
[30] E.P. Xing, M.I. Jordan, and R.M. Karp, "Feature Selection for High-Dimensional Genomic Microarray Data," Proc. 18th Int'l Conf. Machine Learning, pp. 601-608, 2001.
[31] D. Huang and T.W. Chow, "Effective Gene Selection Method with Small Sample Sets Using Gradient-Based and Point Injection Techniques," IEEE/ACM Trans. Computational Biology and Bioinformatics, vol. 4, no. 3, pp. 467-475, July-Sept. 2007.
[32] C. Zhang, X. Lu, and X. Zhang, "Significance of Gene Ranking for Classification of Microarray Samples," IEEE/ACM Trans. Computational Biology and Bioinformatics, vol. 3, no. 3, pp. 312-320, July-Sept. 2006.
[33] S.C. Shah and A. Kusiak, "Data Mining and Genetic Algorithm Based Gene/SNP Selection," Artificial Intelligence in Medicine, vol. 31, no. 3, pp. 189-196, 2004.
[34] T. Jirapech-Umpai and S. Aitken, "Feature Selection and Classification for Microarray Data Analysis: Evolutionary Methods for Identifying Predictive Genes," BMC Bioinformatics, vol. 6, article no. 148, 2005.
[35] M. Dash and H. Liu, "Feature Selection for Classification," Intelligent Data Analysis, vol. 1, pp. 131-156, 1997.
[36] A.N. Tikhonov and V.Y. Arsenin, Solutions of Ill-Posed Problems. W.H. Winston, 1977.
[37] U. Alon, N. Barkai, D. Notterman, K. Gish, S. Ybarra, D. Mack, and A. Levine, "Broad Patterns of Gene Expression Revealed by Clustering Analysis of Tumor and Normal Colon Tissues Probed by Oligonucleotide Arrays," Proc. Nat'l Academy of Sciences USA, vol. 96, no. 12, pp. 6745-6750, 1999.
[38] G.J. Gordon, R.V. Jensen, L.-L. Hsiao, S.R. Gullans, J.E. Blumenstock, S. Ramaswamy, W.G. Richards, D.J. Sugarbaker, and R. Bueno, "Translation of Microarray Data into Clinically Relevant Cancer Diagnostic Tests Using Gene Expression Ratios in Lung Cancer and Mesothelioma," Cancer Research, vol. 62, pp. 4963-4967, 2002.
[39] M.A. Shipp, K.N. Ross, P. Tamayo, A.P. Weng, J.L. Kutok, R.C. Aguiar, M. Gaasenbeek, M. Angelo, M. Reich, G.S. Pinkus, T.S. Ray, M.A. Koval, K.W. Last, A. Norton, T.A. Lister, J. Mesirov, D.S. Neuberg, E.S. Lander, J.C. Aster, and T.R. Golub, "Diffuse Large B-Cell Lymphoma Outcome Prediction by Gene-Expression Profiling and Supervised Machine Learning," Nature Medicine, vol. 8, no. 1, pp. 68-74, 2002.
[40] B. Efron and R. Tibshirani, "Improvements on Cross-Validation: The .632+ Bootstrap Method," J. Am. Statistical Assoc., vol. 92, no. 438, pp. 548-560, 1997.
[41] V.N. Vapnik, Statistical Learning Theory. John Wiley and Sons, 1998.