A Top-r Feature Selection Algorithm for Microarray Gene Expression Data
May-June 2012 (vol. 9 no. 3)
pp. 754-764
S. Miyano, Lab. of DNA Inf. Anal., Univ. of Tokyo, Tokyo, Japan
S. Imoto, Lab. of DNA Inf. Anal., Univ. of Tokyo, Tokyo, Japan
A. Sharma, Lab. of DNA Inf. Anal., Univ. of Tokyo, Tokyo, Japan
Most of the conventional feature selection algorithms have a drawback whereby a weakly ranked gene that could perform well in terms of classification accuracy with an appropriate subset of genes will be left out of the selection. Considering this shortcoming, we propose a feature selection algorithm in gene expression data analysis of sample classifications. The proposed algorithm first divides genes into subsets, the sizes of which are relatively small (roughly of size h), then selects informative smaller subsets of genes (of size r < h) from a subset and merges the chosen genes with another gene subset (of size r) to update the gene subset. We repeat this process until all subsets are merged into one informative subset. We illustrate the effectiveness of the proposed algorithm by analyzing three distinct gene expression data sets. Our method shows promising classification accuracy for all the test data sets. We also show the relevance of the selected genes in terms of their biological functions.
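
The divide, select, and merge procedure described in the abstract can be sketched roughly as follows. This is only one possible reading of the abstract, not the authors' implementation. The function names (`select_r`, `top_r_merge`), the greedy forward search, and the caller-supplied `score` function are all assumptions made for illustration; the abstract does not say how a subset is evaluated or searched.

```python
from typing import Callable, List, Sequence

# Hypothetical type: a function that rates a whole gene subset
# (for example, by cross-validated accuracy). Not specified in the abstract.
SubsetScore = Callable[[List[str]], float]


def select_r(candidates: Sequence[str], r: int, score: SubsetScore) -> List[str]:
    """Pick up to r genes by greedy forward selection on a subset-level score.

    Assumption: greedy search stands in for whatever subset search the
    paper actually uses.
    """
    chosen: List[str] = []
    remaining = list(candidates)
    while remaining and len(chosen) < r:
        # Add the gene that most improves the score of the current subset.
        best = max(remaining, key=lambda g: score(chosen + [g]))
        chosen.append(best)
        remaining.remove(best)
    return chosen


def top_r_merge(genes: Sequence[str], h: int, r: int, score: SubsetScore) -> List[str]:
    """Divide genes into blocks of about h, keep r per block, then merge."""
    if not 0 < r < h:
        raise ValueError("expected 0 < r < h")
    # Step 1: divide the genes into small subsets of roughly size h.
    blocks = [list(genes[i:i + h]) for i in range(0, len(genes), h)]
    if not blocks:
        return []
    # Step 2: select an informative subset of size r from each block.
    reduced = [select_r(block, r, score) for block in blocks]
    # Step 3: merge the chosen genes with the next size-r subset and
    # reselect r genes, until everything is merged into one subset.
    current = reduced[0]
    for nxt in reduced[1:]:
        current = select_r(current + nxt, r, score)
    return current
```

Because `score` rates a whole subset rather than a single gene, a gene that ranks poorly on its own can still survive a round if it improves the subset it is added to, which is the shortcoming the abstract sets out to address.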

Index Terms:
set theory, bioinformatics, data analysis, feature extraction, genetics, lab-on-a-chip, biological functions, top-r feature selection algorithm, microarray gene expression data, classification accuracy, gene expression data analysis, gene subset, informative subset, accuracy, gene expression, classification algorithms, cancer, algorithm design and analysis, DNA microarray gene expression data, feature selection, top-r features
S. Miyano, S. Imoto, A. Sharma, "A Top-r Feature Selection Algorithm for Microarray Gene Expression Data," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 9, no. 3, pp. 754-764, May-June 2012, doi:10.1109/TCBB.2011.151