Issue No.01 - Jan.-Feb. (2013 vol.10)
pp: 87-97
Jagath C. Rajapakse, Sch. of Comput. Eng., Nanyang Technol. Univ., Singapore, Singapore
Piyushkumar A. Mundra, Bioinf. Res. Center, Nanyang Technol. Univ., Singapore, Singapore
Filter methods are often used to select genes for multiclass sample classification from microarray data. Such techniques tend to be biased toward a few classes that are easily distinguishable from the others, owing to imbalances in strong features and in sample sizes across classes. This bias can lead to the selection of redundant genes while relevant genes are missed, resulting in poor classification of tissue samples. In this manuscript, we propose to decompose multiclass ranking statistics into class-specific statistics and then use Pareto-front analysis to select genes. This alleviates the bias induced by the class-intrinsic characteristics of dominating classes. The use of Pareto-front analysis is demonstrated on two filter criteria commonly used for gene selection: F-score and KW-score. Experiments with both synthetic and real benchmark data sets showed a significant improvement in classification performance and a reduction in redundancy among top-ranked genes.
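The idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact decomposition, so the sketch assumes that each class-specific statistic is that class's between-class term of a one-way ANOVA F-like score, divided by the pooled within-class variance. The function names `class_specific_scores` and `pareto_front` are hypothetical, and the KW-score variant is not covered.

```python
import numpy as np

def class_specific_scores(X, y):
    """Return a (genes x classes) matrix of per-class F-like terms.

    Assumed decomposition (not taken from the paper): for gene g and class k,
    n_k * (mean_gk - grand_mean_g)^2 / pooled within-class variance of g.
    X is (samples x genes); y holds one class label per sample.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    n_samples, n_genes = X.shape
    grand_mean = X.mean(axis=0)
    within_ss = np.zeros(n_genes)
    between_terms = []
    for c in classes:
        Xc = X[y == c]
        mean_c = Xc.mean(axis=0)
        within_ss += ((Xc - mean_c) ** 2).sum(axis=0)
        between_terms.append(len(Xc) * (mean_c - grand_mean) ** 2)
    pooled_var = within_ss / (n_samples - len(classes))
    # Small constant guards against division by zero for constant genes.
    return np.stack(between_terms, axis=1) / (pooled_var[:, None] + 1e-12)

def pareto_front(scores):
    """Boolean mask of non-dominated rows; larger is better in every column."""
    scores = np.asarray(scores, dtype=float)
    on_front = np.ones(scores.shape[0], dtype=bool)
    for i in range(scores.shape[0]):
        # Row j dominates row i if it is >= in all columns and > in at least one.
        dominates_i = np.all(scores >= scores[i], axis=1) & np.any(scores > scores[i], axis=1)
        on_front[i] = not dominates_i.any()
    return on_front

# Usage on random data: 30 samples, 50 genes, 3 classes of unequal size.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 50))
y_demo = np.repeat([0, 1, 2], [15, 10, 5])
front_mask = pareto_front(class_specific_scores(X_demo, y_demo))
selected_genes = np.flatnonzero(front_mask)
```

Under this assumption, a gene that separates only one class well can still reach the first front, whereas a single aggregated score would rank it below genes that favour the dominating classes. Repeatedly removing the current front and recomputing it would yield successive fronts for a full ranking.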
Gene expression, Bioinformatics, Computational biology, Redundancy, Cancer, Training, Benchmark testing, Pareto-front analysis, Aggregation statistics, filter methods, gene selection, multiobjective evolutionary optimization
Jagath C. Rajapakse, Piyushkumar A. Mundra, "Multiclass Gene Selection Using Pareto-Fronts", IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol.10, no. 1, pp. 87-97, Jan.-Feb. 2013, doi:10.1109/TCBB.2013.1