Issue No. 06 - Nov.-Dec. (2012 vol. 9)
pp: 1649-1662
Meng-Yun Wu, Dept. of Math., Sun Yat-Sen Univ., Guangzhou, China
Dao-Qing Dai, Dept. of Math., Sun Yat-Sen Univ., Guangzhou, China
Yu Shi, Sch. of Math. & Stat., Zhengzhou Normal Univ., Zhengzhou, China
Hong Yan, Dept. of Electron. Eng., City Univ. of Hong Kong, Kowloon, China
Xiao-Fei Zhang, Dept. of Math., Sun Yat-Sen Univ., Guangzhou, China
ABSTRACT
Biomarker identification and cancer classification are two closely related problems. In gene expression data sets, the correlation between genes can be high when they share the same biological pathway. Moreover, gene expression data sets may contain outliers for either chemical or electrical reasons. A good gene selection method should therefore take group effects into account and be robust to outliers. In this paper, we propose a Laplace naive Bayes model with mean shrinkage (LNB-MS). The Laplace distribution, rather than the normal distribution, is used as the conditional distribution of the samples because it is less sensitive to outliers and has been applied in many fields. The key technique is the L_1 penalty imposed on the mean of each class to achieve automatic feature selection. The objective function of the proposed model is piecewise linear with respect to the mean of each class, so its optimal value can be found simply by evaluating it at the breakpoints. An efficient algorithm is designed to estimate the parameters of the model. A new strategy that uses the number of selected features to control the regularization parameter is introduced. Experimental results on simulated data sets and 17 publicly available cancer data sets attest to the accuracy, sparsity, efficiency, and robustness of the proposed algorithm. Many biomarkers identified with our method have been verified in biochemical or biomedical research. An analysis of the biological and functional correlation of the genes, based on Gene Ontology (GO) terms, shows that the proposed method guarantees that highly correlated genes are selected simultaneously.
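The breakpoint idea in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact objective, so the form below is an assumption: for one feature in one class, f(mu) = sum_i |x_i - mu| + lam * |mu|, with the feature assumed to be centered so that shrinking the class mean to zero removes the feature. The function name `shrunken_laplace_mean` and the parameter `lam` are hypothetical and not taken from the paper.

```python
def shrunken_laplace_mean(x, lam):
    """Minimize f(mu) = sum_i |x_i - mu| + lam * |mu| over mu.

    Assumed objective (not taken from the paper): an absolute-deviation
    data term, as a Laplace likelihood would give, plus an L_1 penalty.
    f is convex and piecewise linear, so a minimizer always lies at a
    breakpoint: one of the samples x_i or the point 0.
    """
    def objective(mu):
        return sum(abs(xi - mu) for xi in x) + lam * abs(mu)

    # 0.0 is listed first so that ties are resolved toward the sparse solution.
    breakpoints = [0.0] + list(x)
    return min(breakpoints, key=objective)


if __name__ == "__main__":
    samples = [1.0, 2.0, 3.0]
    print(shrunken_laplace_mean(samples, 0.0))  # 2.0 (plain median, no penalty)
    print(shrunken_laplace_mean(samples, 2.0))  # 1.0 (pulled toward zero)
    print(shrunken_laplace_mean(samples, 5.0))  # 0.0 (feature is shrunk away)
```

Under this assumed form, any `lam` at least as large as the number of samples forces the estimate to zero, which suggests how a larger penalty could yield fewer selected features. The paper's actual rule linking the regularization parameter to the number of selected features is not described in the abstract.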
INDEX TERMS
Cancer, Gene expression, Support vector machines, Biological system modeling, Computational modeling, Gene expression data analysis, Biomarker identification, Cancer classification, Laplace distribution, L_1 penalty
CITATION
Meng-Yun Wu, Dao-Qing Dai, Yu Shi, Hong Yan, Xiao-Fei Zhang, "Biomarker Identification and Cancer Classification Based on Microarray Data Using Laplace Naive Bayes Model with Mean Shrinkage", IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 9, no. 6, pp. 1649-1662, Nov.-Dec. 2012, doi:10.1109/TCBB.2012.105