Issue No.01 - January-March (2010 vol.7)
pp: 100-107
Zhenqiu Liu, University of Maryland, Baltimore
Shili Lin, Ohio State University, Columbus
Ming T. Tan, University of Maryland, Baltimore
ABSTRACT
The development of high-throughput technology has generated massive amounts of high-dimensional data, much of it discrete. Robust and efficient learning algorithms such as LASSO [1] are required for feature selection and overfitting control. However, most feature selection algorithms apply only to continuous data. In this paper, we propose a novel method for sparse support vector machines (SVMs) with L_{p} (p < 1) regularization. We develop efficient algorithms (LpSVM) for learning a classifier that is applicable to high-dimensional data sets with both discrete and continuous data types. The regularization parameters are estimated by maximizing the area under the ROC curve (AUC) on the cross-validation data. Experimental results on protein sequence and SNP data attest to the accuracy, sparsity, and efficiency of the proposed algorithm. Biomarkers identified with our methods are compared with those from other methods in the literature. A Matlab software package is available upon request.
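The two quantities named in the abstract, an L_{p} penalty and AUC-based parameter selection, can be illustrated with a minimal Python sketch. This is not the authors' LpSVM code (which is in Matlab and not reproduced here). The penalty form sum |w_j|^p, the pairwise AUC definition, and the helper names lp_penalty, auc, and select_by_auc are all assumptions made for illustration.

```python
def lp_penalty(weights, p=0.5):
    """Assumed penalty form: sum of |w_j|^p with 0 < p <= 1."""
    if not 0 < p <= 1:
        raise ValueError("p must lie in (0, 1]")
    return sum(abs(w) ** p for w in weights)


def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Labels equal to 1 are positives; everything else is negative.
    Tied scores count as half a correct pair.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y != 1]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if sp > sn else 0.5 if sp == sn else 0.0
        for sp in pos
        for sn in neg
    )
    return wins / (len(pos) * len(neg))


def select_by_auc(candidates, cv_scores_for, labels):
    """Pick the candidate whose cross-validation scores give the highest AUC.

    cv_scores_for is a hypothetical callable mapping a regularization
    parameter to the classifier scores obtained on held-out folds.
    """
    return max(candidates, key=lambda lam: auc(cv_scores_for(lam), labels))
```

In practice cv_scores_for would train the classifier on each training fold and score the held-out fold; here it is left as a placeholder so the selection rule itself stays visible.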
INDEX TERMS
Embedded method, feature selection, L_{p} regularization, SVM, SNP data analysis, protease data analysis.
CITATION
Zhenqiu Liu, Shili Lin, Ming T. Tan, "Sparse Support Vector Machines with L_{p} Penalty for Biomarker Identification", IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol.7, no. 1, pp. 100-107, January-March 2010, doi:10.1109/TCBB.2008.17
REFERENCES
[1] R. Tibshirani, "Regression Shrinkage and Selection via the Lasso," J. Royal Statistical Soc. B, vol. 58, no. 1, pp. 267-288, 1996.
[2] T. Bo and I. Jonassen, "New Feature Subset Selection Procedures for Classification of Expression Profiles," Genome Biology, vol. 3, no. 4, p. 0017, 2002.
[3] M.L. Chow, E.J. Moler, and I.S. Mian, "Identifying Marker Genes in Transcription Profiling Data Using a Mixture of Feature Relevance Experts," Physiological Genomics, vol. 5, pp. 99-111, Mar. 2001.
[4] M.K. Kerr, M. Martin, and G.A. Churchill, "Analysis of Variance for Gene Expression Microarray Data," J. Computational Biology, vol. 7, pp. 819-837, 2000.
[5] A. Long, H. Mangalam, B. Chan, L. Tolleri, G. Hatfield, and P. Baldi, "Improved Statistical Inference from DNA Microarray Data Using Analysis of Variance and a Bayesian Statistical Framework," J. Biological Chemistry, vol. 276, pp. 19937-19944, 2001.
[6] M.A. Newton, C.M. Kendziorski, C.S. Richmond, F.R. Blattner, and K.W. Tsui, "On Differential Variability of Expression Ratios: Improving Statistical Inference About Gene Expression Changes from Microarray Data," J. Computational Biology, vol. 8, no. 1, pp. 37-52, 2001.
[7] J. Yu and X.W. Chen, "Bayesian Neural Network Approaches to Ovarian Cancer Identification from High-Resolution Mass Spectrometry Data," Bioinformatics, vol. 21, no. suppl-1, pp. i487-i494, 2005.
[8] P. Pavlidis and W.S. Noble, "Analysis of Strain and Regional Variation in Gene Expression in Mouse Brain," Genome Biology, vol. 2, no. 10, research 0042.1-0042.15, 2001.
[9] R. Kohavi and G.H. John, "The Wrapper Approach," Feature Selection for Knowledge Discovery and Data Mining, H. Liu and H. Motoda, eds., pp. 33-50, Kluwer Academic Publishers, 1998.
[10] G. Monari and G. Dreyfus, "Withdrawing an Example from the Training Set: An Analytic Estimation of Its Effect on a Nonlinear Parameterized Model," Neurocomputing Letters, vol. 35, pp. 195-201, 2000.
[11] B. Inza, R.S. Blanco, and P.L. Naga, "Gene Selection by Sequential Search Wrapper Approaches in Microarray Cancer Class Prediction," J. Intelligent and Fuzzy Systems, 2002.
[12] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene Selection for Cancer Classification Using Support Vector Machines," Machine Learning, vol. 46, pp. 389-422, 2002.
[13] X. Zhou and D.P. Tuck, "MSVM-RFE: Extensions of SVM-RFE for Multiclass Gene Selection on DNA Microarray Data," Bioinformatics, vol. 23, no. 9, pp. 1106-1114, 2007.
[14] I. Rivals and L. Personnaz, "MLPs (Mono-Layer Polynomials and Multi-Layer Perceptrons) for Nonlinear Modeling," J. Machine Learning Research, vol. 3, pp. 1383-1398, 2003.
[15] R. Tibshirani, "The Lasso Method for Variable Selection in the Cox Model," Statistics in Medicine, vol. 16, pp. 385-395, 1997.
[16] V. Vapnik, The Nature of Statistical Learning Theory. Springer, 1995.
[17] M.E. Tipping, "Sparse Bayesian Learning and the Relevance Vector Machine," J. Machine Learning Research, vol. 1, pp. 211-244, 2001.
[18] P.S. Bradley and O.L. Mangasarian, "Feature Selection via Concave Minimization and Support Vector Machines," Proc. 15th Int'l Conf. Machine Learning (ICML '98), pp. 82-90, 1998.
[19] H.H. Zhang, J. Ahn, X. Lin, and C. Park, "Gene Selection Using Support Vector Machines with Non-Convex Penalty," Bioinformatics, vol. 22, no. 1, pp. 88-95, 2006.
[20] J. Fan and R. Li, "Variable Selection via Penalized Likelihood," J. Am. Statistical Assoc., vol. 96, pp. 1348-1360, 2001.
[21] Z. Liu, F. Jiang, G.L. Tian, S. Wang, F. Sato, S.J. Meltzer, and M. Tan, "Sparse Logistic Regression with Lp Penalty for Biomarker Identification," Statistical Applications in Genetics and Molecular Biology, vol. 6, no. 1, p. 2, 2007.
[22] K. Knight and W.J. Fu, "Asymptotics for Lasso-Type Estimators," Annals of Statistics, vol. 28, pp. 1356-1378, 2000.
[23] D.M. Malioutov, M. Cetin, and A.S. Willsky, "A Sparse Signal Reconstruction Perspective for Source Localization with Sensor Arrays," IEEE Trans. Signal Processing, vol. 53, no. 8, pp. 3010-3022, 2005.
[24] A.P. Bradley, "The Use of the Area under the ROC Curve in the Evaluation of Machine Learning Algorithms," Pattern Recognition, vol. 30, pp. 1145-1159, 1997.
[25] X. Ling, J. Huang, and H. Zhang, "AUC: A Statistically Consistent and More Discriminating Measure Than Accuracy," Proc. 18th Int'l Joint Conf. Artificial Intelligence (IJCAI), 2003.
[26] J.D.R. Rennie and N. Srebro, "Loss Functions for Preference Levels: Regression with Discrete Ordered Labels," Proc. Int'l Joint Conf. Artificial Intelligence Multidisciplinary Workshop Advances in Preference Handling, 2005.
[27] C.M. Bishop, Neural Networks for Pattern Recognition. Oxford Univ. Press, 1995.
[28] T. Fawcett, ROC Graphs: Notes and Practical Considerations for Researchers, technical report, HP Laboratories, Palo Alto, CA, 2004.
[29] A. Rakotomamonjy, "Optimizing AUC with Support Vector Machine (SVM)," Proc. First Workshop ROC Analysis in Artificial Intelligence (ROCAI), 2004.
[30] M.R. Segal, K.D. Dahlquist, and B.R. Conklin, "Regression Approaches for Microarray Data Analysis," J. Computational Biology, vol. 10, pp. 961-980, 2003.
[31] T. Rognvaldsson and L. You, "Why Neural Networks Should Not Be Used for HIV-1 Protease Cleavage Site Prediction," Bioinformatics, vol. 20, no. 11, pp. 1702-1709, 2004.
[32] Z.Q. Beck, L. Hervio, P.E. Dawson, and J.E. Elder, "Identification of Efficiently Cleaved Substrates for HIV-1 Protease Using a Phage Display Library and Use in Inhibitor Development," Virology, vol. 274, pp. 391-401, 2000.
[33] Z.Q. Beck, Y.C. Lin, and J.E. Elder, "Molecular Basis for the Relative Substrate Specificity of Human Immunodeficiency Virus Type 1 and Feline Immunodeficiency Virus Proteases," J. Virology, vol. 75, pp. 9458-9469, 2001.
[34] J. Tozser, G. Zahuczky, P. Bagossi, J.M. Louis, T.D. Copeland, S. Oroszlan, R.W. Harrison, and I.T. Weber, "Comparison of the Substrate Specificity of the Human T-Cell Leukemia Virus and Human Immunodeficiency Virus Proteinases," European J. Biochemistry, vol. 267, pp. 6287-6295, 2000.
[35] T. Mailund, S. Besenbacher, and M.H. Schierup, "Whole Genome Association Mapping by Incompatibilities and Local Perfect Phylogenies," BMC Bioinformatics, vol. 7, p. 454, 2006.
[36] C.J. Verzilli, N. Stallard, and J.C. Whittaker, "Bayesian Graphical Models for Genomewide Association Studies," Am. J. Human Genetics, vol. 79, pp. 100-112, 2006.