A Bayesian Approach to Joint Feature Selection and Classifier Design
September 2004 (vol. 26 no. 9)
pp. 1105-1111
This paper adopts a Bayesian approach to simultaneously learn both an optimal nonlinear classifier and a subset of predictor variables (or features) that are most relevant to the classification task. The approach uses heavy-tailed priors to promote sparsity in the utilization of both basis functions and features; these priors act as regularizers for the likelihood function that rewards good classification on the training data. We derive an expectation-maximization (EM) algorithm to efficiently compute a maximum a posteriori (MAP) point estimate of the various parameters. The algorithm is an extension of recent state-of-the-art sparse Bayesian classifiers, which in turn can be seen as Bayesian counterparts of support vector machines. Experimental comparisons using kernel classifiers demonstrate both parsimonious feature selection and excellent classification accuracy on a range of synthetic and benchmark data sets.
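
As a rough illustration of the kind of EM iteration the abstract describes, the sketch below fits a sparse probit classifier with a Laplacian (heavy-tailed) prior on the weights. It is a simplified sketch under stated assumptions, not the paper's algorithm: it covers only sparsity in the weights of a fixed design matrix and omits the joint feature selection. The function name sparse_probit_em, the penalty parameter lam, and the least-squares initialization are hypothetical choices made for this example.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def norm_pdf(x):
    """Standard normal density."""
    return np.exp(-0.5 * x ** 2) / math.sqrt(2.0 * math.pi)

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + _erf(x / math.sqrt(2.0)))

def sparse_probit_em(H, y, lam=2.0, n_iter=200):
    """MAP estimate of probit weights under a prior p(w_i) ~ exp(-lam * |w_i|).

    H: (n, k) design matrix (raw features or kernel basis functions).
    y: (n,) labels in {-1, +1}.
    Assumption: this is an illustrative simplification, not the paper's method.
    """
    n, k = H.shape
    y = y.astype(float)
    # Hypothetical initialization: ordinary least squares on the labels.
    w = np.linalg.lstsq(H, y, rcond=None)[0]
    for _ in range(n_iter):
        h = H @ w
        # E-step, part 1: expected latent variable z (mean of a Gaussian
        # with mean h truncated to the side indicated by the label).
        cdf = np.clip(norm_cdf(y * h), 1e-12, None)
        z = h + y * norm_pdf(h) / cdf
        # E-step, part 2: expected inverse prior variance is lam / |w_i|;
        # working with u_i = sqrt(|w_i| / lam) avoids dividing by zero.
        u = np.sqrt(np.abs(w) / lam)
        # M-step: w = U (I + U H^T H U)^{-1} U H^T z, with U = diag(u).
        HU = H * u
        A = np.eye(k) + HU.T @ HU
        w = u * np.linalg.solve(A, HU.T @ z)
    return w
```

Calling sparse_probit_em(H, y) on data where only a few columns of H carry signal should shrink the weights of the remaining columns toward zero; new points are then classified by the sign of H @ w. A weight that reaches exactly zero stays at zero in later iterations, which is how this form of update prunes unused columns.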

Index Terms:
Pattern recognition, statistical learning, feature selection, sparsity, support vector machines, relevance vector machines, sparse probit regression, automatic relevance determination, EM algorithm.
Balaji Krishnapuram, Alexander J. Hartemink, Lawrence Carin, Mário A.T. Figueiredo, "A Bayesian Approach to Joint Feature Selection and Classifier Design," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 9, pp. 1105-1111, Sept. 2004, doi:10.1109/TPAMI.2004.55