Sparse Multinomial Logistic Regression: Fast Algorithms and Generalization Bounds
June 2005 (vol. 27 no. 6)
pp. 957-968
Recently developed methods for learning sparse classifiers are among the state-of-the-art in supervised learning. These methods learn classifiers that incorporate weighted sums of basis functions with sparsity-promoting priors encouraging the weight estimates to be either significantly large or exactly zero. From a learning-theoretic perspective, these methods control the capacity of the learned classifier by minimizing the number of basis functions used, resulting in better generalization. This paper presents three contributions related to learning sparse classifiers. First, we introduce a true multiclass formulation based on multinomial logistic regression. Second, by combining a bound optimization approach with a component-wise update procedure, we derive fast exact algorithms for learning sparse multiclass classifiers that scale favorably in both the number of training samples and the feature dimensionality, making them applicable even to large data sets in high-dimensional feature spaces. To the best of our knowledge, these are the first algorithms to perform exact multinomial logistic regression with a sparsity-promoting prior. Third, we show how nontrivial generalization bounds can be derived for our classifier in the binary case. Experimental results on standard benchmark data sets attest to the accuracy, sparsity, and efficiency of the proposed methods.
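
To make the bound-optimization idea concrete, below is a minimal, illustrative Python sketch of a component-wise update for l1-penalized multinomial logistic regression. It is not the authors' implementation: the function name smlr_sketch, the overparameterized softmax (no reference class), the penalty weight lam, and the fixed number of sweeps are all assumptions made for illustration; only the soft-threshold step driven by a Böhning-style curvature bound reflects the approach described in the abstract.

import numpy as np

def smlr_sketch(X, Y, lam=1.0, n_sweeps=50):
    """Illustrative sketch (not the paper's exact algorithm).
    X: (n, d) design matrix; Y: (n, m) one-hot labels; lam: l1 penalty weight."""
    n, d = X.shape
    m = Y.shape[1]
    W = np.zeros((d, m))                        # one weight vector per class
    # Diagonal of a Bohning-style curvature bound on the log-likelihood Hessian:
    # 0.5 * (1 - 1/m) * sum_i x_ij^2 (precomputed once; tiny constant guards
    # against all-zero feature columns).
    curv = 0.5 * (1.0 - 1.0 / m) * (X ** 2).sum(axis=0) + 1e-12
    for _ in range(n_sweeps):
        for j in range(d):
            for c in range(m):
                # Softmax class probabilities under the current weights.
                Z = X @ W
                Z -= Z.max(axis=1, keepdims=True)   # numerical stability
                P = np.exp(Z)
                P /= P.sum(axis=1, keepdims=True)
                # Gradient of the log-likelihood w.r.t. the single weight W[j, c].
                g = X[:, j] @ (Y[:, c] - P[:, c])
                # Maximize the quadratic surrogate plus the l1 prior along this
                # coordinate: a soft-threshold (shrink-to-zero) update.
                u = W[j, c] + g / curv[j]
                W[j, c] = np.sign(u) * max(abs(u) - lam / curv[j], 0.0)
    return W

For example, smlr_sketch(X, Y, lam=0.1) returns a (d, m) weight matrix in which many entries are driven exactly to zero by the soft-threshold step; this hard zeroing is what produces the sparsity (and the capacity control) discussed above, while the precomputed curvature bound is what keeps each component update cheap.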


Index Terms:
Supervised learning, classification, sparsity, Bayesian inference, multinomial logistic regression, bound optimization, expectation maximization (EM), learning theory, generalization bounds.
Citation:
Balaji Krishnapuram, Lawrence Carin, Mário A.T. Figueiredo, Alexander J. Hartemink, "Sparse Multinomial Logistic Regression: Fast Algorithms and Generalization Bounds," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 6, pp. 957-968, June 2005, doi:10.1109/TPAMI.2005.127