Sparse Multiple Kernel Learning for Signal Processing Applications
May 2010 (vol. 32 no. 5)
pp. 788-798
Niranjan Subrahmanya, Purdue University, West Lafayette
Yung C. Shin, Purdue University, West Lafayette
In many signal processing applications, grouping features during model development and selecting a small number of relevant groups can improve the interpretability of the learned parameters. Although much work on this problem has been reported for linear models, multiple kernel learning has emerged in the last few years as a candidate for solving it in nonlinear models. Since all multiple kernel learning algorithms to date use convex primal problem formulations, the kernel weights selected by these algorithms are not strictly the sparsest possible solution. The main reason for using a convex primal formulation is that efficient implementations of kernel-based methods invariably rely on solving the dual problem. This work proposes the use of an additional log-based concave penalty term in the primal problem to induce sparsity in terms of groups of parameters. A generalized iterative learning algorithm, which can be used with a linear combination of this concave penalty term and other penalty terms, is given for model parameter estimation in the primal space. It is then shown that a natural extension of the method to nonlinear models using the "kernel trick" results in a new algorithm, called Sparse Multiple Kernel Learning (SMKL), which generalizes group-feature selection to kernel selection. SMKL is capable of exploiting existing efficient single-kernel algorithms while providing a sparser solution, in terms of the number of kernels used, than the existing multiple kernel learning framework. A number of signal processing examples based on the use of mass spectra for cancer detection, hyperspectral imagery for land cover classification, and NIR spectra from wheat, fescue grass, and diesel are given to highlight the ability of SMKL to achieve very high accuracy with very few kernels.
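To make the idea of a log-based concave group penalty concrete, the following is a minimal sketch for the linear, primal case only. It is not the SMKL algorithm from the paper, and it does not cover the kernel extension. It assumes a squared-error loss and a penalty of the form lam * sum_g log(||w_g||^2 + eps); this exact form, the function names `group_log_penalty_fit` and `group_norms`, and the parameters `lam`, `eps`, and `n_iter` are all illustrative assumptions. Each iteration replaces the concave penalty with its tangent-line upper bound, which reduces the step to a reweighted ridge regression.

```python
import numpy as np


def group_log_penalty_fit(X, y, groups, lam=1.0, eps=1e-6, n_iter=50):
    """Illustrative majorize-minimize loop (assumed objective, not the paper's):

        0.5 * ||y - X w||^2 + lam * sum_g log(||w_g||^2 + eps)

    log(u + eps) is concave in u = ||w_g||^2, so its tangent at the current
    iterate is an upper bound. Minimizing that bound is a ridge problem with
    one weight per group.
    """
    groups = np.asarray(groups)
    n_features = X.shape[1]
    gram = X.T @ X
    rhs = X.T @ y
    # Plain ridge start so that every group begins with a nonzero norm.
    w = np.linalg.solve(gram + lam * np.eye(n_features), rhs)
    for _ in range(n_iter):
        d = np.empty(n_features)
        for g in np.unique(groups):
            mask = groups == g
            # Slope of the tangent: small groups get a large ridge weight.
            d[mask] = 1.0 / (w[mask] @ w[mask] + eps)
        # Stationarity of the surrogate: (X^T X + 2 lam D) w = X^T y.
        w = np.linalg.solve(gram + 2.0 * lam * np.diag(d), rhs)
    return w


def group_norms(w, groups):
    """Euclidean norm of the coefficients in each group."""
    groups = np.asarray(groups)
    return {int(g): float(np.linalg.norm(w[groups == g])) for g in np.unique(groups)}


if __name__ == "__main__":
    # Synthetic data (hypothetical): three groups of two features each,
    # with only group 0 contributing to the response.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 6))
    y = X[:, :2] @ np.array([2.0, -1.5]) + 0.1 * rng.standard_normal(200)
    groups = [0, 0, 1, 1, 2, 2]
    w = group_log_penalty_fit(X, y, groups)
    print(group_norms(w, groups))
```

In this sketch, groups whose coefficients start small receive ever larger ridge weights and are driven toward zero, while groups with large coefficients are barely shrunk. That is the group-level sparsity behavior the abstract attributes to the concave penalty.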


Index Terms:
Composite kernel learning, feature group selection, heterogeneous data fusion, sensor selection.
Niranjan Subrahmanya, Yung C. Shin, "Sparse Multiple Kernel Learning for Signal Processing Applications," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 5, pp. 788-798, May 2010, doi:10.1109/TPAMI.2009.98