Issue no. 8, August 2009 (vol. 31), pp. 1347-1361
Steve R. Gunn , University of Southampton, Southampton
John Shawe-Taylor , University College London, London
The presence of irrelevant features in training data is a significant obstacle for many machine learning tasks. One approach to this problem is to extract appropriate features, and the choice of feature extraction method is often tied to the inference algorithm. Here, we formalize a general framework for feature extraction, based on Partial Least Squares, in which projection directions are computed according to a user-defined criterion. The framework draws together a number of existing results and provides additional insights into several popular feature extraction methods. Two new sparse kernel feature extraction methods are derived under the framework: Sparse Maximal Alignment (SMA) and Sparse Maximal Covariance (SMC). Key advantages of these approaches include a simple implementation and a training time that scales linearly with the number of examples. Furthermore, a new test example can be projected using only k kernel evaluations, where k is the output dimensionality. Computational results on several real-world data sets show that SMA and SMC extract features that are as predictive as those found using other popular feature extraction methods. Additionally, on large text retrieval and face detection data sets, they produce features that match the performance of the original ones in conjunction with a Support Vector Machine.
Machine learning, kernel methods, feature extraction, partial least squares (PLS).
Steve R. Gunn, John Shawe-Taylor, "Efficient Sparse Kernel Feature Extraction Based on Partial Least Squares", IEEE Transactions on Pattern Analysis & Machine Intelligence, vol.31, no. 8, pp. 1347-1361, August 2009, doi:10.1109/TPAMI.2008.171
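The abstract's claim that a new test example can be projected with only k kernel evaluations follows from sparsity: each of the k projection directions is anchored on a single training example. The sketch below is a heavily simplified, hypothetical illustration of that greedy sparse-covariance idea, not the paper's actual SMC algorithm (in particular, real SMC also deflates the kernel matrix to keep directions orthogonal; here only the labels are deflated).

```python
import numpy as np

def rbf_kernel(X, Z, gamma=1.0):
    # Gaussian kernel between the rows of X and the rows of Z.
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def sparse_cov_fit(K, y, k):
    """Greedy, simplified sparse-covariance feature extraction.

    At each stage, pick the training example whose kernel column has
    maximal absolute covariance with the (deflated) labels, then
    deflate the labels against the chosen direction.
    """
    idx = []
    r = y.astype(float).copy()
    for _ in range(k):
        cov = K.T @ (r - r.mean())          # covariance proxy per kernel column
        j = int(np.argmax(np.abs(cov)))     # index of the chosen pivot example
        idx.append(j)
        u = K[:, j]
        r = r - u * (u @ r) / (u @ u)       # deflate labels against direction u
    return idx

def sparse_project(X_train, idx, x_new, gamma=1.0):
    # Projecting a new example costs only k = len(idx) kernel evaluations,
    # one against each selected pivot example.
    return rbf_kernel(x_new[None, :], X_train[idx], gamma).ravel()
```

For example, fitting with k = 4 on 30 training points selects 4 pivot examples, and projecting any new point then evaluates the kernel only against those 4 pivots, regardless of the training set size.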