L₂ Kernel Classification
October 2010 (vol. 32, no. 10)
pp. 1822-1831
JooSeuk Kim, University of Michigan, Ann Arbor
Clayton D. Scott, University of Michigan, Ann Arbor
Nonparametric kernel methods are widely used and have proven successful in many statistical learning problems. Well-known examples include the kernel density estimate (KDE) for density estimation and the support vector machine (SVM) for classification. We propose a kernel classifier that optimizes the L₂ or integrated squared error (ISE) of a “difference of densities.” We focus on the Gaussian kernel, although the method applies to other kernels suitable for density estimation. Like the SVM, the classifier is sparse and results from solving a quadratic program. We provide statistical performance guarantees for the proposed L₂ kernel classifier in the form of a finite sample oracle inequality and strong consistency in the sense of both ISE and probability of error. A special case of our analysis applies to a previously introduced ISE-based method for kernel density estimation. For dimensionality greater than 15, the basic L₂ kernel classifier performs poorly in practice. Thus, we extend the method through the introduction of a natural regularization parameter, which allows it to remain competitive with the SVM in high dimensions. Simulation results for both synthetic and real-world data are presented.
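To make the abstract's criterion concrete, the following sketch uses notation assumed for illustration rather than taken from the paper itself. Write $d(x) = f_+(x) - f_-(x)$ for the difference of the class-conditional densities, and let $\hat{d}_{\alpha}(x) = \sum_i \alpha_i y_i k_\sigma(x, x_i)$ be a weighted Gaussian kernel expansion over the training points $(x_i, y_i)$ with $y_i \in \{+1, -1\}$. An ISE-based classifier of this kind selects the weights by minimizing

$\mathrm{ISE}(\alpha) = \int \big( \hat{d}_{\alpha}(x) - d(x) \big)^2 \, dx = \alpha^\top H \alpha - 2\, c^\top \alpha + \mathrm{const},$

where $H_{ij} = y_i y_j \int k_\sigma(x, x_i)\, k_\sigma(x, x_j)\, dx$ (for Gaussian kernels this integral is itself a Gaussian with bandwidth $\sigma\sqrt{2}$) and $c_i = y_i \int k_\sigma(x, x_i)\, d(x)\, dx$, which is unknown but can be replaced by class-wise sample averages. Since the constant does not involve $\alpha$, the minimization is a quadratic program, and nonnegativity and normalization constraints on the weights produce the sparsity noted above; a point $x$ is then labeled $\mathrm{sign}(\hat{d}_{\alpha}(x))$.

A minimal numerical sketch of this construction follows (NumPy/SciPy; the function names, the plug-in estimate of $c$, and the per-class normalization constraints are illustrative assumptions, and a generic solver stands in for the SMO-style solver mentioned in the index terms):

import numpy as np
from scipy.optimize import minimize

def gauss_kernel(A, B, sigma):
    # Isotropic Gaussian kernel matrix:
    # (2*pi*sigma^2)^(-d/2) * exp(-||a - b||^2 / (2*sigma^2)).
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    d = A.shape[1]
    return np.exp(-d2 / (2 * sigma ** 2)) / (2 * np.pi * sigma ** 2) ** (d / 2)

def fit_l2_kernel_classifier(X, y, sigma):
    # Quadratic term: the integral of k_sigma(., x_i) * k_sigma(., x_j)
    # is a Gaussian with bandwidth sigma * sqrt(2) (convolution identity).
    H = np.outer(y, y) * gauss_kernel(X, X, sigma * np.sqrt(2))
    # Linear term: plug-in estimate of the integral of k_sigma(., x_i)
    # against f+ - f-, via class-wise sample averages (leave-one-out
    # refinements omitted for brevity).
    K = gauss_kernel(X, X, sigma)
    pos, neg = y == 1, y == -1
    c = y * (K[:, pos].mean(axis=1) - K[:, neg].mean(axis=1))
    # Minimize the ISE-derived quadratic over nonnegative weights that
    # sum to one within each class (an assumed normalization).
    a0 = np.where(pos, 1.0 / pos.sum(), 1.0 / neg.sum())
    res = minimize(
        lambda a: a @ H @ a - 2.0 * c @ a,
        a0,
        method="SLSQP",
        bounds=[(0.0, None)] * len(y),
        constraints=[
            {"type": "eq", "fun": lambda a: a[pos].sum() - 1.0},
            {"type": "eq", "fun": lambda a: a[neg].sum() - 1.0},
        ],
    )
    return res.x

def predict(X_test, X, y, alpha, sigma):
    # Label by the sign of the estimated difference of densities.
    return np.sign(gauss_kernel(X_test, X, sigma) @ (alpha * y))

Many of the fitted weights land on the zero bound, which is the sparsity the abstract refers to; the regularization parameter introduced for high-dimensional problems is not modeled in this sketch.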

References:
[1] B. Schölkopf and A.J. Smola, Learning with Kernels. MIT Press, 2002.
[2] C. Cortes and V. Vapnik, "Support-Vector Networks," Machine Learning, vol. 20, no. 3, pp. 273-297, 1995.
[3] D. Kim, "Least Squares Mixture Decomposition Estimation," unpublished doctoral dissertation, Dept. of Statistics, Virginia Polytechnic Inst. and State Univ., 1995.
[4] M. Girolami and C. He, "Probability Density Estimation from Optimally Condensed Data Samples," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 25, no. 10, pp. 1253-1264, Oct. 2003.
[5] B.A. Turlach, "Bandwidth Selection in Kernel Density Estimation: A Review," Technical Report 9317, C.O.R.E. and Inst. de Statistique, Université Catholique de Louvain, 1993.
[6] D.W. Scott, "Parametric Statistical Modeling by Minimum Integrated Square Error," Technometrics, vol. 43, pp. 274-285, 2001.
[7] F. Bunea, A.B. Tsybakov, and M.H. Wegkamp, "Sparse Density Estimation with $l_1$ Penalties," Proc. 20th Ann. Conf. Learning Theory, pp. 530-543, 2007.
[8] P.H. Rigollet and A.B. Tsybakov, "Linear and Convex Aggregation of Density Estimators," preprint ccsd-00068216, 2004.
[9] R. Jenssen, D. Erdogmus, J.C. Principe, and T. Eltoft, "Towards a Unification of Information Theoretic Learning and Kernel Methods," Proc. IEEE Workshop Machine Learning for Signal Processing, 2004.
[10] C. He and M. Girolami, "Novelty Detection Employing an $L_2$ Optimal Nonparametric Density Estimator," Pattern Recognition Letters, vol. 25, pp. 1389-1397, 2004.
[11] P. Hall and M.P. Wand, "On Nonparametric Discrimination Using Density Differences," Biometrika, vol. 75, no. 3, pp. 541-547, Sept. 1988.
[12] M. Di Marzio and C.C. Taylor, "Kernel Density Classification and Boosting: An $L_2$ Analysis," Statistics and Computing, vol. 15, pp. 113-123, Apr. 2005.
[13] P. Meinicke, T. Twellmann, and H. Ritter, "Discriminative Densities from Maximum Contrast Estimation," Proc. Advances in Neural Information Processing Systems, vol. 15, pp. 985-992, 2002.
[14] C.T. Wolverton and T.J. Wagner, "Asymptotically Optimal Discriminant Functions for Pattern Classification," IEEE Trans. Information Theory, vol. 15, no. 2, pp. 258-265, Mar. 1969.
[15] K. Pelckmans, J.A.K. Suykens, and B. De Moor, "A Risk Minimization Principle for a Class of Parzen Estimators," Proc. Advances in Neural Information Processing Systems, vol. 20, Dec. 2007.
[16] J. Kim and C. Scott, "Kernel Classification via Integrated Squared Error," Proc. IEEE Workshop Statistical Signal Processing, Aug. 2007.
[17] J. Kim and C. Scott, "Performance Analysis for $L_2$ Kernel Classification," Proc. Advances in Neural Information Processing Systems, vol. 21, Dec. 2008.
[18] M.P. Wand and M.C. Jones, Kernel Smoothing. Chapman & Hall, 1995.
[19] J.C. Platt, "Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines," Technical Report MSR-TR-98-14, Microsoft Research, Apr. 1998.
[20] D. Crisp and C. Burges, "A Geometric Interpretation of $\nu$-SVM Classifiers," Proc. Advances in Neural Information Processing Systems, vol. 12, 1999.
[21] K. Bennett, N. Cristianini, J. Shawe-Taylor, and D. Wu, "Enlarging the Margins in Perceptron Decision Trees," Machine Learning, vol. 41, pp. 295-313, 2000.
[22] A.S. Paulson, E.W. Holcomb, and R.A. Leitch, "The Estimation of the Parameters of the Stable Laws," Biometrika, vol. 62, pp. 163-170, 1975.
[23] C.R. Heathcote, "The Integrated Squared Error Estimation of Parameters," Biometrika, vol. 64, pp. 255-264, 1977.
[24] J.A.K. Suykens and J. Vandewalle, "Least Squares Support Vector Machine Classifiers," Neural Processing Letters, vol. 44, no. 8, pp. 293-300, June 1999.
[25] J.R. Schechuk, "An Introduction to the Conjugate Gradient Method without the Agonizing Pain," Technical Report MSR-TR-98-14, Aug. 1994.
[26] D. Berry, K. Chaloner, and J. Geweke, Bayesian Analysis in Statistics and Econometrics: Essays in Honor of Arnold Zellner. Wiley, 1996.
[27] A. Gretton, R. Herbrich, A. Smola, O. Bousquet, and B. Schölkopf, "Kernel Methods for Measuring Independence," J. Machine Learning Research, vol. 6, pp. 2075-2129, 2005.
[28] C.-C. Chang and C.-J. Lin, LIBSVM: A Library for Support Vector Machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
[29] K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf, "An Introduction to Kernel-Based Learning Algorithms," IEEE Trans. Neural Networks, vol. 12, no. 2, pp. 181-201, Mar. 2001.

Index Terms:
Kernel methods, sparse classifiers, integrated squared error, difference of densities, SMO algorithm.
JooSeuk Kim, Clayton D. Scott, "L₂ Kernel Classification," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 10, pp. 1822-1831, Oct. 2010, doi:10.1109/TPAMI.2009.188