Issue No. 10, October 2010 (vol. 32), pp. 1822-1831
JooSeuk Kim , University of Michigan, Ann Arbor
Clayton D. Scott , University of Michigan, Ann Arbor
Nonparametric kernel methods are widely used and have proven successful in many statistical learning problems. Well-known examples include the kernel density estimate (KDE) for density estimation and the support vector machine (SVM) for classification. We propose a kernel classifier that optimizes the L_2, or integrated squared error (ISE), of a "difference of densities." We focus on the Gaussian kernel, although the method applies to other kernels suitable for density estimation. Like the SVM, the classifier is sparse and results from solving a quadratic program. We provide statistical performance guarantees for the proposed L_2 kernel classifier in the form of a finite sample oracle inequality and strong consistency in the sense of both ISE and probability of error. A special case of our analysis applies to a previously introduced ISE-based method for kernel density estimation. For dimensionality greater than 15, the basic L_2 kernel classifier performs poorly in practice. Thus, we extend the method through the introduction of a natural regularization parameter, which allows it to remain competitive with the SVM in high dimensions. Simulation results for both synthetic and real-world data are presented.
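The decision rule underlying this family of methods thresholds an estimated difference of class-conditional densities. The following minimal sketch illustrates the idea with the simple plug-in variant: estimate each class density with a Gaussian KDE and classify by the sign of their difference. This is not the paper's L_2-optimized quadratic program (which learns sparse kernel weights); the function names, the fixed bandwidth, and the one-dimensional synthetic data are all illustrative assumptions.

```python
import numpy as np

def gaussian_kde(x, samples, sigma):
    """Gaussian kernel density estimate at points x (1-D case, fixed bandwidth sigma)."""
    d = x[:, None] - samples[None, :]
    return np.exp(-0.5 * (d / sigma) ** 2).mean(axis=1) / (sigma * np.sqrt(2 * np.pi))

def dod_classify(x, pos, neg, sigma):
    """Classify by the sign of the estimated difference of densities f_+ - f_-."""
    return np.where(gaussian_kde(x, pos, sigma) >= gaussian_kde(x, neg, sigma), 1, -1)

# Illustrative data: two well-separated Gaussian classes.
rng = np.random.default_rng(0)
pos = rng.normal(-1.0, 1.0, 200)
neg = rng.normal(+1.0, 1.0, 200)
print(dod_classify(np.array([-2.0, 2.0]), pos, neg, sigma=0.5))
```

The L_2 classifier of the paper replaces the uniform KDE weights above with sparse weights chosen to minimize the ISE criterion via a quadratic program, which is what yields the SVM-like sparsity noted in the abstract.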
Kernel methods, sparse classifiers, integrated squared error, difference of densities, SMO algorithm.
JooSeuk Kim, Clayton D. Scott, "L₂ Kernel Classification", IEEE Transactions on Pattern Analysis & Machine Intelligence, vol.32, no. 10, pp. 1822-1831, October 2010, doi:10.1109/TPAMI.2009.188