From Sample Similarity to Ensemble Similarity: Probabilistic Distance Measures in Reproducing Kernel Hilbert Space
June 2006 (vol. 28 no. 6)
pp. 917-929
This paper addresses the problem of characterizing ensemble similarity from sample similarity in a principled manner. Using a reproducing kernel as a characterization of sample similarity, we suggest a probabilistic distance measure in the reproducing kernel Hilbert space (RKHS) as the ensemble similarity. Assuming normality in the RKHS, we derive analytic expressions for probabilistic distance measures that are commonly used in many applications, such as the Chernoff distance (with the Bhattacharyya distance as a special case) and the Kullback-Leibler divergence. Since the reproducing kernel implicitly embeds a nonlinear mapping, our approach presents a new way to study these distances, whose feasibility and efficiency are demonstrated using experiments with synthetic and real examples. Further, we extend the ensemble similarity to a reproducing kernel for ensembles and study the ensemble similarity for more general data representations.
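The distances named in the abstract have well-known closed forms for Gaussian densities. The sketch below is only an illustration of that idea and not the paper's RKHS derivation: it fits a Gaussian to each ensemble directly in the input space and assumes a diagonal covariance with strictly positive variances, which keeps the code free of matrix algebra. The function names (`gaussian_fit_diag`, `chernoff_distance`, `kl_divergence`) and the toy ensembles are hypothetical.

```python
import math

def gaussian_fit_diag(samples):
    """Estimate per-dimension mean and variance of an ensemble.

    Assumption: diagonal covariance (dimensions treated as independent).
    """
    n = len(samples)
    dim = len(samples[0])
    mean = [sum(x[j] for x in samples) / n for j in range(dim)]
    var = [sum((x[j] - mean[j]) ** 2 for x in samples) / n for j in range(dim)]
    return mean, var

def chernoff_distance(m1, v1, m2, v2, a1=0.5):
    """Chernoff distance between two diagonal Gaussians.

    a1 + a2 = 1; a1 = 0.5 gives the Bhattacharyya distance.
    Assumption: all variances are strictly positive.
    """
    a2 = 1.0 - a1
    total = 0.0
    for mu1, s1, mu2, s2 in zip(m1, v1, m2, v2):
        mix = a1 * s1 + a2 * s2
        # Mean-separation term, weighted by the mixed variance.
        total += 0.5 * a1 * a2 * (mu1 - mu2) ** 2 / mix
        # Covariance-mismatch term (log-determinant ratio, per dimension).
        total += 0.5 * (math.log(mix) - a1 * math.log(s1) - a2 * math.log(s2))
    return total

def kl_divergence(m1, v1, m2, v2):
    """Kullback-Leibler divergence KL(p1 || p2) between two diagonal Gaussians."""
    total = 0.0
    for mu1, s1, mu2, s2 in zip(m1, v1, m2, v2):
        total += 0.5 * ((mu1 - mu2) ** 2 / s2 + math.log(s2 / s1) + s1 / s2 - 1.0)
    return total

# Two toy ensembles of 2-D samples (made-up numbers for illustration).
ensemble_a = [(0.0, 0.0), (1.0, 0.5), (2.0, 1.0), (1.0, 1.5)]
ensemble_b = [(3.0, 2.0), (4.0, 2.5), (5.0, 3.5), (4.0, 3.0)]

mean_a, var_a = gaussian_fit_diag(ensemble_a)
mean_b, var_b = gaussian_fit_diag(ensemble_b)

bhattacharyya_ab = chernoff_distance(mean_a, var_a, mean_b, var_b)
kl_ab = kl_divergence(mean_a, var_a, mean_b, var_b)
print("Bhattacharyya distance:", bhattacharyya_ab)
print("KL divergence:", kl_ab)
```

The Bhattacharyya distance is symmetric in its two arguments, whereas the Kullback-Leibler divergence is not, so swapping the ensembles generally changes the second value but not the first.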


Index Terms:
Ensemble similarity, kernel methods, Chernoff distance, Bhattacharyya distance, Kullback-Leibler (KL) divergence/relative entropy, Patrick-Fisher distance, Mahalanobis distance, reproducing kernel Hilbert space.
Shaohua Kevin Zhou, Rama Chellappa, "From Sample Similarity to Ensemble Similarity: Probabilistic Distance Measures in Reproducing Kernel Hilbert Space," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 6, pp. 917-929, June 2006, doi:10.1109/TPAMI.2006.120