Issue No.04 - April (2008 vol.20)
pp: 449-461
A novel similarity measure, the neighborhood counting measure, was recently proposed; it counts the number of neighborhoods common to a pair of data points. This similarity can handle numerical and categorical attributes in a conceptually uniform way, can be calculated efficiently through a simple formula, and performs well when tested within the k-nearest neighbor classification framework. In particular, it consistently outperforms a combination of the classical Euclidean distance and Hamming distance. The measure was also shown to be related to a probability formalism, G probability, which is induced from a target probability function P. It was, however, unclear how G is related to P, especially for classification, so some characteristic features of the neighborhood counting measure could not be explained. In this paper we show that G is a linear function of P, and that G-based Bayes classification is equivalent to P-based Bayes classification. We also show that the k-nearest neighbor classifier, when weighted by the neighborhood counting measure, is in fact an approximation of the G-based Bayes classifier and, consequently, of the P-based Bayes classifier. Additionally, we show that the neighborhood counting measure remains unchanged whether binary attributes are treated as categorical or as numerical data. To the best of our knowledge, this feature is not shared by other distance measures. This study provides theoretical insight into the neighborhood counting measure.
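The "simple formula" and the binary-attribute property mentioned above can be illustrated with a minimal Python sketch. The abstract does not restate the formula, so the per-attribute counts below are an assumption taken from the earlier neighborhood counting paper (Wang, 2006): intervals over an integer-valued range for numerical attributes, and subsets of the value domain for categorical attributes. The function names and the `specs` layout are hypothetical, chosen only for this example.

```python
from itertools import product

def ncm_numerical(x, y, lo, hi):
    """Assumed numerical count: intervals [a, b] with integer endpoints
    inside [lo, hi] that cover both x and y."""
    return (hi - max(x, y) + 1) * (min(x, y) - lo + 1)

def ncm_categorical(x, y, m):
    """Assumed categorical count: subsets of an m-value domain
    that contain both x and y."""
    return 2 ** (m - 1) if x == y else 2 ** (m - 2)

def ncm(x, y, specs):
    """Product of per-attribute counts.

    specs is a hypothetical description of each attribute:
    ("num", lo, hi) for numerical, ("cat", m) for categorical.
    """
    total = 1
    for xi, yi, spec in zip(x, y, specs):
        if spec[0] == "num":
            _, lo, hi = spec
            total *= ncm_numerical(xi, yi, lo, hi)
        else:
            _, m = spec
            total *= ncm_categorical(xi, yi, m)
    return total

# A binary attribute with values {0, 1}: compare the numerical reading
# (range [0, 1]) with the categorical reading (domain of size 2).
binary_agrees = all(
    ncm_numerical(x, y, 0, 1) == ncm_categorical(x, y, 2)
    for x, y in product((0, 1), repeat=2)
)
print(binary_agrees)
```

Under these assumed formulas, both readings of a binary attribute give a count of 2 when the two values agree and 1 when they differ, which is consistent with the invariance claimed in the abstract. The sketch covers only the similarity itself, not the weighted k-nearest neighbor classifier or the G probability analysis.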
Decision support, Clustering, classification, and association rules
Hui Wang, "A Study of the Neighborhood Counting Similarity", IEEE Transactions on Knowledge & Data Engineering, vol.20, no. 4, pp. 449-461, April 2008, doi:10.1109/TKDE.2007.190721