Issue No. 04 - April (2008 vol. 20)
A novel similarity, neighborhood counting measure, was recently proposed which counts the number of neighborhoods of a pair of data points. This similarity can handle numerical and categorical attributes in a conceptually uniform way, can be calculated efficiently through a simple formula, and gives good performance when tested in the framework of k-nearest neighbor classifier. In particular it consistently outperforms a combination of the classical Euclidean distance and Hamming distance. This measure was also shown to be related to a probability formalism, G probability, which is induced from a target probability function P. It was however unclear how G is related to P, especially for classification. Therefore it was not possible to explain some characteristic features of the neighborhood counting measure. In this paper we show that G is a linear function of P, and G-based Bayes classification is equivalent to P-based Bayes classification. We also show that the k-nearest neighbor classifier, when weighted by the neighborhood counting measure, is in fact an approximation of the G-based Bayes classifier, and furthermore, the P-based Bayes classifier. Additionally we show that the neighborhood counting measure remains unchanged when binary attributes are treated as categorical or numerical data. This is a feature that is not shared by other distance measures, to the best of our knowledge. This study provides a theoretical insight into the neighborhood counting measure.
Decision support, Clustering, classification, and association rules
F. Murtagh and H. Wang, "A Study of the Neighborhood Counting Similarity," in IEEE Transactions on Knowledge & Data Engineering, vol. 20, no. , pp. 449-461, 2007.