Issue No.04 - April (2008 vol.20)

pp: 449-461

ABSTRACT

A novel similarity measure, the neighborhood counting measure, was recently proposed; it counts the number of neighborhoods of a pair of data points. This similarity can handle numerical and categorical attributes in a conceptually uniform way, can be calculated efficiently through a simple formula, and performs well when tested within the k-nearest neighbor classifier framework. In particular, it consistently outperforms a combination of the classical Euclidean distance and Hamming distance. This measure was also shown to be related to a probability formalism, G probability, which is induced from a target probability function P. It was, however, unclear how G is related to P, especially for classification. It was therefore not possible to explain some characteristic features of the neighborhood counting measure. In this paper, we show that G is a linear function of P, and that G-based Bayes classification is equivalent to P-based Bayes classification. We also show that the k-nearest neighbor classifier, when weighted by the neighborhood counting measure, is in fact an approximation of the G-based Bayes classifier and, furthermore, of the P-based Bayes classifier. Additionally, we show that the neighborhood counting measure remains unchanged when binary attributes are treated as categorical or numerical data. To the best of our knowledge, this feature is not shared by other distance measures. This study provides theoretical insight into the neighborhood counting measure.
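The claim about binary attributes can be checked on a toy case. The sketch below does not reproduce the paper's formula, which this page does not give. It assumes, purely for illustration, that a neighborhood of a numerical attribute is an integer-endpoint interval within the attribute's range, and that a neighborhood of a categorical attribute is a non-empty subset of its values. It counts by brute force the neighborhoods that cover both values of a pair. The function names are hypothetical.

```python
from itertools import combinations


def count_interval_neighborhoods(x, y, lo, hi):
    """Count integer intervals [a, b], lo <= a <= b <= hi, covering both x and y.

    Assumption: numerical neighborhoods are integer-endpoint intervals.
    """
    return sum(
        1
        for a in range(lo, hi + 1)
        for b in range(a, hi + 1)
        if a <= min(x, y) and max(x, y) <= b
    )


def count_subset_neighborhoods(x, y, values):
    """Count non-empty subsets of `values` that contain both x and y.

    Assumption: categorical neighborhoods are subsets of the value set.
    """
    values = list(values)
    return sum(
        1
        for r in range(1, len(values) + 1)
        for subset in combinations(values, r)
        if x in subset and y in subset
    )


if __name__ == "__main__":
    # Treat one binary attribute first as numerical on [0, 1],
    # then as categorical with values {0, 1}, and compare the counts.
    for x, y in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        as_numerical = count_interval_neighborhoods(x, y, 0, 1)
        as_categorical = count_subset_neighborhoods(x, y, [0, 1])
        print(f"x={x} y={y} numerical={as_numerical} categorical={as_categorical}")
```

Under these assumptions the two counts agree for every pair of binary values, which is consistent with the invariance stated in the abstract. For attributes with more than two values the two treatments generally give different counts.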

INDEX TERMS

Decision support, Clustering, classification, and association rules

CITATION

Hui Wang, Fionn Murtagh, "A Study of the Neighborhood Counting Similarity", *IEEE Transactions on Knowledge & Data Engineering*, vol. 20, no. 4, pp. 449-461, April 2008, doi:10.1109/TKDE.2007.190721