Issue No.04 - April (2008 vol.20)
pp: 449-461
ABSTRACT
A novel similarity measure, the neighborhood counting measure, was recently proposed; it counts the number of neighborhoods of a pair of data points. This similarity handles numerical and categorical attributes in a conceptually uniform way, can be calculated efficiently through a simple formula, and performs well when tested in the framework of the k-nearest neighbor classifier. In particular, it consistently outperforms a combination of the classical Euclidean distance and Hamming distance. The measure was also shown to be related to a probability formalism, G probability, which is induced from a target probability function P. It was, however, unclear how G is related to P, especially for classification, so some characteristic features of the neighborhood counting measure could not be explained. In this paper, we show that G is a linear function of P and that G-based Bayes classification is equivalent to P-based Bayes classification. We also show that the k-nearest neighbor classifier, when weighted by the neighborhood counting measure, is in fact an approximation of the G-based Bayes classifier and, hence, of the P-based Bayes classifier. Additionally, we show that the neighborhood counting measure remains unchanged whether binary attributes are treated as categorical or as numerical data. To the best of our knowledge, this feature is not shared by other distance measures. This study provides theoretical insight into the neighborhood counting measure.
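The abstract does not state the "simple formula", so the following is only a sketch of one way the counting idea can be made concrete. It assumes that a neighborhood of a numerical attribute is an integer interval inside the attribute's range, that a neighborhood of a categorical attribute is a subset of its domain, and that per-attribute counts multiply. The function names and the `spec` encoding are hypothetical, chosen for illustration.

```python
from itertools import combinations

def count_numeric(x, y, lo, hi):
    # Assumption: neighborhoods are integer intervals [a, b] with lo <= a <= b <= hi.
    # An interval contains both points iff a <= min(x, y) and b >= max(x, y).
    return (min(x, y) - lo + 1) * (hi - max(x, y) + 1)

def count_categorical(x, y, m):
    # Assumption: neighborhoods are subsets of a domain with m values.
    # Subsets containing both points: 2^(m-1) if x == y, else 2^(m-2).
    return 2 ** (m - 1) if x == y else 2 ** (m - 2)

def ncm(x, y, spec):
    # Hypothetical spec: one entry per attribute, ("num", lo, hi) or ("cat", m).
    total = 1
    for xi, yi, attr in zip(x, y, spec):
        if attr[0] == "num":
            total *= count_numeric(xi, yi, attr[1], attr[2])
        else:
            total *= count_categorical(xi, yi, attr[1])
    return total

def brute_force_subsets(x, y, domain):
    # Enumerate every subset of the domain and count those holding both points.
    return sum(
        1
        for r in range(len(domain) + 1)
        for s in combinations(domain, r)
        if x in s and y in s
    )

# A binary attribute gives the same count under either treatment.
for a in (0, 1):
    for b in (0, 1):
        assert count_numeric(a, b, 0, 1) == count_categorical(a, b, 2)

x = (3, "red", 1)
y = (5, "blue", 1)
as_numeric = ncm(x, y, [("num", 0, 9), ("cat", 3), ("num", 0, 1)])
as_categorical = ncm(x, y, [("num", 0, 9), ("cat", 3), ("cat", 2)])
```

Under these assumptions, the loop over binary values illustrates the invariance mentioned in the abstract: the count for a binary attribute is 2 when the two values agree and 1 when they differ, whether the attribute is encoded as numerical over {0, 1} or as categorical with two values.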
INDEX TERMS
Decision support, Clustering, classification, and association rules
CITATION
Hui Wang, Fionn Murtagh, "A Study of the Neighborhood Counting Similarity", IEEE Transactions on Knowledge & Data Engineering, vol.20, no. 4, pp. 449-461, April 2008, doi:10.1109/TKDE.2007.190721