Issue No.04 - April (2009 vol.21)
pp: 523-536
Chih-Ming Hsu, National Taiwan University, Taipei
Ming-Syan Chen, National Taiwan University, Taipei
ABSTRACT
Effective distance functions in high-dimensional data space are very important in solutions to many data mining problems. Recent research has shown that if the Pearson variation of the distance distribution converges to zero with increasing dimensionality, the distance function becomes unstable (or meaningless) in high-dimensional space, even with the commonly used $L_p$ metric in Euclidean space. This result has spawned many studies along the same lines. However, the necessary condition for instability of a distance function, which is required for function design, remains unknown. In this paper, we prove that several important conditions are in fact equivalent to instability. Based on these theoretical results, we employ some effective and valid indices for testing the stability of a distance function. In addition, the theoretical analysis suggests that instability is rooted in the variation of the distance distribution. To demonstrate the theoretical results, we design a meaningful distance function, called the Shrinkage-Divergence Proximity (SDP), based on a given distance function. It is shown empirically that the SDP significantly outperforms other measures in terms of stability in high-dimensional data space, and is thus more suitable for distance-based clustering applications.
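The concentration effect described in the abstract can be observed numerically. The following is a minimal sketch, not the authors' procedure and not the SDP function: it assumes that the Pearson variation is the coefficient of variation (standard deviation divided by mean) of the distances from a query point to a sample, and it assumes points drawn uniformly and independently from the unit hypercube. The function names `lp_distance` and `pearson_variation`, the sample size, and the seed are illustrative choices, not taken from the paper.

```python
import random
import statistics


def lp_distance(x, y, p=2.0):
    """L_p distance between two equal-length sequences of floats."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def pearson_variation(dim, n_points=200, p=2.0, seed=0):
    """Coefficient of variation (std / mean) of the L_p distances from one
    random query point to n_points random points in [0, 1]^dim.

    Assumption: coordinates are i.i.d. uniform; real data may behave differently.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [
        lp_distance(query, [rng.random() for _ in range(dim)], p)
        for _ in range(n_points)
    ]
    return statistics.pstdev(dists) / statistics.fmean(dists)


if __name__ == "__main__":
    # Print the ratio for a few dimensionalities to see how it changes.
    for dim in (2, 10, 100, 1000):
        print(dim, round(pearson_variation(dim), 4))
```

Under these assumptions the printed ratio should shrink as the dimensionality grows, which is the condition the abstract associates with an unstable distance function. Data with strongly correlated or clustered attributes need not show the same decay.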
INDEX TERMS
Data mining, Feature extraction or construction, Clustering
CITATION
Chih-Ming Hsu, Ming-Syan Chen, "On the Design and Applicability of Distance Functions in High-Dimensional Data Space", IEEE Transactions on Knowledge & Data Engineering, vol.21, no. 4, pp. 523-536, April 2009, doi:10.1109/TKDE.2008.178