An Efficient Subspace Sampling Framework for High-Dimensional Data Reduction, Selectivity Estimation, and Nearest-Neighbor Search
October 2004 (vol. 16 no. 10)
pp. 1247-1262
Data reduction can improve the storage, transfer-time, and processing requirements of very large data sets. A key challenge in designing effective data reduction techniques is preserving the ability to use the reduced format directly for a wide range of database and data mining applications. In this paper, we propose the novel idea of hierarchical subspace sampling for creating a reduced representation of the data. The method naturally and effectively estimates the local implicit dimensionality of each point and thereby creates a variable-dimensionality reduced representation, adapting to the behavior of the immediate locality of each data point. An important property of the subspace sampling technique is that its compression efficiency improves with increasing database size. Because of its sampling approach, the procedure is extremely fast and scales linearly with both data set size and dimensionality. We also propose new and effective solutions to problems such as selectivity estimation and approximate nearest-neighbor search, obtained by exploiting the locality-specific subspace characteristics of the data that the subspace sampling technique reveals.
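The abstract's central idea, estimating the implicit dimensionality of each point's immediate locality, can be illustrated with a minimal sketch. Note that this uses plain local PCA over a brute-force nearest-neighbor sample rather than the paper's hierarchical subspace sampling algorithm; the function and parameter names (`local_dimensionality`, `var_threshold`) are hypothetical choices for illustration only.

```python
import numpy as np

def local_dimensionality(data, point, k=50, var_threshold=0.95):
    """Estimate the implicit dimensionality of a point's locality.

    Illustrative stand-in for the paper's sampling-based estimator:
    take the k nearest neighbors of `point`, run PCA on that locality,
    and count how many principal directions are needed to retain
    `var_threshold` of the local variance.
    """
    # k nearest neighbors by Euclidean distance (brute force)
    dists = np.linalg.norm(data - point, axis=1)
    neighbors = data[np.argsort(dists)[:k]]

    # PCA via eigendecomposition of the local covariance matrix
    centered = neighbors - neighbors.mean(axis=0)
    cov = centered.T @ centered / len(neighbors)
    eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]  # descending

    # smallest number of directions capturing var_threshold of variance
    ratios = np.cumsum(eigvals) / eigvals.sum()
    return int(np.searchsorted(ratios, var_threshold) + 1)

# A 20-dimensional data set whose points actually lie near a 3-d subspace:
rng = np.random.default_rng(0)
basis = rng.standard_normal((3, 20))
data = rng.standard_normal((1000, 3)) @ basis \
    + 0.01 * rng.standard_normal((1000, 20))
d = local_dimensionality(data, data[0])
```

Here `d` comes out far below the ambient dimensionality of 20, which is exactly the property a variable-dimensionality representation exploits: each locality is stored in only as many dimensions as its behavior requires.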

References:
[1] D. Achlioptas, "Database-Friendly Random Projections," Proc. ACM PODS Conf., 2001.
[2] C.C. Aggarwal, "Hierarchical Subspace Sampling: A Unified Framework for High Dimensional Data Reduction, Selectivity Estimation, and Nearest-Neighbor Search," Proc. ACM SIGMOD Conf., 2002.
[3] C.C. Aggarwal and P.S. Yu, "Finding Generalized Projected Clusters in High Dimensional Spaces," Proc. ACM SIGMOD Conf., 2000.
[4] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules in Large Databases," Proc. Very Large Data Bases Conf., 1994.
[5] S. Babu, M. Garofalakis, and R. Rastogi, "SPARTAN: A Model-Based Semantic Compression System for Massive Data Tables," Proc. ACM SIGMOD Conf., 2001.
[6] L. Breiman, "Bagging Predictors," Machine Learning, vol. 24, pp. 123-140, 1996.
[7] N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger, "The R*-Tree: An Efficient and Robust Access Method for Points and Rectangles," Proc. ACM SIGMOD Conf., pp. 322-331, 1990.
[8] E. Bingham and H. Mannila, "Random Projection in Dimensionality Reduction: Applications to Image and Text Data," Proc. ACM SIGKDD Conf., 2001.
[9] K.P. Chan and A. Fu, "Efficient Time Series Matching by Wavelets," Proc. Int'l Conf. Data Eng., 1999.
[10] K. Chakrabarti and S. Mehrotra, "Local Dimensionality Reduction: A New Approach to Indexing High Dimensional Spaces," Proc. Very Large Data Bases Conf., 2000.
[11] A. Deshpande, M. Garofalakis, and R. Rastogi, "Independence Is Good: Dependency-Based Histogram Synopses for High-Dimensional Data," Proc. ACM SIGMOD Conf., 2001.
[12] C. Faloutsos and K.-I. Lin, "FastMap: A Fast Algorithm for Indexing, Data-Mining and Visualization of Traditional and Multimedia Datasets," Proc. ACM SIGMOD Conf., 1995.
[13] D. Gunopulos, G. Kollios, V. Tsotras, and C. Domeniconi, "Approximating Multi-Dimensional Aggregate Range Queries over Real Attributes," Proc. ACM SIGMOD Conf., 2000.
[14] A. Guttman, "R-Trees: A Dynamic Index Structure for Spatial Searching," Proc. ACM SIGMOD Conf., pp. 47-57, 1984.
[15] P. Indyk and R. Motwani, "Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality," Proc. ACM Symp. Theory of Computing, pp. 604-613, 1998.
[16] H.V. Jagadish, J. Madar, and R. Ng, "Semantic Compression and Pattern Extraction with Fascicles," Proc. Very Large Data Bases Conf., 1999.
[17] W. Johnson and J. Lindenstrauss, "Extensions of Lipschitz Mappings into a Hilbert Space," Proc. Conf. Modern Analysis and Probability, pp. 189-206, 1984.
[18] I.T. Jolliffe, Principal Component Analysis. New York: Springer-Verlag, 1986.
[19] E. Keogh, S. Chu, and M. Pazzani, "Ensemble-Index: A New Approach to Indexing Large Databases," Proc. ACM SIGKDD Conf., 2001.
[20] S. Berchtold, D.A. Keim, and H.-P. Kriegel, "The X-Tree: An Index Structure for High-Dimensional Data," Proc. Very Large Data Bases Conf., 1996.
[21] K.-I. Lin, H.V. Jagadish, and C. Faloutsos, "The TV-Tree: An Index Structure for High-Dimensional Data," VLDB J., vol. 3, no. 4, pp. 517-542, 1994.
[22] D.A. Keim and M. Heczko, "Wavelets and Their Applications in Databases," Proc. Int'l Conf. Data Eng., 2001.
[23] E.J. Keogh, K. Chakrabarti, M. Pazzani, and S. Mehrotra, "Dimensionality Reduction for Fast Similarity Search in Large Time Series Databases," Knowledge and Information Systems, vol. 3, no. 3, pp. 263-286, 2000.
[24] E.J. Keogh, K. Chakrabarti, M. Pazzani, and S. Mehrotra, "Locally Adaptive Dimensionality Reduction for Indexing Large Time Series Databases," Proc. ACM SIGMOD Conf., 2001.
[25] N. Roussopoulos, S. Kelley, and F. Vincent, "Nearest Neighbor Queries," Proc. ACM SIGMOD Conf., pp. 71-79, 1995.
[26] C.H. Papadimitriou, P. Raghavan, H. Tamaki, and S. Vempala, "Latent Semantic Indexing: A Probabilistic Analysis," Proc. ACM PODS Conf., 1998.
[27] V. Poosala and Y. Ioannidis, "Selectivity Estimation without the Attribute Value Independence Assumption," Proc. Very Large Data Bases Conf., 1997.
[28] V. Poosala, Y. Ioannidis, P. Haas, and E. Shekita, "Improved Histograms for Selectivity Estimation of Range Predicates," Proc. ACM SIGMOD Conf., 1996.
[29] K.V. Ravi Kanth, D. Agrawal, and A. Singh, "Dimensionality Reduction for Similarity Searching in Dynamic Databases," Proc. ACM SIGMOD Conf., 1998.
[30] S.T. Roweis and L.K. Saul, "Nonlinear Dimensionality Reduction by Locally Linear Embedding," Science, vol. 290, pp. 2323-2326, Dec. 2000.
[31] J.B. Tenenbaum, V. de Silva, and J.C. Langford, "A Global Geometric Framework for Nonlinear Dimensionality Reduction," Science, vol. 290, Dec. 2000.
[32] D. Wu, D. Agrawal, and A. El Abbadi, "A Comparison of DFT and DWT Based Similarity Search in Time Series Databases," Proc. Ninth Int'l Conf. Information and Knowledge Management, 2000.
[33] J. Ziv and A. Lempel, "A Universal Algorithm for Sequential Data Compression," IEEE Trans. Information Theory, vol. 23, no. 3, pp. 337-343, 1977.

Index Terms:
High dimensions, dimensionality reduction, nearest-neighbor search, selectivity estimation.
Citation:
Charu C. Aggarwal, "An Efficient Subspace Sampling Framework for High-Dimensional Data Reduction, Selectivity Estimation, and Nearest-Neighbor Search," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 10, pp. 1247-1262, Oct. 2004, doi:10.1109/TKDE.2004.49