This Article 
 Bibliographic References 
 Add to: 
Access Structures for Angular Similarity Queries
November 2006 (vol. 18 no. 11)
pp. 1512-1525
Angular similarity measures have been utilized by several database applications to define semantic similarity between various data types such as text documents, time-series, images, and scientific data. Although similarity searches based on Euclidean distance have been extensively studied in the database community, processing of angular similarity searches has been relatively untouched. Problems due to a mismatch in the underlying geometry as well as the high dimensionality of the data make current techniques either inapplicable or their use results in poor performance. This brings up the need for effective indexing methods for angular similarity queries. We first discuss how to efficiently process such queries and propose effective access structures suited to angular similarity measures. In particular, we propose two classes of access structures, namely, Angular-sweep and Cone-shell, which perform different types of quantization based on the angular orientation of the data objects. We also develop query processing algorithms that utilize these structures as dense indices. The proposed techniques are shown to be scalable with respect to both dimensionality and the size of the data. Our experimental results on real data sets from various applications show two to three orders of magnitude of speedup over the current techniques.

[1] R.K. Ando, “Latent Semantic Space: Iterative Scaling Improves Precision of Inter-Document Similarity Measurement,” Proc. 23rd ACM SIGIR Conf., pp. 216-223, 2000.
[2] M. Andrec, P. Du, and R.M. Levy, “Protein Structural Motif Recognition via NMR Residual Dipolar Couplings,” J. Am. Chemical Soc., no. 123, pp. 1222-1229, 2001.
[3] D. Androutsos, K. Plataniotis, and A. Venetsanopoulos, “A Novel Vector-Based Approach to Color Image Retrieval Using a Vector Angular-Based Distance Measure,” Computer Vision and Image Understanding, vol. 75, nos. 1/2, pp. 46-58, Aug. 1999.
[4] A. Baruffolo, “R-Trees for Astronomical Data Indexing,” ASP Conf. Ser., Astronomical Data Analysis Software and Systems VII, pp.375, 1999.
[5] N. Beckmann, H. Kriegel, R. Schneider, and B. Seeger, “The ${\rm R}^{\ast}$ Tree: An Efficient and Robust Access Method for Points and Rectangles,” Proc. ACM SIGMOD Int'l Conf. Management of Data, pp. 322-331, May 1990.
[6] S. Berchtold, C. Bohm, H. Jagadish, H. Kriegel, and J. Sander, “Independent Quantization: An Index Compression Technique for High-Dimensional Data Spaces,” Proc. 16th Int'l Conf. Data Eng., pp. 577-588, 2000.
[7] S. Berchtold, C. Bohm, H.V. Jagadish, H.-P. Kriegel, and J. Sander, “Independent Quantization: An Index Compression Technique for High-Dimensional Data Spaces,” Proc. Int'l Conf. Data Eng., pp.577-588, 2000.
[8] S. Berchtold, C. Bohm, D. Keim, and H. Kriegel, “A Cost Model for Nearest Neighbor Search in High-Dimensional Data Space,” Proc. ACM Symp. Principles of Database Systems, pp. 78-86, June 1997.
[9] S. Berchtold, C. Bohm, and H.-P. Kriegel, “The Pyramid-Technique: Towards Breaking the Curse of Dimensionality,” Proc. ACM SIGMOD Int'l Conf. Management of Data, pp. 142-153, June 1998.
[10] P.V. Biron and D.H. Kraft, “New Methods for Relevance Feedback: Improving Information Retrieval Performance,” Proc. ACM Symp. Applied Computing, pp. 482-487, 1995.
[11] I. Choi, J. Kwon, and S. Kim, “Local Feature Frequency Profile: A Method to Measure Structural Similarity in Proteins,” Proc. Nat'l Aacademy of Sciences of the US, no. 101, pp. 3797-3802, Mar. 2004.
[12] P. Ciaccia, M. Patella, and P. Zezula, “M-Tree: An Efficient Access Method for Similarity Search in Metric Space,” Proc. 23rd Very Large Data Bases Conf., pp. 426-435, 1997.
[13] A. Debelack, J. Dehn, L. Muchinsky, and D. Smith, “Next Generation Air Traffic Control Automation,” IBM Systems J., vol. 24, no. 1, pp. 63-77, 1995.
[14] S.C. Deerwester, S.T. Dumais, T.K. Landauer, G.W. Furnas, and R.A. Harshman, “Indexing by Latent Semantic Analysis,” J. Am. Soc. Information Science, vol. 41, no. 6, pp. 391-407, 1990.
[15] D. Hull, “Improving Text Retrieval for the Routing Problem Using Latent Semantic Indexing,” Proc. 17th ACM-SIGIR Conf., pp. 282-291, 1994.
[16] M.P.T. Doom and M. Raymer, “GA-Facilitated Knowledge Discovery and Pattern Recognition Optimization Applied to the Biochemistry of Protein Solvation,” Proc. Genetic and Evolutionary Computation Conf., pp. 426-437, May 2004.
[17] A. Dumitras and A.N. Venetsanopoulos, “Angular Map-Driven Snakes with Application to Object Shape Description in Color Images,” IEEE Trans. Image Processing, vol. 10, no. 12, pp. 1851-1859, Dec. 2001.
[18] H. Ferhatosmanoglu, D. Agrawal, and A.E. Abbadi, “Concentric Hyperspaces and Disk Allocation for Fast Parallel Range Searching,” Proc. Int'l Conf. Data Eng., pp. 608-615, Mar. 1999.
[19] H. Ferhatosmanoglu, D. Agrawal, and A.E. Abbadi, “Efficient Processing of Conical Queries,” Proc. ACM Conf. Information and Knowledge Management (CIKM), pp. 1-8, 2001.
[20] W.M.G. Dummy, “Introduction to Statistics for Bioinformatics, with Applications to Expression Microarrays,” CSB Tutorial Sessions, 2003.
[21] A. Guttman, “R-Trees: A Dynamic Index Structure for Spatial Searching,” Proc. ACM SIGMOD Int'l Conf. Management of Data, pp. 47-57, 1984.
[22] F. Korn, H.V. Jagadish, and C. Faloutsos, “Efficiently Supporting Ad Hoc Queries in Large Datasets of Time Sequences,“ Proc. ACM-SIGMOD Int'l Conf. Management of Data, pp. 289-300, 1997.
[23] P. Kostiuk, M. Adams, D. Allinger, G. Rosch, and J. Kuchar, “An Integrated Safety Analysis Methodology for Emerging Air Transport Technologies,” NASA/CR-1998-207661, p. 66, Apr. 1998.
[24] LEDAS, Leicester Database and Archive Service, http:/, 2004.
[25] P.O.B.M.B. Eisen, P.T. Spellman, and D. Botstein, “Cluster Analysis and Display of Genome-Wide Expression Patterns,” Proc. Nat'l Aacademy of Sciences of the US., no. 95, pp. 14863-14868, Dec. 1998.
[26] B.S. Manjunath, “Airphoto Dataset,” http://vivaldi.ece.ucsb. edu/Manjunathresearch.htm , 2000.
[27] Z.D. Michael, W. Berry, and E.R. Jessup, “Matrices, Vector Spaces and Information Retrieval,” SIAM Rev., vol. 41, no. 2, pp. 335-362, 1999.
[28] J. Nievergelt, H. Hans, and K.C. Sevcik, “The Grid File: An Adaptable, Symmetric Multikey File Structure,” ACM Trans. Database Systems, vol. 9, no. 1, pp. 38-71, 1984.
[29] D.K.P.E. Trahanias and A.N. Venetsanopoulos, “Directional Processing of Color Images: Theory and Experimental Results,” IEEE Trans. Image Processing, vol. 5, no. 6, pp. 868-880, Oct. 1996.
[30] G. Qian, S. Sural, Y. Gu, and S. Pramanik, “Similarity between Euclidean and Cosine Angle Distance for Nearest Neighbor Queries,” Proc. ACM Symp. Applied Computing, pp. 1232-1237, 2004.
[31] A. Razdan, “Wavelet Correlation Coefficient of Strongly Correlated Financial Time Series,” ArXiv:Cond-Mat, no. 2, Oct. 2003.
[32] Y. Sakurai, M. Yoshikawa, S. Uemura, and H. Kojima, “The A-Tree: An Index Structure for High-Dimensional Spaces Using Relative Approximation,” Proc. 26th Int'l Conf. Very Large Data Bases, pp. 516-526, 2000.
[33] G. Salton and M.J. McGill, Introduction to Modern Information Retrieval. New York: McGraw-Hill, 1983.
[34] A. Singhal, “Modern Information Retrieval: A Brief Overview,” IEEE Data Eng. Bull., vol. 24, no. 4, pp. 35-43, 2001.
[35] V. Subrahmanian, Principles of Multimedia Database Systems. San Francisco: Morgan Kaufmann, 1999.
[36] A.S. Szalay, J. Gray, and A.R. Thakar, “The sdss Skyserver: Public Access to the Sloan Digital Sky Server Data,” Proc. ACM SIGMOD Int'l Conf. Management of Data, pp. 570-581, 2002.
[37] C. Tang, Z. Xu, and M. Mahalingam, “pSearch: Information Retrieval in Structured Overlays,” Proc. ACM HotNets-I, Oct. 2002.
[38] P.E. Trahanias and A.N. Venetsanopoulos, “Vector Directional Filters: A New Class of Multichannel Image Processing Filters,” IEEE Trans. Image Processing, vol. 2, no. 4, pp. 528-534, Oct. 1993.
[39] D.R. Warn, “Lighting Controls for Synthetic Images,” Computer Graphics, vol. 17, no. 3, pp. 13-21, 1983.
[40] R. Weber, H.-J. Schek, and S. Blott, “A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces,” Proc. Int'l Conf. Very Large Data Bases, pp.194-205, Aug. 1998.
[41] A. Wicenec and M. Albrecht, “Methods for Structuring and Searching Very Large Catalogs,” ASP Conf. Ser., Astronomical Data Analysis Software and Systems VII, no. 145, p. 512, 1998.
[42] R. Wilkinson and P. Hingston, “Using Cosine Distance in a Neural Network for Document Retrieval,” Proc. ACM SIGIR Conf., pp.202-210, 1991.
[43] P. Zhang, Y. Huang, S. Shekhar, and V. Kumar, “Correlation Analysis of Spatial Time Series Datasets: A Filter-and-Refine Approach,” Proc. Seventh Pacific-Asia Conf. Knowledge Discovery and Data Mining, 2003.
[44] P. Zhang, Y. Huang, S. Shekhar, and V. Kumar, “Exploiting Spatial Autocorrelation to Efficiently Process Correlation-Based Similarity Queries,” Proc. Eighth Int'l Symp. Spatial and Temporal Databases, July 2003.

Index Terms:
Angular query, performance, indexing, angular similarity measures, high-dimensional data.
Tan Apaydin, Hakan Ferhatosmanoglu, "Access Structures for Angular Similarity Queries," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 11, pp. 1512-1525, Nov. 2006, doi:10.1109/TKDE.2006.165
Usage of this product signifies your acceptance of the Terms of Use.