Issue No. 12 - Dec. (2018 vol. 30)
ISSN: 1041-4347
pp: 2408-2420
Gokhan Kul, Department of Computer Science and Engineering, University at Buffalo, Buffalo, NY
Duc Thanh Anh Luong, Department of Computer Science and Engineering, University at Buffalo, Buffalo, NY
Ting Xie, Department of Computer Science and Engineering, University at Buffalo, Buffalo, NY
Varun Chandola, Department of Computer Science and Engineering, University at Buffalo, Buffalo, NY
Oliver Kennedy, Department of Computer Science and Engineering, University at Buffalo, Buffalo, NY
Shambhu Upadhyaya, Department of Computer Science and Engineering, University at Buffalo, Buffalo, NY
ABSTRACT
Database access logs are the starting point for many forms of database administration, including performance tuning, security auditing, and benchmark design. Unfortunately, query logs are also large and unwieldy, and it can be difficult for an analyst to extract broad patterns from the queries they contain. Clustering is a natural first step towards understanding massive query logs. However, many clustering methods rely on a notion of pairwise similarity, which is challenging to compute for SQL queries, especially when the underlying data and database schema are unavailable. We investigate the problem of computing similarity between queries, relying only on the query structure. We conduct a rigorous evaluation of three query similarity heuristics proposed in the literature, applied to query clustering on multiple query log datasets that represent different types of query workloads. To improve the accuracy of the three heuristics, we propose a generic feature engineering strategy that uses classical query rewrites to standardize query structure. The proposed strategy results in a significant improvement in the performance of all three similarity heuristics.
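To make the idea of structure-only similarity concrete, the following is a minimal illustrative sketch, not one of the three heuristics evaluated in the paper. It assumes a crude regex tokenizer, replaces literals with a placeholder as a simple stand-in for standardizing query structure, and compares queries with Jaccard similarity over token sets. The function names `structural_features`, `jaccard`, and `query_similarity` are hypothetical.

```python
import re


def structural_features(sql: str) -> frozenset:
    """Extract a crude set of structural tokens from a SQL string.

    Assumption: a regex tokenizer is enough for illustration; a real
    system would parse the query properly.
    """
    text = sql.lower()
    # Replace string and numeric literals with a placeholder so that
    # queries differing only in constants share the same features.
    text = re.sub(r"'[^']*'", "?", text)
    text = re.sub(r"\b\d+(\.\d+)?\b", "?", text)
    # Keep identifiers/keywords, the placeholder, and comparison operators.
    tokens = re.findall(r"[a-z_][a-z0-9_.]*|\?|<=|>=|<>|!=|[=<>*]", text)
    return frozenset(tokens)


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def query_similarity(q1: str, q2: str) -> float:
    """Pairwise similarity using only the query text, no schema or data."""
    return jaccard(structural_features(q1), structural_features(q2))


if __name__ == "__main__":
    q1 = "SELECT name FROM users WHERE id = 5"
    q2 = "select name from users where id = 42"
    q3 = "SELECT total FROM orders WHERE status = 'open'"
    print(query_similarity(q1, q2))  # same structure, different constant
    print(query_similarity(q1, q3))  # different table and columns
```

A pairwise score of this kind could then be fed to any clustering method that accepts a similarity or distance matrix. Because the feature set ignores token order and nesting, it is deliberately coarse.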
INDEX TERMS
Measurement, Security, Task analysis, Tuning, Benchmark testing, Indexes
CITATION

G. Kul, D. T. Luong, T. Xie, V. Chandola, O. Kennedy and S. Upadhyaya, "Similarity Metrics for SQL Query Clustering," in IEEE Transactions on Knowledge & Data Engineering, vol. 30, no. 12, pp. 2408-2420, 2018.
doi:10.1109/TKDE.2018.2831214