Issue No. 12 - Dec. (2016 vol. 28)
Zeyuan Shang, Department of Computer Science, Tsinghua University, Beijing, China
Yaxiao Liu, Department of Computer Science, Tsinghua University, Beijing, China
Guoliang Li, Department of Computer Science, Tsinghua University, Beijing, China
Jianhua Feng, Department of Computer Science, Tsinghua University, Beijing, China
Similarity join is a fundamental operation in data cleaning and integration. Existing similarity-join methods use string similarity to quantify relevance but neglect the knowledge behind the data, which plays an important role in understanding the data. Thanks to public knowledge bases, e.g., Freebase and Yago, we have an opportunity to use this knowledge to improve similarity join. To exploit this opportunity, we study knowledge-aware similarity join, which, given a knowledge hierarchy and two collections of objects (e.g., documents), finds all knowledge-aware similar object pairs. To the best of our knowledge, this is the first study on knowledge-aware similarity join. There are two main challenges. The first is how to quantify the knowledge-aware similarity. The second is how to efficiently identify the similar pairs. To address these challenges, we first propose a new similarity metric to quantify the knowledge-aware similarity using the knowledge hierarchy. We then devise a filter-and-verification framework to efficiently identify the similar pairs. We propose effective signature-based filtering techniques to prune large numbers of dissimilar pairs and develop efficient verification algorithms to verify the candidates that are not pruned in the filter step. Experimental results on real-world datasets show that our method significantly outperforms baseline algorithms in terms of both efficiency and effectiveness.
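The filter-and-verification idea described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not the paper's metric, signature scheme, or algorithm: the toy hierarchy `PARENT`, the element similarity (depth of the lowest common ancestor divided by the larger element depth), the object similarity (the smaller of the two one-way averages of best-match element similarities), the signature (the set of depth-1 ancestors), and the function name `knowledge_join` are all hypothetical. The sketch is a self-join over one collection of non-empty objects whose elements all appear in the hierarchy. Under these assumed definitions, two objects with no shared signature have similarity 0, so the filter never discards a pair that would pass a positive threshold.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy hierarchy (child -> parent); "root" has no parent.
PARENT = {
    "food": "root", "sport": "root",
    "fruit": "food", "pizza": "food",
    "apple": "fruit", "pear": "fruit",
    "tennis": "sport", "golf": "sport",
}

def path_to_root(node):
    """Return [node, parent, ..., 'root']."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def depth(node):
    """Number of edges from the node up to the root (root has depth 0)."""
    return len(path_to_root(node)) - 1

def element_sim(a, b):
    """Assumed metric: depth of the lowest common ancestor over the larger depth."""
    ancestors_a = set(path_to_root(a))
    # Walking up from b, the first node that is also an ancestor of a is the LCA.
    lca = next(n for n in path_to_root(b) if n in ancestors_a)
    deepest = max(depth(a), depth(b))
    return depth(lca) / deepest if deepest else 1.0

def object_sim(x, y):
    """Assumed metric: smaller of the two one-way best-match averages."""
    def one_way(s, t):
        return sum(max(element_sim(e, f) for f in t) for e in s) / len(s)
    return min(one_way(x, y), one_way(y, x))

def signature(obj):
    """Assumed signature: the depth-1 ancestors (top-level categories) of the elements."""
    return {path_to_root(e)[-2] for e in obj if e != "root"}

def knowledge_join(objects, threshold):
    """Self-join: return id pairs whose assumed similarity is >= threshold (> 0)."""
    # Filter step: index objects by signature; only objects that share a
    # signature can have non-zero similarity, so only they become candidates.
    index = defaultdict(set)
    for oid, obj in objects.items():
        for sig in signature(obj):
            index[sig].add(oid)
    candidates = set()
    for ids in index.values():
        candidates.update(combinations(sorted(ids), 2))
    # Verification step: compute the exact similarity for surviving candidates.
    return sorted(
        (a, b) for a, b in candidates
        if object_sim(objects[a], objects[b]) >= threshold
    )

# Hypothetical input: four small "documents" given as sets of hierarchy elements.
objects = {
    "d1": {"apple", "pizza"},
    "d2": {"pear", "pizza"},
    "d3": {"tennis"},
    "d4": {"golf", "apple"},
}
result = knowledge_join(objects, threshold=0.6)
print(result)
```

In this toy run the pairs (d1, d3) and (d2, d3) never reach verification because they share no top-level category, which is the kind of pruning a signature-based filter is meant to provide. A plain string-similarity join would also treat "apple" and "pear" as unrelated, whereas the hierarchy-based score gives them partial credit.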
Measurement, Time complexity, Cleaning, Knowledge based systems, Upper bound, Clustering algorithms, Databases
Zeyuan Shang, Yaxiao Liu, Guoliang Li, Jianhua Feng, "K-Join: Knowledge-Aware Similarity Join", IEEE Transactions on Knowledge & Data Engineering, vol. 28, no. 12, pp. 3293-3308, Dec. 2016, doi:10.1109/TKDE.2016.2601325