Issue No. 04 - April (2017 vol. 29)
ISSN: 1041-4347
pp: 727-742
Shuang Hao , Department of Computer Science, Tsinghua National Laboratory for Information Science and Technology (TNList), Tsinghua University, Beijing, China
Nan Tang , Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar
Guoliang Li , Department of Computer Science, Tsinghua National Laboratory for Information Science and Technology (TNList), Tsinghua University, Beijing, China
Jian He , Department of Computer Science, Tsinghua National Laboratory for Information Science and Technology (TNList), Tsinghua University, Beijing, China
Na Ta , Department of Computer Science, Tsinghua National Laboratory for Information Science and Technology (TNList), Tsinghua University, Beijing, China
Jianhua Feng , Department of Computer Science, Tsinghua National Laboratory for Information Science and Technology (TNList), Tsinghua University, Beijing, China
ABSTRACT
Integrity constraint based data repairing is an iterative process consisting of two parts: detect and group errors that violate given integrity constraints (ICs); and modify values inside each group such that the modified database satisfies those ICs. However, most existing automatic solutions treat the process of detecting and grouping errors straightforwardly (e.g., violations of functional dependencies using string equality), while putting more attention on heuristics of modifying values within each group. In this paper, we propose a revised semantics of violations and data consistency w.r.t. a set of ICs. The revised semantics relies on string similarities, in contrast to traditional methods that use syntactic error detection using string equality. Along with the revised semantics, we also propose a new cost model to quantify the cost of data repair by considering distances between strings. We show that the revised semantics provides a significant change for better detecting and grouping errors, which in turn improves both precision and recall of the following data repairing step. We prove that finding minimum-cost repairs in the new model is NP-hard, even for a single FD. We devise efficient algorithms to find approximate repairs. In addition, we develop indices and optimization techniques to improve the efficiency. Experiments show that our approach significantly outperforms existing automatic repair algorithms in both precision and recall.
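As a rough illustration of the contrast the abstract draws, the sketch below compares equality-based and similarity-based violation detection for a single FD, and computes a distance-based repair cost. It is a toy under stated assumptions, not the paper's definitions or algorithms: the function names, the sample `city -> zip` data, the threshold `tau`, and the rule "similar left-hand sides that are not identical on both sides count as a violation" are all hypothetical choices made here for illustration.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def equality_violations(rows, lhs, rhs):
    """Pairs that agree exactly on lhs but differ on rhs (classic FD check)."""
    return [(i, j)
            for (i, r), (j, s) in combinations(enumerate(rows), 2)
            if r[lhs] == s[lhs] and r[rhs] != s[rhs]]


def similarity_violations(rows, lhs, rhs, tau):
    """Assumed rule: lhs values within edit distance tau, yet the two
    tuples are not identical on lhs and rhs together."""
    out = []
    for (i, r), (j, s) in combinations(enumerate(rows), 2):
        close = edit_distance(r[lhs], s[lhs]) <= tau
        differ = r[lhs] != s[lhs] or r[rhs] != s[rhs]
        if close and differ:
            out.append((i, j))
    return out


def repair_cost(original, repaired, attrs):
    """Assumed cost: sum of edit distances between old and new cell values."""
    return sum(edit_distance(o[a], p[a])
               for o, p in zip(original, repaired)
               for a in attrs)


# Hypothetical data for the FD city -> zip; row 2 holds a misspelled city.
rows = [
    {"city": "Boston", "zip": "02101"},
    {"city": "Boston", "zip": "02101"},
    {"city": "Bostn", "zip": "02199"},
    {"city": "Chicago", "zip": "60601"},
]
repaired = [dict(r) for r in rows]
repaired[2] = {"city": "Boston", "zip": "02101"}

print(equality_violations(rows, "city", "zip"))         # []
print(similarity_violations(rows, "city", "zip", 1))    # [(0, 2), (1, 2)]
print(repair_cost(rows, repaired, ["city", "zip"]))     # 3
```

With string equality the misspelled tuple is never compared against the correct ones, so no violation is reported. The similarity-based check groups it with the two "Boston" tuples, and the cost function then charges the candidate repair by how far each cell value moves.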
INDEX TERMS
Maintenance engineering, Urban areas, Semantics, Integrated circuits, Databases, Education, Fault tolerance, maximal independent set, Data repairing, functional dependencies, fault-tolerant violation, graph model
CITATION
Shuang Hao, Nan Tang, Guoliang Li, Jian He, Na Ta, Jianhua Feng, "A Novel Cost-Based Model for Data Repairing", IEEE Transactions on Knowledge & Data Engineering, vol. 29, no. 4, pp. 727-742, April 2017, doi:10.1109/TKDE.2016.2637928