2017 IEEE 33rd International Conference on Data Engineering (2017)
San Diego, California, USA
April 19, 2017 to April 22, 2017
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/ICDE.2017.31
Integrity constraint (IC) based data repairing is typically an iterative process consisting of two parts: detecting and grouping errors that violate the given ICs, and modifying values inside each group such that the modified database satisfies those ICs. However, most existing automatic solutions treat the detection and grouping of errors straightforwardly (e.g., finding violations of functional dependencies using string equality), while paying more attention to heuristics for modifying values within each group. In this paper, we propose a revised semantics of violations and data consistency w.r.t. a set of ICs. The revised semantics relies on string similarities, in contrast to traditional methods that detect errors syntactically using string equality. Along with the revised semantics, we also propose a new cost model that quantifies the cost of data repairing by considering distances between strings. We show that the revised semantics significantly improves the detection and grouping of errors, which in turn improves both the precision and recall of the subsequent data repairing step. We prove that finding minimum-cost repairs in the new model is NP-hard, even for a single functional dependency (FD). We devise efficient algorithms to find approximate repairs.
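The contrast between equality-based and similarity-based violation detection can be illustrated with a small sketch. It is not the paper's algorithm: the abstract does not specify the similarity function, the threshold, or the exact cost formula, so the use of Levenshtein distance, the threshold `tau`, the function names `fd_violations` and `repair_cost`, and the sample rows are all assumptions made for illustration.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def fd_violations(rows, lhs, rhs, tau=0):
    """Return index pairs that violate the FD lhs -> rhs.

    Assumption: two rows "agree" on lhs when their values are within
    tau edits. With tau=0 this reduces to plain string equality.
    """
    pairs = []
    for (i, r), (j, s) in combinations(enumerate(rows), 2):
        if edit_distance(r[lhs], s[lhs]) <= tau and r[rhs] != s[rhs]:
            pairs.append((i, j))
    return pairs


def repair_cost(original, repaired, attrs):
    """Assumed cost: total edit distance between old and new cell values."""
    return sum(
        edit_distance(o[a], p[a])
        for o, p in zip(original, repaired)
        for a in attrs
    )


# Hypothetical data for the FD city -> state.
rows = [
    {"city": "San Diego", "state": "CA"},
    {"city": "San Deigo", "state": "CO"},  # typo in city, wrong state
    {"city": "Boston", "state": "MA"},
]

print(fd_violations(rows, "city", "state", tau=0))  # equality-based
print(fd_violations(rows, "city", "state", tau=2))  # similarity-based

repaired = [dict(rows[0]), {"city": "San Diego", "state": "CA"}, dict(rows[2])]
print(repair_cost(rows, repaired, ["city", "state"]))
```

Under string equality, the misspelled "San Deigo" never matches "San Diego", so the first two rows are not grouped and the wrong state goes undetected. With a small similarity threshold the two rows fall into the same group, and a distance-based cost favors repairs that change few characters.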
Maintenance engineering, Databases, Semantics, Data models, Approximation algorithms, Integrated circuits, Cleaning
S. Hao, N. Tang, G. Li, J. He, N. Ta and J. Feng, "A Novel Cost-Based Model for Data Repairing," 2017 IEEE 33rd International Conference on Data Engineering (ICDE), San Diego, California, USA, 2017, pp. 49-50.