2010 IEEE 26th International Conference on Data Engineering (ICDE 2010)
Long Beach, CA, USA
Mar. 1, 2010 to Mar. 6, 2010
ISBN: 978-1-4244-5445-7
pp: 1193-1196
George Beskales, School of Computer Science, University of Waterloo, Canada
Mohamed A. Soliman, School of Computer Science, University of Waterloo, Canada
Ihab F. Ilyas, School of Computer Science, University of Waterloo, Canada
Shai Ben-David, School of Computer Science, University of Waterloo, Canada
Yubin Kim, School of Computer Science, University of Waterloo, Canada
ABSTRACT
One of the most prominent data quality problems is the existence of duplicate records. Current data cleaning systems usually produce one clean instance (repair) of the input data, by carefully choosing the parameters of the duplicate detection algorithms. Finding the right parameter settings can be hard, and in many cases, perfect settings do not exist. We propose ProbClean, a system that treats duplicate detection procedures as data processing tasks with uncertain outcomes. We use a novel uncertainty model that compactly encodes the space of possible repairs corresponding to different parameter settings. ProbClean efficiently supports relational queries and allows new types of queries against a set of possible repairs.
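The idea of keeping several possible repairs, one per parameter setting, can be illustrated with a toy sketch. This is not the ProbClean uncertainty model, which the abstract describes as a compact encoding; the sketch simply enumerates repairs. The similarity scores, the thresholds, the single-linkage merging rule, the uniform weighting of settings, and all function names below are assumptions made for illustration only.

```python
def cluster(n, sim, threshold):
    """Single-linkage clustering (assumed rule): merge pairs with sim >= threshold."""
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), score in sim.items():
        if score >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for record in range(n):
        groups.setdefault(find(record), set()).add(record)
    return frozenset(frozenset(g) for g in groups.values())


def possible_repairs(n, sim, thresholds):
    """One repair (a clustering of the records) per parameter setting."""
    return {t: cluster(n, sim, t) for t in thresholds}


def duplicate_probability(repairs, a, b):
    """Fraction of repairs in which a and b are merged (settings weighted uniformly)."""
    hits = sum(any(a in g and b in g for g in repair) for repair in repairs.values())
    return hits / len(repairs)


# Hypothetical pairwise similarity scores for four records (made-up values).
SIM = {(0, 1): 0.9, (1, 2): 0.6, (0, 2): 0.5,
       (0, 3): 0.1, (1, 3): 0.1, (2, 3): 0.1}
THRESHOLDS = [0.5, 0.7, 0.95]  # hypothetical parameter settings
REPAIRS = possible_repairs(4, SIM, THRESHOLDS)

for t, repair in REPAIRS.items():
    print(t, sorted(sorted(g) for g in repair))
print(duplicate_probability(REPAIRS, 0, 1))
```

A query such as "are records 0 and 1 duplicates?" then returns a probability rather than a yes/no answer that depends on one chosen threshold.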
CITATION

G. Beskales, I. F. Ilyas, S. Ben-David, M. A. Soliman and Y. Kim, "ProbClean: A probabilistic duplicate detection system," 2010 IEEE 26th International Conference on Data Engineering (ICDE 2010), Long Beach, CA, USA, 2010, pp. 1193-1196.
doi:10.1109/ICDE.2010.5447744