2010 IEEE 26th International Conference on Data Engineering (ICDE 2010)
ProbClean: A probabilistic duplicate detection system
Long Beach, CA, USA
March 1-6, 2010
ISBN: 978-1-4244-5445-7
George Beskales, School of Computer Science, University of Waterloo, Canada
Mohamed A. Soliman, School of Computer Science, University of Waterloo, Canada
Ihab F. Ilyas, School of Computer Science, University of Waterloo, Canada
Shai Ben-David, School of Computer Science, University of Waterloo, Canada
Yubin Kim, School of Computer Science, University of Waterloo, Canada
One of the most prominent data quality problems is the existence of duplicate records. Current data cleaning systems usually produce one clean instance (repair) of the input data, by carefully choosing the parameters of the duplicate detection algorithms. Finding the right parameter settings can be hard, and in many cases, perfect settings do not exist. We propose ProbClean, a system that treats duplicate detection procedures as data processing tasks with uncertain outcomes. We use a novel uncertainty model that compactly encodes the space of possible repairs corresponding to different parameter settings. ProbClean efficiently supports relational queries and allows new types of queries against a set of possible repairs.
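The idea of keeping every repair that some parameter setting would produce, instead of committing to one, can be illustrated with a small sketch. This is not the ProbClean implementation or its uncertainty model. The bigram Jaccard similarity, the single-linkage clustering, the uniform weighting of thresholds, and all function names (`similarity`, `repair`, `possible_repairs`, `same_entity_probability`) are assumptions made for illustration only.

```python
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Jaccard similarity over lowercase character bigrams (illustrative choice)."""
    def grams(s: str) -> set:
        s = s.lower()
        return {s[i:i + 2] for i in range(len(s) - 1)}

    ga, gb = grams(a), grams(b)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)


def repair(records, threshold):
    """One possible repair: single-linkage clusters of record indices whose
    pairwise similarity is at least `threshold` (hypothetical detector)."""
    parent = list(range(len(records)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        if similarity(records[i], records[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(i), []).append(i)
    return frozenset(frozenset(c) for c in clusters.values())


def possible_repairs(records, thresholds):
    """Map each distinct repair to the thresholds that produce it."""
    out = {}
    for t in thresholds:
        out.setdefault(repair(records, t), []).append(t)
    return out


def same_entity_probability(records, thresholds, i, j):
    """Fraction of thresholds under which records i and j are merged.
    Assumes every threshold is equally likely (an assumption of this sketch)."""
    hits = sum(
        any(i in c and j in c for c in repair(records, t))
        for t in thresholds
    )
    return hits / len(thresholds)


if __name__ == "__main__":
    records = ["John Smith", "john smith", "Mary Jones"]
    thresholds = [0.05, 0.5, 0.9]
    for rep, ts in possible_repairs(records, thresholds).items():
        print(sorted(sorted(c) for c in rep), "from thresholds", ts)
    print(same_entity_probability(records, thresholds, 0, 2))
```

Enumerating one repair per threshold, as done here, is the naive approach; the abstract's point is that a compact encoding avoids materializing each repair separately. The last function hints at the kind of query that becomes possible once the set of repairs is retained: asking how likely two records are to refer to the same entity across parameter settings.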
Citation:
George Beskales, Mohamed A. Soliman, Ihab F. Ilyas, Shai Ben-David, Yubin Kim, "ProbClean: A probabilistic duplicate detection system," 2010 IEEE 26th International Conference on Data Engineering (ICDE 2010), pp. 1193-1196, 2010.