Issue No. 02 - March/April (2004 vol. 19)
Mong Li Lee , National University of Singapore
Wynne Hsu , National University of Singapore
Vijay Kothari , National University of Singapore
<p>Data quality problems can arise from abbreviations, data entry mistakes, duplicate records, missing fields, and many other sources. Data-cleaning research has focused on duplicate elimination, also known as the merge/purge problem. Another problem is a kind of erroneous data called spurious links, where a real-world entity has multiple record links, some of which might not be properly associated with it. One approach to this problem is to use context information to clean up the spurious links. This approach identifies and retrieves the data containing potential spurious links, then performs a context similarity comparison to find records whose contexts overlap heavily. The degree of overlapping context indicates the likelihood of spurious links. Experiments on three real-world data sets demonstrate that this approach can correctly identify spurious links and thus assist data cleaning.</p>
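<p>The context similarity comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how context is represented or scored, so the sketch assumes each record's context is a set of terms (for example, co-author names and topic words) and uses Jaccard overlap as a stand-in similarity measure. The function names and the sample data are hypothetical. How a score maps to a spurious-link decision is left open, since the abstract only says that the degree of overlap indicates the likelihood.</p>

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard overlap of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def context_overlaps(records):
    """Score every pair of records linked to the same entity.

    `records` maps a record id to its context, a set of terms.
    Returns {(id1, id2): overlap} with ids in sorted order.
    """
    return {
        (r1, r2): jaccard(records[r1], records[r2])
        for r1, r2 in combinations(sorted(records), 2)
    }


def best_overlap(records):
    """For each record, the highest overlap it shares with any other record."""
    best = {rid: 0.0 for rid in records}
    for (r1, r2), score in context_overlaps(records).items():
        best[r1] = max(best[r1], score)
        best[r2] = max(best[r2], score)
    return best


# Hypothetical records all linked to one entity name (assumed data).
links = {
    "rec1": {"A. Tan", "B. Lim", "data cleaning"},
    "rec2": {"A. Tan", "B. Lim", "data quality"},
    "rec3": {"C. Ong", "protein folding"},
}
scores = best_overlap(links)
```

<p>In this made-up example, rec1 and rec2 share two of four distinct context terms, while rec3 shares nothing with either, so its link is the one a reviewer would probably inspect first.</p>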
data cleaning, data quality problems, context information
M. L. Lee, V. Kothari and W. Hsu, "Cleaning the Spurious Links in Data," in IEEE Intelligent Systems, vol. 19, no. 2, pp. 28-33, 2004.