Issue No.05 - September/October (2006 vol.21)
Hamid Haidarian Shahri, University of Maryland
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/MIS.2006.90
Approximate duplicate elimination is an important data-integration task, but it is difficult because it requires complex comparisons among many records in the presence of uncertainty and ambiguity. Earlier approaches required the tedious, time-consuming hard-coding of static rules based on a schema. A novel duplicate-elimination framework now lets users clean data flexibly, without any coding. Fuzzy inference inherently handles the problem's uncertainty, and machine learning capabilities let the framework adapt to the specific notion of similarity appropriate for each domain. The framework is extensible and accommodating, letting the user operate with or without training data. Additionally, many of the previous methods for duplicate elimination can be implemented quickly using this framework.
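To make the idea of fuzzy inference over record comparisons concrete, here is a minimal sketch in Python. It is not the framework described in the article: the field names (`name`, `address`), the membership function `high`, its breakpoints (0.5 and 0.9), the three rules, and the function `duplicate_degree` are all illustrative assumptions, and `difflib.SequenceMatcher` stands in for whatever field comparator a real system would use. The breakpoints are fixed by hand here, whereas the abstract indicates that the framework can adapt its notion of similarity from training data.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (stand-in comparator)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def high(x: float) -> float:
    """Membership in the fuzzy set 'high similarity'.

    Assumed shape: 0 below 0.5, rising linearly to 1 at 0.9.
    """
    return min(1.0, max(0.0, (x - 0.5) / 0.4))


def low(x: float) -> float:
    """Membership in 'low similarity', taken as the complement of 'high'."""
    return 1.0 - high(x)


def duplicate_degree(rec1: dict, rec2: dict) -> float:
    """Degree in [0, 1] to which two records are judged duplicates."""
    name = similarity(rec1["name"], rec2["name"])
    addr = similarity(rec1["address"], rec2["address"])

    # Hypothetical rule base; fuzzy AND is modeled with min.
    # Each rule pairs a firing strength with a crisp output level.
    rules = [
        (min(high(name), high(addr)), 1.0),  # both match  -> duplicate
        (min(high(name), low(addr)), 0.5),   # name only   -> uncertain
        (low(name), 0.0),                    # name differs -> distinct
    ]

    # Defuzzify with a weighted average of the rule outputs.
    total = sum(strength for strength, _ in rules)
    if total == 0.0:
        return 0.5  # no rule fired; treat as undecided
    return sum(strength * out for strength, out in rules) / total


if __name__ == "__main__":
    a = {"name": "Alice Johnson", "address": "12 Main Street, Springfield"}
    b = {"name": "Alice Jonson", "address": "12 Main St, Springfield"}
    c = {"name": "Bob Lee", "address": "98 Harbor Road, Portland"}
    print(duplicate_degree(a, b))  # near-duplicate pair
    print(duplicate_degree(a, c))  # unrelated pair
```

A pair would then be merged or flagged when its degree exceeds a user-chosen threshold. Keeping the rules as data (a list of strength and output pairs) means that rules can be added or changed without touching the inference step.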
database applications, data mining, knowledge management applications, uncertainty, fuzzy and probabilistic reasoning, data warehouse and repository
Hamid Haidarian Shahri, "Eliminating Duplicates in Information Integration: An Adaptive, Extensible Framework", IEEE Intelligent Systems, vol. 21, no. 5, pp. 63-71, September/October 2006, doi:10.1109/MIS.2006.90