Issue No.04 - July/August (2010 vol.14)
pp: 23-31
Hanna Köpcke, University of Leipzig
Andreas Thor, University of Leipzig
Erhard Rahm, University of Leipzig
Entity matching is a key task for data integration and especially challenging for Web data. Effective entity matching typically requires combining several match techniques and finding suitable configuration parameters, such as similarity thresholds. The authors investigate to what degree machine learning helps semi-automatically determine suitable match strategies with a limited amount of manual training effort. They use a new framework, Fever, to evaluate several learning-based approaches for matching different sets of Web data entities. In particular, they study different approaches for training-data selection and how much training is needed to find effective combined match strategies and configurations.
Web data integration, entity matching, machine learning
Hanna Köpcke, Andreas Thor, Erhard Rahm, "Learning-Based Approaches for Matching Web Data Entities", IEEE Internet Computing, vol. 14, no. 4, pp. 23-31, July/August 2010, doi:10.1109/MIC.2010.58