2016 IEEE 32nd International Conference on Data Engineering (ICDE) (2016)
May 16, 2016 to May 20, 2016
Stefano Ortona, Department of Computer Science, Oxford University, United Kingdom
Giorgio Orsi, School of Computer Science, University of Birmingham, United Kingdom
Tim Furche, Department of Computer Science, Oxford University, United Kingdom
Marcello Buoncristiano, Dipartimento di Matematica, Informatica ed Economia, Università della Basilicata, Italy
Automated web scraping is a popular means of acquiring data from the web. Scrapers (or wrappers) are derived from either manually or automatically annotated examples, often resulting in under- or over-segmented data, together with missing or spurious content. Automatic repair and maintenance of the extracted data is thus a necessary complement to automatic wrapper generation. Moreover, the extracted data is often the result of a long-term data acquisition effort, and thus jointly repairing wrappers together with the generated data reduces future needs for data cleaning. We study the problem of computing joint repairs for XPath-based wrappers and their extracted data. We show that the problem is NP-complete in general but becomes tractable under a few natural assumptions. Even tractable solutions to the problem are still impractical on very large datasets, but we propose an optimal approximation that proves effective across a wide variety of domains and sources. Our approach relies on encoded domain knowledge, but requires no per-source supervision. An evaluation spanning more than 100k web pages from 100 different sites across a wide variety of application domains shows that joint repairs are able to increase the quality of wrappers by between 15% and 60% independently of the wrapper generation system, eliminating all errors in more than 50% of the cases.
Maintenance engineering, Data mining, Data acquisition, Runtime, Cleaning, Web pages, Computer science
S. Ortona, G. Orsi, T. Furche and M. Buoncristiano, "Joint repairs for web wrappers," 2016 IEEE 32nd International Conference on Data Engineering (ICDE), Helsinki, Finland, 2016, pp. 1146-1157.