Long Beach, CA, USA
Mar. 1, 2010 to Mar. 6, 2010
Wook-Shin Han, Department of Computer Engineering, Kyungpook National University, Korea
Wooseong Kwak, Department of Computer Engineering, Kyungpook National University, Korea
Hwanjo Yu, Department of Computer Science and Engineering, POSTECH, Korea
Commercial tuple extraction systems have had some success extracting tuples by regarding HTML pages as tree structures and using XPath queries to find the attributes of tuples in those pages. However, such systems are vulnerable to small changes to the web pages. In this paper, we propose a robust tuple extraction system that uses spatial relationships among elements rather than the XPath queries of the elements. Our system regards elements in the rendered page as spatial objects in 2-D space and executes spatial joins to extract target elements. Since humans also identify an element in a web page by its relative spatial location, our system, which extracts elements by their spatial relationships, could possibly be as robust as manual extraction and is far more robust than existing tuple extraction systems.
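To make the idea of a spatial join over rendered elements concrete, the following is a minimal illustrative sketch, not the system described in the paper. It assumes each rendered element has already been reduced to a bounding box (for instance, as reported by a browser); the `Box` class, the `right_of_join` function, the `max_gap` threshold, and the sample coordinates are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Box:
    """Hypothetical bounding box of a rendered element, in pixels."""
    text: str
    left: float
    top: float
    width: float
    height: float

    @property
    def right(self) -> float:
        return self.left + self.width

    @property
    def bottom(self) -> float:
        return self.top + self.height


def vertically_overlaps(a: Box, b: Box) -> bool:
    """True if the two boxes share some vertical extent (same visual row)."""
    return a.top < b.bottom and b.top < a.bottom


def right_of_join(anchors: List[Box], candidates: List[Box],
                  max_gap: float = 200.0) -> List[Tuple[str, str]]:
    """Pair each anchor with the nearest candidate to its right on the same row.

    This is a naive nested-loop spatial join (an assumption for brevity);
    a real system would likely use a spatial index.
    """
    pairs = []
    for a in anchors:
        best: Optional[Box] = None
        for c in candidates:
            gap = c.left - a.right
            if 0 <= gap <= max_gap and vertically_overlaps(a, c):
                if best is None or gap < best.left - a.right:
                    best = c
        if best is not None:
            pairs.append((a.text, best.text))
    return pairs


# Made-up layout: two label/value rows plus an unrelated banner.
labels = [Box("Title:", 10, 60, 50, 20), Box("Price:", 10, 100, 50, 20)]
values = [
    Box("Some Book", 70, 60, 120, 20),
    Box("$19.99", 70, 100, 60, 20),
    Box("Ad banner", 300, 10, 100, 40),
]
extracted = right_of_join(labels, values)
```

Because the predicate depends only on relative position, the same pairs would be produced if every box were shifted by a constant offset, which is the kind of small page change that can break a fixed XPath.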
Wook-Shin Han, Wooseong Kwak, Hwanjo Yu, "On supporting effective web extraction", IEEE International Conference on Data Engineering (ICDE), 2010, pp. 773-775, doi:10.1109/ICDE.2010.5447932