Proceedings of 16th International Conference on Data Engineering (Cat. No.00CB37073) (2000)
San Diego, California
Feb. 28, 2000 to Mar. 3, 2000
Ling Liu, Georgia Institute of Technology
Calton Pu, Georgia Institute of Technology
Wei Han, Georgia Institute of Technology
This paper describes the methodology and the software development of XWRAP, an XML-enabled wrapper construction system for semi-automatic generation of wrapper programs. By XML-enabled we mean that the metadata about information content that are implicit in the original web pages will be extracted and encoded explicitly as XML tags in the wrapped documents. In addition, the query-based content filtering process is performed against the XML documents. The XWRAP wrapper generation framework has three distinct features. First, it explicitly separates tasks of building wrappers that are specific to a Web source from the tasks that are repetitive for any source, and uses a component library to provide basic building blocks for wrapper programs. Second, it provides a user-friendly interface program to allow wrapper developers to generate their wrapper code with a few mouse clicks. Third and most importantly, we introduce and develop a two-phase code generation framework. The first phase utilizes an interactive interface facility to encode the source-specific metadata knowledge identified by individual wrapper developers as declarative information extraction rules. The second phase combines the information extraction rules generated at the first phase with the XWRAP component library to construct an executable wrapper program for the given web source. We report the initial experiments on performance of the XWRAP code generation system and the wrapper programs generated by XWRAP.
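The two-phase idea in the abstract can be illustrated with a minimal Python sketch. It is not XWRAP's implementation or rule language: the rule format (`RULES`), the functions `build_wrapper` and `select`, and the sample page are all hypothetical, chosen only to show how source-specific declarative rules might be kept separate from generic, reusable wrapper code, and how the wrapped output could make implicit metadata explicit as XML tags.

```python
import re
import xml.etree.ElementTree as ET

# Phase 1 (assumed output format): declarative, source-specific extraction
# rules. In the paper these come from an interactive tool; here they are
# written by hand purely for illustration.
RULES = {
    "root": "report",
    "fields": [
        {"tag": "city", "pattern": r"<h1>(.*?)</h1>"},
        {"tag": "temperature", "pattern": r"<b>Temp:</b>\s*([^<]+)"},
    ],
}


def build_wrapper(rules):
    """Phase 2 (sketch): combine the rules with generic, source-independent
    code to obtain an executable wrapper for one web source."""
    compiled = [(f["tag"], re.compile(f["pattern"], re.S)) for f in rules["fields"]]

    def wrapper(html):
        # Encode each extracted value under an explicit XML tag.
        root = ET.Element(rules["root"])
        for tag, regex in compiled:
            match = regex.search(html)
            if match:
                ET.SubElement(root, tag).text = match.group(1).strip()
        return ET.tostring(root, encoding="unicode")

    return wrapper


def select(xml_text, tag):
    """Simple stand-in for query-based filtering over the wrapped XML:
    return the text of every element named `tag`."""
    return [el.text for el in ET.fromstring(xml_text).iter(tag)]


# Hypothetical source page and usage.
PAGE = "<html><h1>San Diego</h1><p><b>Temp:</b> 68F </p></html>"
wrap = build_wrapper(RULES)
wrapped = wrap(PAGE)
print(wrapped)
print(select(wrapped, "city"))
```

In this sketch only `RULES` changes from one source to another; `build_wrapper` and `select` play the role of the reusable building blocks, which is the separation of concerns the abstract emphasizes.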
Web data management, Information extraction, Wrapper generation system, XML
C. Pu, L. Liu and W. Han, "XWRAP: An XML-Enabled Wrapper Construction System for Web Information Sources," Proceedings of 16th International Conference on Data Engineering (Cat. No.00CB37073) (ICDE), San Diego, California, 2000, pp. 611.