Issue No. 06 - November/December (2004 vol. 19)
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/MIS.2004.71
Chia-Hui Chang, National Central University, Taiwan
Shih-Chien Kuo, Trend Micro, Taiwan
Extracting information from semistructured Web documents is an important task for many information agents. Over the past few years, researchers have developed an extensive family of generic information extraction (IE) techniques based on supervised approaches that learn extraction rules from user-labeled training examples. However, annotating training data can be expensive when thousands of data sources must be wrapped. OLERA, a semisupervised IE system, produces extraction rules without detailed annotation of the training documents. Instead, it takes as its example a rough segment that contains everything to be extracted from one record. OLERA is designed with visualization support: it displays the discovered records in a spreadsheet-like table for schema assignment. Experiments show that OLERA performs well on program-generated Web pages with very few training pages and little user intervention.
semistructured data, Web data extraction, multiple string alignment, rule generalization
S. Kuo and C. Chang, "OLERA: Semisupervised Web-Data Extraction with Visual Support," in IEEE Intelligent Systems, vol. 19, no. 6, pp. 56-64, 2004.