OLERA: Semisupervised Web-Data Extraction with Visual Support
November/December 2004 (vol. 19 no. 6)
pp. 56-64
DOI Bookmark:
http://doi.ieeecomputersociety.org/10.1109/MIS.2004.71
Extracting information from semistructured Web documents is an important task for many information agents. Over the past few years, researchers have developed an extensive family of generic information extraction (IE) techniques based on supervised approaches that learn extraction rules from user-labeled training examples. However, annotating training data can be expensive when thousands of data sources must be wrapped. OLERA, a semisupervised IE system, produces extraction rules without detailed annotation of the training documents. Instead, it takes as its example a rough segment that contains everything to be extracted for one record. OLERA also provides visualization support: it displays the discovered records in a spreadsheet-like table for schema assignment. Experiments show that OLERA performs well for program-generated Web pages with very few training pages and little user intervention.
Index Terms:
semistructured data, Web data extraction, multiple string alignment, rule generalization
Citation:
Chia-Hui Chang, Shih-Chien Kuo, "OLERA: Semisupervised Web-Data Extraction with Visual Support," IEEE Intelligent Systems, vol. 19, no. 6, pp. 56-64, Nov./Dec. 2004, doi:10.1109/MIS.2004.71