Issue No. 06 - November/December (2004 vol. 19)
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/MIS.2004.68
Bing Liu, University of Illinois at Chicago
Yanhong Zhai, University of Illinois at Chicago
Robert Grossman, University of Illinois at Chicago
Much information on the Web is contained in regularly structured objects, or data records. Data records often present their host pages' essential information, such as lists of products and services. Mining data records to extract this information can help you provide value-added services. Existing approaches to data extraction on the Web include supervised learning and automatic techniques. Supervised learning requires substantial human effort, and current automatic techniques provide poor results. To solve this problem, the MDR (mining data records) system exploits two key observations about the layout of data records in Web pages and employs a string-matching algorithm. Experiments show that this new automatic technique significantly outperforms existing methods. In addition, it mines both contiguous and noncontiguous data records.
data mining, Web mining, Web data extraction, Web data, databases
Bing Liu, Yanhong Zhai, Robert Grossman, "Mining Web Pages for Data Records", IEEE Intelligent Systems, vol. 19, no. 6, pp. 49-55, November/December 2004, doi:10.1109/MIS.2004.68