Issue No.02 - February (2006 vol.18)
pp: 174-187
ABSTRACT
We present a preprocessing method that determines whether a table is meaningful, so that information can be extracted from tables on the Internet. A table is a valuable clue in text mining because it contains meaningful data arranged in rows and columns. However, tables on the Internet are used both to structure knowledge and to lay out documents. We therefore set out to determine whether a table is meaningful, in the sense of carrying the structural information provided at the abstraction level of the table head. Accordingly, we: 1) investigated the types of tables present in HTML documents, 2) established the features that distinguish meaningful tables from others, 3) constructed a training data set using the established features after filtering out obviously decorative tables, and 4) constructed a classification model using a decision tree. Based on these features, we set up heuristics for extracting the table head from meaningful tables. We obtained an F-measure of 95.0 percent in distinguishing meaningful tables from decorative tables and an accuracy of 82.1 percent in extracting the table head from the meaningful tables.
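For readers unfamiliar with the metric, the F-measure reported for the meaningful-versus-decorative classification combines precision and recall into a single score. The sketch below is illustrative only: it assumes the balanced form (beta = 1), which the abstract does not state, and the function name and the counts in the usage note are hypothetical rather than taken from the paper.

```python
def f_measure(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """Compute the F-measure from raw classification counts.

    tp: meaningful tables correctly labeled meaningful
    fp: decorative tables wrongly labeled meaningful
    fn: meaningful tables wrongly labeled decorative
    beta: weight of recall relative to precision (1.0 is an assumption here)
    """
    if tp == 0:
        # No true positives means precision and recall are both zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

As a usage example with made-up counts, `f_measure(90, 10, 30)` gives a precision of 0.90 and a recall of 0.75, which combine to a score of roughly 0.82.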
INDEX TERMS
Text mining, information extraction, table mining.
CITATION
Sung-Won Jung, Hyuk-Chul Kwon, "A Scalable Hybrid Approach for Extracting Head Components from Web Tables", IEEE Transactions on Knowledge & Data Engineering, vol.18, no. 2, pp. 174-187, February 2006, doi:10.1109/TKDE.2006.19