A Scalable Hybrid Approach for Extracting Head Components from Web Tables
February 2006 (vol. 18 no. 2)
pp. 174-187
We present a preprocessing method that determines whether a table is meaningful, as a prerequisite for extracting information from tables on the Internet. Tables are a valuable source for text mining because they present meaningful data in rows and columns. On the Internet, however, tables are used both to structure knowledge and to lay out documents. We therefore set out to determine whether a table is meaningful in the sense of carrying structural information at the abstraction level of the table head. To this end, we: 1) investigated the types of tables present in HTML documents, 2) established the features that distinguish meaningful tables from others, 3) constructed a training data set using these features after filtering out obviously decorative tables, and 4) constructed a classification model using a decision tree. Based on these features, we set up heuristics for extracting the table head from meaningful tables. We obtained an F-measure of 95.0 percent in distinguishing meaningful tables from decorative tables and an accuracy of 82.1 percent in extracting the table head from the meaningful tables.
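As a rough illustration of the pipeline outlined in the abstract, the following Python sketch collects a few structural features from an HTML table, applies a pre-filter for obviously decorative tables, and computes an F-measure from classification counts. The specific features (row count, cell count, `<th>` count, nesting), the pre-filter rule, and all function names are assumptions made for illustration; the abstract does not list the paper's actual features or heuristics. The balanced F-measure (harmonic mean of precision and recall) is also assumed, since the abstract does not say which variant was used.

```python
from html.parser import HTMLParser


class TableFeatures(HTMLParser):
    """Collect a few illustrative (hypothetical) features from one <table>."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # current <table> nesting depth
        self.rows = 0         # <tr> count in the outermost table
        self.cells = 0        # <td>/<th> count in the outermost table
        self.th_cells = 0     # <th> count in the outermost table
        self.nested = False   # True if a table appears inside the table

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.depth += 1
            if self.depth > 1:
                self.nested = True
        elif self.depth == 1:
            # Only count rows and cells that belong to the outermost table.
            if tag == "tr":
                self.rows += 1
            elif tag in ("td", "th"):
                self.cells += 1
                if tag == "th":
                    self.th_cells += 1

    def handle_endtag(self, tag):
        if tag == "table" and self.depth > 0:
            self.depth -= 1


def extract_features(html):
    """Parse the HTML of a single table and return its feature object."""
    parser = TableFeatures()
    parser.feed(html)
    parser.close()
    return parser


def is_obviously_decorative(features):
    """Assumed pre-filter: single-cell or nested tables count as layout."""
    return features.cells <= 1 or features.nested


def f_measure(tp, fp, fn):
    """Balanced F-measure from true positive, false positive, false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Tables that pass the pre-filter would then be described by their feature values and passed to a decision tree learner, which is the classification step the abstract names; that step is omitted here because the abstract gives no details about the training data.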

Index Terms: Text mining, information extraction, table mining.
Citation:
Sung-Won Jung, Hyuk-Chul Kwon, "A Scalable Hybrid Approach for Extracting Head Components from Web Tables," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 2, pp. 174-187, Feb. 2006, doi:10.1109/TKDE.2006.19