STAVIES: A System for Information Extraction from Unknown Web Data Sources through Automatic Web Wrapper Generation Using Clustering Techniques
December 2005 (vol. 17 no. 12)
pp. 1638-1652
A fully automated wrapper for information extraction from Web pages is presented. The motivation behind such systems lies in the emerging need to go beyond the concept of "human browsing." The World Wide Web is today the main repository for "all kinds of information" and has so far been very successful in disseminating information to humans. Automating the process of information retrieval enables further utilization by targeted applications. The key idea in our novel system is to exploit the format of Web pages to discover their underlying structure and, from it, to infer and extract pieces of information. Our system first identifies the section of the Web page that contains the information to be extracted and then extracts it by using clustering techniques and other statistical tools. STAVIES operates without human intervention and does not require any training. The main innovation and contribution of the proposed system consists of introducing a signal-wise treatment of the tag structural hierarchy and using hierarchical clustering techniques to segment the Web pages. This treatment is significant because it permits abstracting away from the raw tag-manipulating approach. Experimental results and comparisons with other state-of-the-art systems are presented and discussed in the paper, indicating the high performance of the proposed algorithm.
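
To make the idea of a signal-wise treatment of the tag hierarchy more concrete, the following is a minimal sketch and not the STAVIES algorithm itself. The abstract does not specify the signal or the clustering criterion, so everything here is an assumption for illustration: the signal is taken to be the tag nesting depth of each text node, the clustering is a naive single-linkage agglomeration of those depth values, and the deepest cluster is treated as the data region. The names `DepthSignal`, `single_linkage`, `extract_deepest_group`, and `max_gap` are hypothetical.

```python
from html.parser import HTMLParser

# Tags that never enclose content; they must not change the nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class DepthSignal(HTMLParser):
    """Record one (depth, text) sample for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.samples = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth = max(0, self.depth - 1)

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.samples.append((self.depth, text))


def single_linkage(values, max_gap):
    """Agglomerative clustering of 1-D values (assumed criterion).

    Starts with one cluster per distinct value and repeatedly merges the
    two closest clusters until the closest pair is more than max_gap apart.
    On sorted 1-D data the closest pair is always adjacent.
    """
    clusters = [[v] for v in sorted(set(values))]
    while len(clusters) > 1:
        gaps = [clusters[i + 1][0] - clusters[i][-1]
                for i in range(len(clusters) - 1)]
        i = min(range(len(gaps)), key=gaps.__getitem__)
        if gaps[i] > max_gap:
            break
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
    return clusters


def extract_deepest_group(html, max_gap=1):
    """Return the text nodes whose depth falls in the deepest cluster."""
    parser = DepthSignal()
    parser.feed(html)
    parser.close()
    if not parser.samples:
        return []
    clusters = single_linkage([d for d, _ in parser.samples], max_gap)
    deepest = set(clusters[-1])
    return [text for depth, text in parser.samples if depth in deepest]


if __name__ == "__main__":
    page = ("<html><body><h1>Shop</h1><table>"
            "<tr><td>Pen</td><td>1.20</td></tr>"
            "<tr><td>Ink</td><td>3.50</td></tr>"
            "</table></body></html>")
    print(extract_deepest_group(page))
```

On this toy page the heading sits at depth 3 and the table cells at depth 5, so the two depth levels remain separate clusters and only the cell contents are returned. Real pages would need a richer signal and a principled stopping rule, which the paper itself addresses.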

Index Terms: Automatic wrappers, generic wrappers, data source wrappers, Web mining, Web data extraction, Web structure mining, intelligent agents on the Web, resource discovery, information retrieval.
Nikolaos K. Papadakis, Dimitrios Skoutas, Konstantinos Raftopoulos, Theodora A. Varvarigou, "STAVIES: A System for Information Extraction from Unknown Web Data Sources through Automatic Web Wrapper Generation Using Clustering Techniques," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 12, pp. 1638-1652, Dec. 2005, doi:10.1109/TKDE.2005.203