A Fuzzy Logic Approach to Wrapping PDF Documents
December 2011 (vol. 23 no. 12)
pp. 1826-1841
Sergio Flesca, University of Calabria, Rende
Elio Masciari, ICAR Institute of National Research Council, Rende
Andrea Tagarelli, University of Calabria, Rende
The PDF format represents the de facto standard for print-oriented documents. In this paper, we address the problem of wrapping PDF documents, which raises new challenges in several contexts of text data management. Our proposal is based on a novel bottom-up hierarchical wrapping approach that exploits fuzzy logic to handle the “uncertainty” which is intrinsic to the structure and presentation of PDF documents. A PDF wrapper is defined by specifying a set of group type definitions that impose a target structure on groups of tokens containing the required information. Constraints on token groupings are formulated as fuzzy conditions, which are defined on spatial and content predicates of tokens. We define a formal semantics for PDF wrappers and propose an algorithm for wrapper evaluation working in polynomial time with respect to the size of a PDF document. The proposed approach has been implemented in a wrapper generation system that offers visual capabilities to assist the designer in specifying and evaluating a PDF wrapper. Experimental results have shown good accuracy and applicability of our system to PDF documents of various domains.
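
To make the notion of a fuzzy condition over spatial and content predicates more concrete, the following is a minimal illustrative sketch and not the formal semantics defined in the paper. The `Token` fields, the predicate names (`same_line`, `looks_numeric`), the tolerance value, and the use of the minimum as conjunction (the standard Zadeh operator) are all assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    text: str
    x: float  # left edge of the token's bounding box (hypothetical units)
    y: float  # vertical position of the token on the page


def same_line(a: Token, b: Token, tolerance: float = 4.0) -> float:
    """Spatial predicate: degree in [0, 1] to which two tokens share a line."""
    return max(0.0, 1.0 - abs(a.y - b.y) / tolerance)


def looks_numeric(t: Token) -> float:
    """Content predicate: fraction of characters that are digits or separators."""
    chars = [c for c in t.text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c.isdigit() or c in ".," for c in chars) / len(chars)


def fuzzy_and(*degrees: float) -> float:
    """Zadeh conjunction: the minimum of the membership degrees."""
    return min(degrees)


def label_value_degree(label: Token, value: Token) -> float:
    """Fuzzy condition: value lies on the label's line, to its right, and is numeric."""
    right_of = 1.0 if value.x > label.x else 0.0
    return fuzzy_and(same_line(label, value), right_of, looks_numeric(value))


label = Token("Total:", x=50.0, y=700.0)
amount = Token("1,250.00", x=120.0, y=701.0)
currency = Token("EUR", x=120.0, y=701.0)

degree_amount = label_value_degree(label, amount)      # slightly misaligned, numeric
degree_currency = label_value_degree(label, currency)  # same position, not numeric
```

In this sketch, a token pair would be grouped only when its degree exceeds a designer-chosen threshold, so a small vertical misalignment lowers the degree without rejecting the pair outright.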


Index Terms:
Information extraction, fuzzy logic, wrapping, Adobe PDF, print-oriented documents, PDFWrap system.
Sergio Flesca, Elio Masciari, Andrea Tagarelli, "A Fuzzy Logic Approach to Wrapping PDF Documents," IEEE Transactions on Knowledge and Data Engineering, vol. 23, no. 12, pp. 1826-1841, Dec. 2011, doi:10.1109/TKDE.2010.220