Issue No.01 - Jan. (2014 vol.26)
pp: 208-220
Alberto Bartoli, University of Trieste, Trieste
Giorgio Davanzo, University of Trieste, Trieste
Eric Medvet, University of Trieste, Trieste
Enrico Sorio, University of Trieste, Trieste
ABSTRACT
Information extraction from printed documents is still a crucial problem in many interorganizational workflows. Solutions for other application domains, for example, the web, do not fit this peculiar scenario well, as printed documents do not carry any explicit structural or syntactical description. Moreover, printed documents usually lack any explicit indication about their source. We present a system, which we call PATO, for extracting predefined items from printed documents in a dynamic multisource scenario. PATO selects the source-specific wrapper required by each document, determines whether no suitable wrapper exists, and generates one when necessary. PATO assumes that the need for new source-specific wrappers is a part of normal system operation: new wrappers are generated online based on a few point-and-click operations performed by a human operator on a GUI. The role of operators is an integral part of the design and PATO may be configured to accommodate a broad range of automation levels. We show that PATO exhibits very good performance on a challenging data set composed of more than 600 printed documents drawn from three different application domains: invoices, datasheets of electronic components, and patents. We also perform an extensive analysis of the crucial tradeoff between accuracy and automation level.
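The wrapper choice step described in the abstract (select a source-specific wrapper, detect that none fits, and generate a new one with operator help) can be illustrated with a minimal sketch. This is not PATO's actual method: the Jaccard similarity over token sets, the `threshold` parameter, and the names `choose_wrapper`, `process`, and `ask_operator` are all assumptions made for illustration only.

```python
from typing import Callable, Dict, Optional, Set, Tuple

# Hypothetical wrapper representation: item name -> extraction rule.
Wrapper = Dict[str, str]


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Stand-in similarity between token sets (an assumption, not PATO's classifier)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def choose_wrapper(
    doc_tokens: Set[str],
    known_sources: Dict[str, Set[str]],
    threshold: float,
) -> Optional[str]:
    """Return the best-matching known source, or None if no source is similar enough."""
    best_source, best_score = None, 0.0
    for source, tokens in known_sources.items():
        score = jaccard(doc_tokens, tokens)
        if score > best_score:
            best_source, best_score = source, score
    return best_source if best_score >= threshold else None


def process(
    doc_tokens: Set[str],
    known_sources: Dict[str, Set[str]],
    wrappers: Dict[str, Wrapper],
    threshold: float,
    ask_operator: Callable[[Set[str]], Tuple[str, Wrapper]],
) -> Tuple[str, bool]:
    """Pick a wrapper for the document; fall back to the operator when none fits.

    Returns (source, operator_was_asked).
    """
    source = choose_wrapper(doc_tokens, known_sources, threshold)
    if source is None:
        # No suitable wrapper: the operator supplies a new one (in the paper,
        # through point-and-click operations on a GUI; here, a plain callback).
        source, wrapper = ask_operator(doc_tokens)
        known_sources[source] = set(doc_tokens)
        wrappers[source] = wrapper
        return source, True
    return source, False
```

In this sketch, raising `threshold` sends more documents to the operator and lowering it accepts more automatic matches, which is one simple way to picture the tradeoff between accuracy and automation level that the abstract mentions.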
INDEX TERMS
Information retrieval, Data mining, Humans, Automation, Accuracy, Graphical user interfaces, Patents, data entry, Document management, administrative data processing, business process automation, retrieval models, human-computer interaction
CITATION
Alberto Bartoli, Giorgio Davanzo, Eric Medvet, Enrico Sorio, "Semisupervised Wrapper Choice and Generation for Print-Oriented Documents", IEEE Transactions on Knowledge & Data Engineering, vol.26, no. 1, pp. 208-220, Jan. 2014, doi:10.1109/TKDE.2012.254