Logical Structure Analysis and Generation for Structured Documents: A Syntactic Approach
September/October 2003 (vol. 15 no. 5)
pp. 1277-1294
Kyong-Ho Lee, IEEE Computer Society
Yoon-Chul Choy, IEEE Computer Society

Abstract: This paper presents a syntactic method for sophisticated logical structure analysis that transforms multipage document images with hierarchical structure into electronic documents based on SGML/XML. To produce a logical structure more accurately and quickly than previous approaches, whose basic units are text lines, the proposed parsing method takes hierarchically structured text regions as input. Furthermore, we define a document model that efficiently describes the geometric characteristics and logical structure of documents, and we present a method for creating it automatically. Experimental results on 372 images scanned from the IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) show that the method performs logical structure analysis successfully and generates a document model automatically. In particular, because the method generates SGML/XML documents as the result of structural analysis, it enhances the reusability and platform independence of documents.
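As a rough illustration of the final step the abstract mentions (emitting XML from a hierarchical logical structure), the following minimal sketch serializes a tree of labeled regions with Python's standard `xml.etree.ElementTree` module. The `Region` class, the `to_xml` function, and the element names (`article`, `title`, `section`, `heading`, `paragraph`) are assumptions made for this example only; they are not the paper's document model or parsing method.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List


@dataclass
class Region:
    """Hypothetical logical region: a label, optional text, and child regions."""
    label: str
    text: str = ""
    children: List["Region"] = field(default_factory=list)


def to_xml(region: Region) -> ET.Element:
    """Recursively convert a region tree into an XML element tree."""
    elem = ET.Element(region.label)
    if region.text:
        elem.text = region.text
    for child in region.children:
        elem.append(to_xml(child))
    return elem


# Illustrative logical structure, as it might look after analysis.
doc = Region("article", children=[
    Region("title", "Sample Title"),
    Region("section", children=[
        Region("heading", "1 Introduction"),
        Region("paragraph", "Body text of the first paragraph."),
    ]),
])

xml_string = ET.tostring(to_xml(doc), encoding="unicode")
print(xml_string)
```

The nesting of the output elements mirrors the nesting of the regions, which is what makes the result reusable by any XML-aware tool regardless of platform.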

Index Terms:
Logical structure analysis, document image understanding, structured documents, SGML, XML, a syntactic method.
Citation:
Kyong-Ho Lee, Yoon-Chul Choy, Sung-Bae Cho, "Logical Structure Analysis and Generation for Structured Documents: A Syntactic Approach," IEEE Transactions on Knowledge and Data Engineering, vol. 15, no. 5, pp. 1277-1294, Sept.-Oct. 2003, doi:10.1109/TKDE.2003.1232278