An Optimization Methodology for Document Structure Extraction on Latin Character Documents
July 2001 (vol. 23 no. 7)
pp. 719-734

Abstract: In this paper, we give a formal definition of a document image structure representation and formulate document image structure extraction as a partitioning problem: finding an optimal partition of the set of glyphs of an input document image into a hierarchical tree structure in which the entities at each level of the hierarchy have similar physical properties and compatible semantic labels. We present a unified methodology that is applicable to the construction of document structures at different hierarchical levels. An iterative, relaxation-like method is used to find a partitioning solution that maximizes the probability of the extracted structure. All the probabilities used in the partitioning process are estimated from an extensive training set of various kinds of measurements among the entities within the hierarchy. The probabilities estimated offline during training then drive all decisions in the online document structure extraction. We have implemented a text line extraction algorithm using this framework. The algorithm was evaluated on the UW-III database of some 1,600 scanned document image pages. An area-overlap measure is used to find the correspondence between the detected entities and the ground truth. Of a total of 105,020 text lines, the text line extraction algorithm identifies and segments 104,773 correctly, an accuracy of 99.76 percent. The details of the algorithm are presented in this paper.
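
The area-overlap evaluation mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the bounding-box representation, the function names (`area`, `overlap_area`, `match_lines`, `accuracy`), and the 0.9 overlap threshold are all assumptions made for illustration, since the abstract does not state the exact matching criterion.

```python
from typing import List, Optional, Tuple

# Hypothetical representation: an axis-aligned box (x0, y0, x1, y1).
Box = Tuple[float, float, float, float]


def area(box: Box) -> float:
    """Area of a box; degenerate boxes have area 0."""
    x0, y0, x1, y1 = box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def overlap_area(a: Box, b: Box) -> float:
    """Area of the intersection of two boxes (0 if they do not overlap)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return w * h if w > 0 and h > 0 else 0.0


def match_lines(detected: List[Box], truth: List[Box],
                threshold: float = 0.9) -> List[Optional[int]]:
    """For each ground-truth line, return the index of a detected line whose
    overlap covers at least `threshold` of BOTH boxes, or None.
    The threshold value is an assumption, not taken from the paper."""
    matches: List[Optional[int]] = []
    for g in truth:
        found = None
        for i, d in enumerate(detected):
            inter = overlap_area(g, d)
            if (inter > 0 and inter >= threshold * area(g)
                    and inter >= threshold * area(d)):
                found = i
                break
        matches.append(found)
    return matches


def accuracy(matches: List[Optional[int]]) -> float:
    """Fraction of ground-truth lines that found a matching detection."""
    return sum(m is not None for m in matches) / len(matches)
```

With this definition, the reported figure follows directly from the counts in the abstract: 104,773 / 105,020 ≈ 0.9976, that is, 99.76 percent.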

Index Terms:
Document image analysis, statistical pattern analysis, text line extraction, performance evaluation.
Citation:
Jisheng Liang, Ihsin T. Phillips, Robert M. Haralick, "An Optimization Methodology for Document Structure Extraction on Latin Character Documents," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 7, pp. 719-734, July 2001, doi:10.1109/34.935846