An Automatic Closed-Loop Methodology for Generating Character Groundtruth for Scanned Documents
February 1999 (vol. 21 no. 2)
pp. 179-183

Abstract—Character groundtruth for real, scanned document images is crucial for evaluating the performance of OCR systems, training OCR algorithms, and validating document degradation models. Unfortunately, manual collection of accurate groundtruth for characters in a real (scanned) document image is not practical because (i) the accuracy with which character bounding boxes can be delineated by hand is not high enough, (ii) the task is extremely laborious and time-consuming, and (iii) the manual labor required is prohibitively expensive. In this paper we describe a closed-loop methodology for collecting very accurate groundtruth for scanned documents. We first create ideal documents using a typesetting language. Next we create the groundtruth for the ideal document. The ideal document is then printed, photocopied, and scanned. A registration algorithm estimates the global geometric transformation and then performs a robust local bitmap match to register the ideal document image to the scanned document image. Finally, the groundtruth associated with the ideal document image is transformed using the estimated geometric transformation to create the groundtruth for the scanned document image. This methodology is very general and can be used to create groundtruth for documents typeset in any language, layout, font, and style. We have demonstrated the method by generating groundtruth for English, Hindi, and FAX document images. The cost of creating groundtruth using our methodology is minimal. If character, word, or zone groundtruth is available for any real document, the registration algorithm can be used to generate the corresponding groundtruth for a rescanned version of the document.
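
The final step of the pipeline, mapping ideal-document bounding boxes into scanned-image coordinates, can be illustrated with a minimal Python sketch. It rests on several assumptions that are not taken from the abstract: the global transformation is modeled as a plain affine map (the abstract does not name the transformation model), it is estimated from exactly three hand-picked control-point correspondences, and the robust local bitmap match is omitted entirely. The names `estimate_affine` and `transform_box` are hypothetical and used only for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
# Affine parameters (a, b, c, d, e, f): x' = a*x + b*y + c, y' = d*x + e*y + f
Affine = Tuple[float, float, float, float, float, float]


def det3(m: Sequence[Sequence[float]]) -> float:
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def solve3(m: Sequence[Sequence[float]], rhs: Sequence[float]) -> List[float]:
    """Solve the 3x3 linear system m * u = rhs with Cramer's rule."""
    d = det3(m)
    if abs(d) < 1e-12:
        raise ValueError("control points are collinear; transform is undetermined")
    solution = []
    for col in range(3):
        # Replace one column of m with the right-hand side.
        replaced = [[rhs[r] if c == col else m[r][c] for c in range(3)]
                    for r in range(3)]
        solution.append(det3(replaced) / d)
    return solution


def estimate_affine(ideal: Sequence[Point], scanned: Sequence[Point]) -> Affine:
    """Assumed model: affine map fitted exactly to three point correspondences."""
    if len(ideal) != 3 or len(scanned) != 3:
        raise ValueError("this sketch needs exactly three correspondences")
    m = [[x, y, 1.0] for x, y in ideal]
    a, b, c = solve3(m, [p[0] for p in scanned])
    d, e, f = solve3(m, [p[1] for p in scanned])
    return (a, b, c, d, e, f)


def transform_box(box: Box, t: Affine) -> Box:
    """Map the four box corners and return their axis-aligned bounding box."""
    a, b, c, d, e, f = t
    x0, y0, x1, y1 = box
    corners = [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]
    xs = [a * x + b * y + c for x, y in corners]
    ys = [d * x + e * y + f for x, y in corners]
    return (min(xs), min(ys), max(xs), max(ys))


# Made-up example: the scan doubles the scale and shifts the page by (10, 20).
ideal_pts = [(0.0, 0.0), (100.0, 0.0), (0.0, 100.0)]
scanned_pts = [(10.0, 20.0), (210.0, 20.0), (10.0, 220.0)]
t = estimate_affine(ideal_pts, scanned_pts)
scanned_box = transform_box((5.0, 5.0, 15.0, 25.0), t)
```

A practical system would fit the transformation to many correspondences in a least-squares or robust sense and then refine each character's position locally, which is the role the abstract assigns to the bitmap match.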

Index Terms:
Automatic real groundtruth, document image analysis, OCR, performance evaluation, image registration, geometric transformations, image warping.
Citation:
Tapas Kanungo, Robert M. Haralick, "An Automatic Closed-Loop Methodology for Generating Character Groundtruth for Scanned Documents," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, no. 2, pp. 179-183, Feb. 1999, doi:10.1109/34.748827