Prototype Extraction and Adaptive OCR
December 1999 (vol. 21 no. 12)
pp. 1280-1296

Abstract—To maintain OCR accuracy with decreasing quality of page image composition, production, and digitization, it is essential to tune the system to each document. We propose a prototype extraction method for document-specific OCR systems. The method automatically generates training samples from unsegmented text images and the corresponding transcripts. It is tolerant of transcription errors, so a transcript produced automatically by an imperfect omnifont OCR system can be used. The method is based on new algorithms for estimating character widths, character locations in a word, and match/nonmatch probabilities from unsegmented text. An experimental word recognition system is designed and developed to combine prototype extraction algorithms and segmentation-free word recognition. The system can adapt itself to different page images and achieve high recognition accuracy on heavily degraded print.

Index Terms:
Optical character recognition, adaptive classification, template matching, segmentation, document image analysis, text reader.
Citation:
Yihong Xu, George Nagy, "Prototype Extraction and Adaptive OCR," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, no. 12, pp. 1280-1296, Dec. 1999, doi:10.1109/34.817408