Prototype Extraction and Adaptive OCR
December 1999 (vol. 21 no. 12)
pp. 1280-1296

Abstract—To maintain OCR accuracy with decreasing quality of page image composition, production, and digitization, it is essential to tune the system to each document. We propose a prototype extraction method for document-specific OCR systems. The method automatically generates training samples from unsegmented text images and the corresponding transcripts. It is tolerant of transcription errors, so a transcript produced automatically by an imperfect omnifont OCR system can be used. The method is based on new algorithms for estimating character widths, character locations in a word, and match/nonmatch probabilities from unsegmented text. An experimental word recognition system is designed and developed to combine prototype extraction algorithms and segmentation-free word recognition. The system can adapt itself to different page images and achieve high recognition accuracy on heavily degraded print.
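As a rough illustration of what estimating character widths from unsegmented text can involve, the sketch below fits per-character widths by ordinary least squares, assuming each word's measured pixel width is approximately the sum of the widths of its transcribed characters. This is a generic formulation under that stated assumption and not necessarily the algorithm proposed in the paper; the function names `estimate_char_widths` and `solve_linear`, and the sample words and widths, are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("character widths are not identifiable from these words")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for k in range(col, n + 1):
                    m[r][k] -= factor * m[col][k]
    return [m[i][n] / m[i][i] for i in range(n)]


def estimate_char_widths(words, word_widths):
    """Least-squares estimate of per-character widths.

    Assumption (ours): width(word) ~= sum of the widths of its characters.
    Builds the normal equations (A^T A) x = A^T b, where A[w][c] is the
    number of times character c occurs in word w and b holds word widths.
    """
    alphabet = sorted(set("".join(words)))
    index = {ch: i for i, ch in enumerate(alphabet)}
    n = len(alphabet)
    ata = [[0.0] * n for _ in range(n)]
    atb = [0.0] * n
    for word, width in zip(words, word_widths):
        counts = [0] * n
        for ch in word:
            counts[index[ch]] += 1
        for i in range(n):
            atb[i] += counts[i] * width
            for j in range(n):
                ata[i][j] += counts[i] * counts[j]
    return dict(zip(alphabet, solve_linear(ata, atb)))


# Hypothetical measurements: transcribed words and their pixel widths.
words = ["ab", "bc", "ca", "abc", "aab"]
word_widths = [22.0, 20.0, 18.0, 30.0, 32.0]
char_widths = estimate_char_widths(words, word_widths)
```

Because the fit pools evidence over many words, a small number of mistranscribed words shifts the estimates only slightly, which is one plausible reason a least-squares formulation suits noisy transcripts. A robust variant would additionally need to handle inter-character spacing and outlier words, which this sketch ignores.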

[1] H.S. Baird, “The Skew Angle of Printed Documents,” Proc. Conf. Photographic Scientists and Engineers, pp. 14-21, 1987.
[2] H.S. Baird and G. Nagy, “A Self-Correcting 100-Font Classifier,” Proc. SPIE, vol. 2181, pp. 106-115, 1994.
[3] R. Duda and P.E. Hart, Pattern Classification and Scene Analysis. John Wiley & Sons, 1973.
[4] R. Esposito, D. Malerba, and G. Semeraro, “An Experimental Page Layout Recognition System for Office Document Automatic Classification: An Integrated Approach for Inductive Generalization,” Proc. 10th Int'l Conf. Pattern Recognition (ICPR), pp. 557-562, 1990.
[5] A. El-Nasan, “InkLink—An Unconstrainted-Handwriting Recognition Engine,” technical report, DocLab, RPI, 1998.
[6] H.S. Heaps, Information Retrieval: Computational and Theoretical Aspects. New York: Academic Press, 1978.
[7] T. Hong and J. Hull, “Character Segmentation Using Visual Inter-Word Constraints in a Text Page,” Proc. SPIE, vol. 2422, pp. 15-25, 1995.
[8] R. Ingold, “Structure de Documents et Lecture Optique: Une Nouvelle Approche,” doctoral dissertation, Ecole Polytechnique Federale de Lausanne, Presses Polytechniques Romandes, Lausanne, Switzerland, 1994.
[9] G.E. Kopec and P.A. Chou, “Document Image Decoding Using Markov Source Models,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 16, no. 6, pp. 602-617, June 1994.
[10] G.E. Kopec and M. Lomelin, “Document-Specific Character Template Estimation,” Proc. SPIE, vol. 2660, pp. 14-26, 1996.
[11] G.E. Kopec, “Least-Squares Font Metric Estimation from Images,” IEEE Trans. Image Processing, vol. 2, no. 4, pp. 510-519, 1993.
[12] H. Kucera and W.N. Francis, Computational Analysis of Present-Day American English. Brown Univ. Press, 1967.
[13] C.L. Lawson and R.J. Hanson, Solving Least Squares Problems. Prentice Hall, 1974.
[14] G. Nagy and G.L. Shelton, “Self-Corrective Character Recognition System,” IEEE Trans. Information Theory, vol. 12, no. 2, pp. 215-222, Apr. 1966.
[15] G. Nagy and Y. Xu, “Priming the Recognizer,” Proc. IAPR Workshop Document Analysis Systems, pp. 263-281, 1996.
[16] G. Nagy and Y. Xu, “Automatic Prototype Extraction for Adaptive OCR,” Proc. Fourth Int'l Conf. Document Analysis and Recognition, pp. 278-282, 1997.
[17] G. Nagy and Y. Xu, “Bayesian Subsequence Matching and Segmentation,” Pattern Recognition Letters, vol. 18, pp. 1117-1124, 1997.
[18] S.V. Rice, F.R. Jenkins, and T.A. Nartker, “The Fifth Annual Test of OCR Accuracy,” UNLV Information Science Research Inst. 1996 Ann. Report, 1993-1996.
[19] T. Sziranyi and A. Boroczki, “Overall Picture Degradation Error for Scanned Images and the Efficiency of Character Recognition,” Optical Eng., vol. 30, no. 12, pp. 1878-1885, 1991.
[20] L. Spitz, “An OCR Based on Character Shape Codes and Lexical Information,” Proc. Third Int'l Conf. Document Analysis and Recognition, pp. 723-728, 1995.
[21] R. Valdes, “Finding String Distances,” Dr. Dobb's J., pp. 56-62, Apr. 1992.
[22] I.H. Witten, A. Moffat, and T.C. Bell, Managing Gigabytes: Compressing and Indexing Documents and Images. Van Nostrand Reinhold, 1994.

Index Terms:
Optical character recognition, adaptive classification, template matching, segmentation, document image analysis, text reader.
Yihong Xu, George Nagy, "Prototype Extraction and Adaptive OCR," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, no. 12, pp. 1280-1296, Dec. 1999, doi:10.1109/34.817408