Issue No.12 - Dec. (2012 vol.34)
pp: 2467-2480
Pingping Xiu, Microsoft Advertising R&D, Redmond, WA, USA
H. S. Baird, Dept. of Comput. Sci. & Eng., Lehigh Univ., Bethlehem, PA, USA
ABSTRACT
Whole-book recognition is a document image analysis strategy that operates on the complete set of a book's page images, using automatic adaptation to improve accuracy. We describe an algorithm that expects to be initialized with approximate iconic and linguistic models, derived from (generally errorful) OCR results and (generally imperfect) dictionaries, and then, guided entirely by evidence internal to the test set, corrects the models, which in turn yields higher recognition accuracy. The iconic model describes image formation and determines the behavior of a character-image classifier, and the linguistic model describes word-occurrence probabilities. Our algorithm detects “disagreements” between these two models by measuring cross entropy between 1) the posterior probability distribution of character classes (the recognition results from image classification alone) and 2) the posterior probability distribution of word classes (the recognition results from image classification combined with linguistic constraints). We show how disagreements can identify candidates for model corrections at both the character and word levels. Some model corrections will reduce the error rate over the whole book, and these can be identified by comparing model disagreements, summed across the whole book, before and after the correction is applied. Experiments on passages up to 180 pages long show that when a candidate model adaptation reduces whole-book disagreement, it is also likely to correct recognition errors. Also, the longer the passage operated on by the algorithm, the more reliable this adaptation policy becomes, and the lower the error rate achieved. The best results occur when the iconic and linguistic models mutually correct one another. We have observed recognition error rates driven down by nearly an order of magnitude, fully automatically and without supervision (or indeed any user intervention or interaction). Improvement is nearly monotonic, and asymptotic accuracy is stable, even over long runs. If implemented naively, the algorithm runs in time quadratic in the length of the book, but random subsampling and caching techniques speed it up by two orders of magnitude with negligible loss of accuracy. Whole-book recognition has potential applications in digital libraries as a safe unsupervised anytime algorithm.
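The disagreement measure and the accept-only-if-it-helps policy described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`cross_entropy`, `word_disagreement`, `book_disagreement`, `accept_correction`), the choice to marginalize the word-level posterior down to per-character distributions, the direction of the cross entropy, the smoothing floor `EPS`, and all toy probabilities are hypothetical choices made for the example. The optional `sample_size` argument shows one plausible way random subsampling could estimate the whole-book sum.

```python
import math
import random

EPS = 1e-12  # assumed floor that keeps log() finite when a class has zero mass


def cross_entropy(p, q):
    """H(p, q) = -sum_c p(c) * log q(c); p and q are dicts class -> probability."""
    return -sum(pc * math.log(max(q.get(c, 0.0), EPS))
                for c, pc in p.items() if pc > 0.0)


def word_disagreement(char_posteriors, word_posterior):
    """Disagreement for one word image (illustrative definition).

    char_posteriors: one dict per character position, from the iconic model alone.
    word_posterior:  dict word -> probability, iconic plus linguistic evidence.
    All candidate words are assumed to have len(char_posteriors) characters.
    """
    total = 0.0
    for i, iconic in enumerate(char_posteriors):
        # Marginalize the word-level posterior to character position i.
        marginal = {}
        for word, pw in word_posterior.items():
            marginal[word[i]] = marginal.get(word[i], 0.0) + pw
        # Assumed direction: how surprised the iconic model is by the
        # linguistically constrained result.
        total += cross_entropy(marginal, iconic)
    return total


def book_disagreement(words, sample_size=None, seed=0):
    """Sum of word disagreements, optionally estimated from a random subsample."""
    if sample_size is None or sample_size >= len(words):
        return sum(word_disagreement(c, w) for c, w in words)
    sample = random.Random(seed).sample(words, sample_size)
    scale = len(words) / sample_size  # rescale the sample sum to the whole book
    return scale * sum(word_disagreement(c, w) for c, w in sample)


def accept_correction(before, after):
    """Keep a candidate model correction only if whole-book disagreement drops."""
    return after < before


# Toy data: the same word image repeated four times stands in for a "book".
word_post = {"cat": 0.95, "eat": 0.05}
iconic_before = [{"c": 0.6, "e": 0.4}, {"a": 0.9, "o": 0.1}, {"t": 1.0}]
# Hypothetical effect of a candidate iconic-model correction on position 0.
iconic_after = [{"c": 0.9, "e": 0.1}, {"a": 0.9, "o": 0.1}, {"t": 1.0}]

book_before = [(iconic_before, word_post)] * 4
book_after = [(iconic_after, word_post)] * 4

before = book_disagreement(book_before)
after = book_disagreement(book_after)
print("accept correction:", accept_correction(before, after))
```

In this toy setting the word posterior is held fixed for simplicity; in a real system it would be recomputed after each model change, since it depends on the iconic model as well. Comparing summed disagreement before and after a candidate change, rather than per-word scores, mirrors the whole-book policy stated in the abstract.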
INDEX TERMS
probability, cache storage, digital libraries, document image processing, image classification, image sampling, optical character recognition, unsupervised anytime algorithm, whole-book recognition, document image analysis strategy, automatic adaptation, approximate iconic models, linguistic models, OCR results, recognition accuracy, image formation, character-image classifier, word-occurrence probabilities, posterior probability distribution, whole-book disagreement, adaptation policy, caching techniques, random subsampling, Adaptation models, Pragmatics, Image recognition, Character recognition, Optical character recognition software, Error analysis, Computational modeling, cross entropy, document image recognition, book recognition, style consistency, isogeny, adaptive classification, adaptive OCR, adaptive machine learning, model adaptation, anytime algorithm
CITATION
Pingping Xiu, H. S. Baird, "Whole-Book Recognition", IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 34, no. 12, pp. 2467-2480, Dec. 2012, doi:10.1109/TPAMI.2012.50