An Omnifont Open-Vocabulary OCR System for English and Arabic
June 1999 (vol. 21 no. 6)
pp. 495-504

Abstract—We present an omnifont, unlimited-vocabulary OCR system for English and Arabic. The system is based on Hidden Markov Models (HMM), an approach that has proven to be very successful in the area of automatic speech recognition. In this paper we focus on two aspects of the OCR system. First, we address the issue of how to perform OCR on omnifont and multi-style data, such as plain and italic, without the need to have a separate model for each style. The amount of training data from each style, which is used to train a single model, becomes an important issue in the face of the conditional independence assumption inherent in the use of HMMs. We demonstrate mathematically and empirically how to allocate training data among the different styles to alleviate this problem. Second, we show how to use a word-based HMM system to perform character recognition with unlimited vocabulary. The method includes the use of a trigram language model on character sequences. Using all these techniques, we have achieved character error rates of 1.1 percent on data from the University of Washington English Document Image Database and 3.3 percent on data from the DARPA Arabic OCR Corpus.
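
To make the "trigram language model on character sequences" concrete, here is a minimal sketch of what such a model might look like. It is not the authors' implementation: the abstract does not say how probabilities are estimated or smoothed, so the add-one smoothing, the `^` padding symbol, and the class name `CharTrigramLM` are all assumptions chosen for illustration.

```python
import math
from collections import Counter


class CharTrigramLM:
    """Character trigram model with add-one smoothing (illustrative assumption)."""

    def __init__(self, text, pad="^"):
        self.pad = pad
        padded = pad * 2 + text
        # Vocabulary is every symbol seen in training, including the pad symbol.
        self.vocab = set(padded)
        # Count each 3-character window and the 2-character context that starts it.
        self.tri = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
        self.bi = Counter(padded[i:i + 2] for i in range(len(padded) - 2))

    def prob(self, char, context):
        """P(char | two preceding characters), smoothed so unseen trigrams are nonzero."""
        numerator = self.tri[context + char] + 1
        denominator = self.bi[context] + len(self.vocab)
        return numerator / denominator

    def log_prob(self, sequence):
        """Log-probability of a character sequence under the model."""
        padded = self.pad * 2 + sequence
        return sum(
            math.log(self.prob(padded[i + 2], padded[i:i + 2]))
            for i in range(len(padded) - 2)
        )


lm = CharTrigramLM("the cat sat on the mat and the hat")
# A character order seen in training scores higher than a transposed one.
print(lm.log_prob("the") > lm.log_prob("teh"))
```

In a recognizer of the kind the abstract describes, scores like these would be combined with the HMM's character likelihoods so that plausible character sequences are preferred without restricting output to a fixed word list. How the two scores are weighted is not stated in the abstract.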

Index Terms:
Optical character recognition, speech recognition, Hidden Markov Models, Omnifont OCR, language modeling, Arabic OCR, segmentation-free recognition.
Issam Bazzi, Richard Schwartz, John Makhoul, "An Omnifont Open-Vocabulary OCR System for English and Arabic," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, no. 6, pp. 495-504, June 1999, doi:10.1109/34.771314