Script and Language Identification in Noisy and Degraded Document Images
January 2008 (vol. 30 no. 1)
pp. 14-24
This paper reports a technique that identifies the scripts and languages of noisy and degraded document images. In the proposed technique, scripts and languages are identified through document vectorization, which converts each document image into a document vector that characterizes the shape and frequency of the character or word images it contains. Document images are vectorized using vertical component cuts and character extremum points, both of which are tolerant to variation in text fonts and styles, to noise, and to various types of document degradation. For each script or language under study, a script or language template is first constructed through a training process. The script and language of a document image are then determined from the distances between its document vector and the pre-constructed script and language templates. Experimental results show that the proposed technique is accurate, easy to extend, and tolerant to noise and various types of document degradation.
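The template-matching step described above can be illustrated with a minimal sketch. It assumes that each document has already been reduced to a sequence of discrete shape codes (for example, one code per character or word image), that a template is the mean of the normalized frequency vectors of a class's training documents, and that Euclidean distance is used for matching. The abstract does not specify any of these details, so the function names `vectorize`, `build_template`, and `identify` and the choice of distance are hypothetical, not the authors' method.

```python
import math
from collections import Counter


def vectorize(codes, vocabulary):
    """Turn a sequence of shape codes into a normalized frequency vector.

    `vocabulary` fixes the order of the vector's dimensions (an assumption;
    the paper's actual vector layout is not given in the abstract).
    """
    counts = Counter(codes)
    total = sum(counts[c] for c in vocabulary)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts[c] / total for c in vocabulary]


def build_template(training_docs, vocabulary):
    """Average the document vectors of one class's training documents."""
    vectors = [vectorize(doc, vocabulary) for doc in training_docs]
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def identify(doc_codes, templates, vocabulary):
    """Return the name of the template closest to the document's vector."""
    vec = vectorize(doc_codes, vocabulary)
    return min(templates, key=lambda name: euclidean(vec, templates[name]))


# Toy usage with made-up shape codes 1..3 and two hypothetical classes.
vocabulary = [1, 2, 3]
templates = {
    "A": build_template([[1, 1, 2], [1, 1, 1, 2]], vocabulary),
    "B": build_template([[3, 3, 2], [3, 3, 3, 2]], vocabulary),
}
label = identify([1, 1, 1, 3], templates, vocabulary)
```

Under these assumptions, supporting a new script or language only requires building one more template from its training documents, which is consistent with the abstract's claim that the technique is easy to extend.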

Index Terms:
Document analysis, shape, script identification, language identification, clustering, classification, association rules
Citation:
Lu Shijian, Chew Lim Tan, "Script and Language Identification in Noisy and Degraded Document Images," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 1, pp. 14-24, Jan. 2008, doi:10.1109/TPAMI.2007.1158