Script Recognition—A Review
December 2010 (vol. 32 no. 12)
pp. 2142-2161
Debashis Ghosh, Indian Institute of Technology Roorkee, Roorkee
Tulika Dube, Indian Institute of Management Ahmedabad, Ahmedabad
Adamane P. Shivaprasad, Sambhram Institute of Technology, Bangalore
A variety of scripts are used to write the world's languages. In a multiscript, multilingual environment, the script in which a document is written must be known before an appropriate character recognition and document analysis algorithm can be chosen. Several methods for automatic script identification have therefore been developed. They fall mainly into two broad categories: structure-based and visual-appearance-based techniques. This survey gives an overview of the script identification methodologies in each category. Methods for script identification in online data and video text are also presented. Research in this field remains relatively thin, and more work is needed, particularly for handwritten documents.

Index Terms:
Document analysis, optical character recognition, script identification, multiscript document.
Citation:
Debashis Ghosh, Tulika Dube, Adamane P. Shivaprasad, "Script Recognition—A Review," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 12, pp. 2142-2161, Dec. 2010, doi:10.1109/TPAMI.2010.30