Determination of the Script and Language Content of Document Images
March 1997 (vol. 19 no. 3)
pp. 235-245

Abstract: Most document recognition work to date has been performed on English text. Because of the large overlap of the character sets found in English and major Western European languages such as French and German, some extensions of the basic English capability to those languages have taken place. However, automatic language identification prior to optical character recognition is not commonly available, although it would add utility to such systems.

Languages and their scripts have attributes that make it possible to determine the language of a document automatically. Detection of the values of these attributes requires the recognition of particular features of the document image and, in the case of languages using Latin-based symbols, the character syntax of the underlying language.

We have developed techniques for distinguishing which language is represented in an image of text. This work is restricted to a small but important subset of the world's languages. The method first classifies the script into two broad classes: Han-based and Latin-based. This classification is based on the spatial relationships of features related to the upward concavities in character structures. Language identification within the Han script class (Chinese, Japanese, Korean) is performed by analysis of the distribution of optical density in the text images. We handle 23 Latin-based languages using a technique based on character shape codes, a representation of Latin text that is inexpensive to compute.
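The character shape code idea can be illustrated with a minimal sketch. The abstract does not define the codes, so everything below is an assumption made for illustration: the five shape classes ("A", "x", "g", "i", "j"), the function names, and the overlap score are hypothetical, and the toy profiles are not data from the paper. The sketch also starts from already-decoded characters, whereas the method described above derives shape codes directly from the image, before any character recognition.

```python
from collections import Counter

# Hypothetical shape classes, assumed for illustration only:
#   "A" = reaches ascender height (capitals, digits, b d f h k l t)
#   "x" = confined to the x-height band
#   "g" = descends below the baseline
#   "i", "j" = kept distinct because of their detached dot
ASCENDERS = set("bdfhklt")
DESCENDERS = set("gpqy")


def shape_code(ch: str) -> str:
    """Map one character to its (assumed) coarse shape class."""
    if ch.isupper() or ch.isdigit() or ch in ASCENDERS:
        return "A"
    if ch in DESCENDERS:
        return "g"
    if ch in ("i", "j"):
        return ch
    if ch.isalpha():
        return "x"
    return ch  # anything else passes through unchanged


def word_shape_token(word: str) -> str:
    """Encode a word as a string of shape codes, ignoring punctuation."""
    return "".join(shape_code(c) for c in word if c.isalnum())


def shape_profile(text: str) -> Counter:
    """Count how often each word shape token occurs in a text."""
    tokens = (word_shape_token(w) for w in text.split())
    return Counter(t for t in tokens if t)


def guess_language(text: str, profiles: dict) -> str:
    """Pick the language whose profile shares the most tokens with the text.

    The overlap score is a stand-in chosen for brevity, not the
    classifier used in the paper.
    """
    sample = shape_profile(text)

    def overlap(profile: Counter) -> int:
        return sum(min(n, profile[t]) for t, n in sample.items())

    return max(profiles, key=lambda lang: overlap(profiles[lang]))


# Toy profiles built from single sentences, purely to exercise the code.
toy_profiles = {
    "english": shape_profile("the cat and the dog of the house"),
    "german": shape_profile("der hund und die katze in dem haus"),
}

print(word_shape_token("the"))  # AAx
print(guess_language("the end of the road", toy_profiles))  # english
```

Because many characters collapse into the same class, a shape token is far cheaper to obtain than a recognized word, yet frequent short words still leave language-specific patterns; in this sketch "the" becomes "AAx" while "der" and "dem" both become "Axx".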

Index Terms:
Multilingual, script classification, machine printed OCR, language classification, Han-based languages, Latin-based languages, Asian scripts.
Citation:
A. Lawrence Spitz, "Determination of the Script and Language Content of Document Images," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 3, pp. 235-245, March 1997, doi:10.1109/34.584100