Identification of Latin-Based Languages through Character Stroke Categorization
Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), Vol. 1
Curitiba, Parana, Brazil, September 23-26, 2007
ISBN: 0-7695-2822-8
This paper presents a language identification technique that detects the Latin-based language of imaged documents without OCR. The proposed technique identifies languages through word shape coding, which converts each word image into a word shape code and accordingly transforms each document image into an electronic document vector. For each Latin-based language under study, a language template is first constructed through a corpus-based learning process. The underlying language of the query document is then determined based on the similarity between the query document vector and the constructed language templates. Compared with previously reported methods, the proposed language identification technique is fast, accurate, and tolerant to text segmentation errors caused by noise and various types of document degradation. Experimental results are promising.
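The pipeline described in the abstract (word shape codes, a document vector, per-language templates, similarity matching) can be illustrated with a minimal sketch. It is not the authors' method: the abstract does not specify the stroke categories or the similarity measure, so the sketch assumes a hypothetical three-class coding (ascender, descender, other) applied to plain text rather than word images, and assumes cosine similarity. The tiny corpora and all function names are illustrative only.

```python
from collections import Counter
import math

# Hypothetical character classes standing in for the paper's stroke
# categories (assumption: the abstract does not list the real ones).
ASCENDERS = set("bdfhklt")
DESCENDERS = set("gjpqy")


def shape_code(word):
    """Map a word to a coarse shape code.

    A = capital or ascender, D = descender, x = any other character.
    """
    code = []
    for ch in word:
        if ch.isupper() or ch in ASCENDERS:
            code.append("A")
        elif ch in DESCENDERS:
            code.append("D")
        else:
            code.append("x")
    return "".join(code)


def document_vector(text):
    """Return the relative frequency of each word shape code in the text."""
    words = [w for w in text.split() if w.isalpha()]
    counts = Counter(shape_code(w) for w in words)
    total = sum(counts.values())
    return {code: n / total for code, n in counts.items()} if total else {}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v.get(k, 0.0) for k in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def identify(query_text, templates):
    """Return the language whose template is most similar to the query."""
    q = document_vector(query_text)
    return max(templates, key=lambda lang: cosine(q, templates[lang]))


# Toy training corpora (illustrative only; a real template would be
# learned from a large corpus per language).
corpora = {
    "english": "the quick brown fox jumps over the lazy dog and the cat sat on the mat",
    "spanish": "el rapido zorro marron salta sobre el perro perezoso y el gato en la casa",
}
templates = {lang: document_vector(text) for lang, text in corpora.items()}
```

Calling `identify(some_text, templates)` returns the key of the closest template. Because the codes depend only on coarse character shape, a system of this kind would not need to recognize individual characters, which is the intuition behind avoiding OCR.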
Citation:
S. J. Lu, L. Li, and C. L. Tan, "Identification of Latin-Based Languages through Character Stroke Categorization," Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), vol. 1, pp. 352-356, 2007.