Ninth International Conference on Document Analysis and Recognition (ICDAR 2007) Vol 1
Identification of Latin-Based Languages through Character Stroke Categorization
Curitiba, Parana, Brazil
September 23-September 26
ISBN: 0-7695-2822-8
S.J. Lu, National University of Singapore, Kent Ridge, 117543, Singapore
L. Li, National University of Singapore, Kent Ridge, 117543, Singapore
Chew Lim Tan, National University of Singapore, Kent Ridge, 117543, Singapore
This paper presents a language identification technique that detects the Latin-based language of an imaged document without OCR. The proposed technique identifies languages through word shape coding, which converts each word image into a word shape code and thereby transforms each document image into an electronic document vector. For each Latin-based language under study, a language template is first constructed through a corpus-based learning process. The underlying language of the query document is then determined from the similarity between the query document vector and the constructed language templates. Compared with previously reported methods, the proposed language identification technique is fast, accurate, and tolerant to text segmentation errors caused by noise and various types of document degradation. Experimental results are promising.
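The pipeline outlined in the abstract (word shape codes, a document vector, per-language templates, similarity matching) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the sketch works on plain text rather than word images, the three shape classes (ascender, descender, x-height) stand in for the paper's stroke categories, which the abstract does not specify, and cosine similarity is a stand-in for whatever similarity measure the authors use. The function names are hypothetical.

```python
import math
from collections import Counter

# Assumed shape classes; the paper's actual stroke categories are not given here.
ASCENDERS = set("bdfhklt")
DESCENDERS = set("gjpqy")


def shape_code(word):
    """Map a word to a coarse shape code: A (tall), D (descending), x (x-height)."""
    code = []
    for ch in word:
        if ch.isupper() or ch in ASCENDERS:
            code.append("A")
        elif ch in DESCENDERS:
            code.append("D")
        elif ch.isalpha():
            code.append("x")
        # Non-letters (digits, punctuation) are skipped.
    return "".join(code)


def document_vector(text):
    """Relative frequency of each word shape code in the text."""
    codes = [shape_code(w) for w in text.split()]
    counts = Counter(c for c in codes if c)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()} if total else {}


def build_template(corpus):
    """Language template: the document vector of the pooled training corpus."""
    return document_vector(" ".join(corpus))


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v.get(k, 0.0) for k in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def identify(text, templates):
    """Return the language whose template is most similar to the query vector."""
    query = document_vector(text)
    return max(templates, key=lambda lang: cosine(query, templates[lang]))


# Toy corpora, far too small for real use; they only show the data flow.
templates = {
    "english": build_template(["the quick brown fox jumps over the lazy dog"]),
    "spanish": build_template(["el rápido zorro marrón salta sobre el perro perezoso"]),
}
print(identify("the lazy dog", templates))
```

Because the codes depend only on coarse character shape, a sketch like this never needs to recognize individual characters, which mirrors the OCR-free motivation stated in the abstract.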
Citation:
S.J. Lu, L. Li, Chew Lim Tan, "Identification of Latin-Based Languages through Character Stroke Categorization," Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), vol. 1, pp. 352-356, 2007