2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR) (2017)
Nov. 9, 2017 to Nov. 15, 2017
With increasing globalization, script identification has become an active field in document image processing. However, many methods perform well only on the scripts of particular countries and regions and cannot be applied to all scripts. Research on Central Asian scripts in particular is scarce. In this paper, the Nonsubsampled Contourlet Transform (NSCT) is used to extract texture features from document images in Central Asian scripts, and a K Nearest Neighbor (KNN) classifier is used for classification. A total of 7,000 document images in 10 scripts (English, Chinese, Uyghur, Tibetan, Arabic, Turkish, Mongolian, Russian, Kazakh, and Kyrgyz) were classified, with an average accuracy of 98.7%. Experimental results indicate that the proposed script identification method is effective for multi-script document images, especially those in Central Asian scripts.
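The classification stage described above can be illustrated with a minimal sketch of a K Nearest Neighbor vote over texture feature vectors. This is not the authors' implementation: the feature values, the three-dimensional feature size, the Euclidean distance, the choice of k = 3, and the function name knn_predict are all assumptions made for illustration, and the NSCT feature extraction step itself is not shown.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Return the majority label among the k training vectors nearest to query.

    train is a list of (feature_vector, label) pairs. Euclidean distance is
    an assumption here; the abstract does not state the distance measure.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical texture feature vectors (made-up numbers standing in for
# statistics computed from NSCT subbands of a document image).
train = [
    ([0.90, 0.10, 0.20], "Arabic"),
    ([0.85, 0.15, 0.25], "Arabic"),
    ([0.80, 0.20, 0.20], "Arabic"),
    ([0.20, 0.80, 0.70], "Tibetan"),
    ([0.25, 0.75, 0.65], "Tibetan"),
    ([0.30, 0.85, 0.75], "Tibetan"),
]

# Classify an unseen (also hypothetical) feature vector.
print(knn_predict(train, [0.88, 0.12, 0.22]))  # prints "Arabic"
```

In practice each document image would first be reduced to a fixed-length feature vector, and k would be tuned on held-out data.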
document image processing, feature extraction, image classification, image segmentation, image texture, natural language processing, optical character recognition, text analysis, transforms
X. Han, A. Aysa, N. Yadikar, H. Mamat and K. Ubul, "Script Identification Based on Nonsubsampled Contourlet Transform," 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Kyoto, Japan, 2018, pp. 697-702.