2011 International Conference on Document Analysis and Recognition
Classifying Textual Components of Bilingual Documents with Decision-Tree Support Vector Machines
Beijing, China
September 18-September 21
ISBN: 978-0-7695-4520-2
Authors: Xiao-Rong Lin, Chien-Yang Guo, Fu Chang
Pages: 498-502
ISSN: 1520-5363
DOI: http://doi.ieeecomputersociety.org/10.1109/ICDAR.2011.106
Publisher: IEEE Computer Society, Los Alamitos, CA, USA
In this paper, we propose a method for classifying textual entities of bilingual documents written in Chinese and English. In contrast to earlier works that performed classification at the level of text lines or documents, we classify at the level of textual components, because Chinese components must be identified before they can be merged into intact characters and sent to a Chinese recognizer. To cope with a large training data set containing 365,672 samples, we employ a decision-tree support vector machine (DTSVM) method, which decomposes a given data space into small regions and trains local SVMs on those regions. By applying this method to train classifiers on various combinations of feature types, we were able to complete each training process within 3,500 seconds and achieve test accuracy above 99.6% in classifying a textual component as Chinese, alphanumeric, or punctuation. Moreover, the classification had no strong bias towards any of the three categories.
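The decompose-then-train-locally idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes a Gini-based axis-aligned split, a linear SVM trained by plain subgradient descent, binary labels instead of the paper's three categories, and a 16-point toy grid instead of real component features. The names `best_split`, `train_linear_svm`, `build`, `predict`, and the `ceiling` parameter are hypothetical and chosen only for this example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y):
    """Return (feature, threshold) that lowers weighted Gini impurity, or None."""
    best, best_score, n = None, gini(y), len(y)
    for f in range(len(X[0])):
        values = sorted(set(row[f] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [y[i] for i in range(n) if X[i][f] <= thr]
            right = [y[i] for i in range(n) if X[i][f] > thr]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best, best_score = (f, thr), score
    return best

def train_linear_svm(X, y, epochs=200, lr=0.01, lam=0.01):
    """Linear SVM via subgradient descent on the L2-regularised hinge loss.
    Labels must be -1 or +1. Assumed stand-in for a real SVM solver."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            w = [wj * (1.0 - lr * lam) for wj in w]  # shrink from the L2 penalty
            if margin < 1.0:                         # hinge loss is active
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b

def build(X, y, ceiling):
    """Split until a region is pure or holds at most `ceiling` samples,
    then train a local SVM on each impure region."""
    if len(set(y)) == 1:
        return ("const", y[0])                       # pure region: no SVM needed
    split = best_split(X, y) if len(y) > ceiling else None
    if split is None:
        return ("svm", train_linear_svm(X, y))       # small impure region
    f, thr = split
    L = [i for i in range(len(y)) if X[i][f] <= thr]
    R = [i for i in range(len(y)) if X[i][f] > thr]
    return ("node", f, thr,
            build([X[i] for i in L], [y[i] for i in L], ceiling),
            build([X[i] for i in R], [y[i] for i in R], ceiling))

def predict(tree, x):
    """Route x to its region, then apply that region's local decision rule."""
    while tree[0] == "node":
        _, f, thr, left, right = tree
        tree = left if x[f] <= thr else right
    if tree[0] == "const":
        return tree[1]
    w, b = tree[1]
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Toy data (assumed): label +1 only when both features are positive.
coords = [-2.0, -1.0, 1.0, 2.0]
X = [[a, c] for a in coords for c in coords]
y = [1 if (a > 0 and c > 0) else -1 for a, c in X]

tree = build(X, y, ceiling=8)
train_accuracy = sum(predict(tree, x) == t for x, t in zip(X, y)) / len(y)
print(tree[0], tree[3][0], tree[4][0], train_accuracy)
```

In this sketch the size ceiling is what bounds training cost: each local SVM sees only the samples that fall in its region, so no single solver run has to handle the full data set.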
Index Terms:
bilingual document, component, decision-tree support vector machine, script and language identification
Citation:
Xiao-Rong Lin, Chien-Yang Guo, Fu Chang, "Classifying Textual Components of Bilingual Documents with Decision-Tree Support Vector Machines," 2011 International Conference on Document Analysis and Recognition (ICDAR), pp. 498-502, 2011.
