Web Intelligence and Intelligent Agent Technology, IEEE/WIC/ACM International Conference on (2011)
Aug. 22, 2011 to Aug. 27, 2011
This paper proposes a novel text representation for Web pages written in Vietnamese. The representation is based on an analysis of Vietnamese documents at the phonetic level, in which each document is represented as a bag of phonemes. It is designed to capture sound-based information in documents and to help resolve several non-topic text classification problems, including automatic Vietnamese language identification of a document, ancient Vietnamese document detection, author identification, and poem identification. We apply several typical machine learning methods, including Naive Bayes (NB), k-nearest neighbors (KNN), and support vector machines (SVMs), to build text classifiers. The experimental results show a significant improvement in both effectiveness and efficiency over the traditional syllable-based representation in most cases.
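As a rough illustration of the bag-of-phonemes idea described in the abstract, the following Python sketch counts sub-syllable units instead of whole syllables. It is not the authors' method: the onset list `ONSETS` and the helpers `split_syllable` and `bag_of_units` are hypothetical, the split is purely orthographic (initial consonant cluster plus the remaining rhyme, with tone marks left attached), and the input is assumed to be whitespace-separated syllables without punctuation.

```python
import unicodedata
from collections import Counter

# Hypothetical, simplified inventory of orthographic initial consonants.
# Longest entries come first so that "ngh" is tried before "ng" and "n".
ONSETS = [
    "ngh",
    "ch", "gh", "gi", "kh", "ng", "nh", "ph", "qu", "th", "tr",
    "b", "c", "d", "đ", "g", "h", "k", "l", "m", "n", "p", "r",
    "s", "t", "v", "x",
]


def split_syllable(syllable):
    """Split one syllable into [onset, rhyme], or [syllable] if no onset matches.

    This is an orthographic approximation of a phonemic split, not a real
    phoneme analysis; tone marks stay attached to the rhyme.
    """
    s = unicodedata.normalize("NFC", syllable).lower()
    for onset in ONSETS:
        # Require a non-empty rhyme after the onset.
        if s.startswith(onset) and len(s) > len(onset):
            return [onset, s[len(onset):]]
    return [s]


def bag_of_units(text):
    """Count sub-syllable units over whitespace-separated syllables."""
    units = []
    for syllable in text.split():
        units.extend(split_syllable(syllable))
    return Counter(units)


if __name__ == "__main__":
    # Two rhyming syllables share the unit "anh", which a
    # syllable-level bag of words would not capture.
    print(bag_of_units("thanh nhanh"))
```

A count vector built this way could be passed to any standard classifier in place of syllable counts. A faithful reproduction would need the phoneme inventory and the tone handling defined in the paper itself.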
Document representation, Classification
X. Gao, P. Andreae and G. Nguyen, "Phoneme Based Representation for Vietnamese Web Page Classification," 2011 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (WI-IAT), Lyon, 2011, pp. 15-22.