Web Intelligence and Intelligent Agent Technology, IEEE/WIC/ACM International Conference on (2011)
Aug. 22, 2011 to Aug. 27, 2011
This paper proposes a novel text representation for Web pages written in Vietnamese. The representation is based on an analysis of Vietnamese documents at the phonetic level, in which each document is represented as a bag of phonemes. It is designed to capture sound-based information in documents and to help with several non-topic text classification problems, including automatic identification of Vietnamese as the language of a document, detection of ancient Vietnamese documents, author identification, and poem identification. We apply typical machine learning methods, including Naive Bayes (NB), k-nearest neighbors (KNN), and support vector machines (SVMs), to build text classifiers. In most cases, the experimental results show a significant improvement in both effectiveness and efficiency over the traditional syllable-based representation.
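To make the bag-of-phonemes idea concrete, the following minimal Python sketch counts sub-syllable sound units in a Vietnamese text. It is an illustration under stated assumptions, not the authors' method: the abstract does not specify the phoneme inventory or the segmentation procedure, so the sketch uses a crude orthographic split of each syllable into onset, rime, and tone. The names `ONSETS`, `TONE_MARKS`, `split_syllable`, and `bag_of_units` are hypothetical.

```python
import unicodedata
from collections import Counter

# ASSUMPTION: a hand-written list of orthographic onsets, longest first so that
# "ngh" is matched before "ng" and "n". The paper's phoneme set may differ.
ONSETS = sorted(
    ["ngh", "ch", "gh", "gi", "kh", "ng", "nh", "ph", "qu", "th", "tr",
     "b", "c", "d", "đ", "g", "h", "k", "l", "m", "n", "p", "r", "s",
     "t", "v", "x"],
    key=len, reverse=True)

# Combining marks that carry Vietnamese tones after NFD decomposition.
TONE_MARKS = {
    "\u0300": "huyen",  # grave
    "\u0301": "sac",    # acute
    "\u0303": "nga",    # tilde
    "\u0309": "hoi",    # hook above
    "\u0323": "nang",   # dot below
}

def split_syllable(syllable):
    """Split one syllable into (onset, rime, tone) using spelling only."""
    decomposed = unicodedata.normalize("NFD", syllable.lower())
    tone = "ngang"  # level tone: no tone mark present
    kept = []
    for ch in decomposed:
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            kept.append(ch)
    # Recompose so vowel-quality marks (circumflex, breve, horn) stay attached.
    base = unicodedata.normalize("NFC", "".join(kept))
    onset = next((o for o in ONSETS if base.startswith(o)), "")
    return onset, base[len(onset):], tone

def bag_of_units(text):
    """Count onset, rime, and tone units over whitespace-separated syllables."""
    bag = Counter()
    for token in unicodedata.normalize("NFC", text).split():
        syllable = "".join(ch for ch in token if ch.isalpha())
        if not syllable:
            continue
        onset, rime, tone = split_syllable(syllable)
        if onset:
            bag["onset:" + onset] += 1
        if rime:
            bag["rime:" + rime] += 1
        bag["tone:" + tone] += 1
    return bag
```

The resulting counts could be turned into a feature vector and passed to any of the classifiers named above. Because the set of distinct sound units is far smaller than the set of distinct syllables, such vectors are also much shorter than syllable-based ones.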
Document representation, Classification
Xiaoying Gao, Peter Andreae, Giang-Son Nguyen, "Phoneme Based Representation for Vietnamese Web Page Classification", Web Intelligence and Intelligent Agent Technology, IEEE/WIC/ACM International Conference on, vol. 01, pp. 15-22, 2011, doi:10.1109/WI-IAT.2011.142