2015 International Conference on Big Data and Smart Computing (BigComp) (2015)
Jeju, South Korea
Feb. 9, 2015 to Feb. 11, 2015
Seonggyu Lee, Division of Web Science and Technology, KAIST, Daejeon, South Korea
Jinho Kim, Division of Web Science and Technology, KAIST, Daejeon, South Korea
Sung-Hyon Myaeng, Division of Web Science and Technology, Department of Computer Science, KAIST, Daejeon, South Korea
Text classification has become a critical step in big data analytics. For supervised machine learning approaches to text classification, performance depends on the availability of sufficient training data with classification labels attached to individual text units. Because labeled data are usually scarce, however, it is desirable to devise a semi-supervised method in which unlabeled data are used in addition to labeled data. One solution is to apply a latent factor model to generate clustered text features and use them for text classification. The main thrust of the current research is to extend Latent Dirichlet Allocation (LDA) for this purpose by considering word weights in sampling and maintaining the balance of topic distributions. A series of experiments was conducted to evaluate the proposed method on classification tasks. The results show that the topic distributions generated by the balance-weighted topic modeling method add some discriminative power to feature generation for classification.
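The idea of considering word weights in sampling can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not give the weighting scheme or the balancing step, so the smoothed-IDF weights, the function names idf_weights and weighted_lda, and all parameter values below are assumptions made for illustration. The sketch runs a collapsed Gibbs sampler for LDA in which each token adds its word weight, rather than 1, to the topic counts, and it omits topic balancing entirely.

```python
import math
import random
from collections import Counter, defaultdict

def idf_weights(docs):
    """Assumed weighting scheme: smoothed IDF per word type (not from the paper)."""
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc))
    return {w: math.log((1 + n) / (1 + c)) + 1.0 for w, c in df.items()}

def weighted_lda(docs, n_topics, weights, alpha=0.1, beta=0.01, iters=50, seed=0):
    """Collapsed Gibbs sampling for LDA where each token counts as weights[word]."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0.0] * n_topics for _ in docs]            # weighted doc-topic counts
    topic_word = [defaultdict(float) for _ in range(n_topics)]
    topic_total = [0.0] * n_topics
    assign = []

    def update(d, z, w, delta):
        doc_topic[d][z] += delta
        topic_word[z][w] += delta
        topic_total[z] += delta

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = [rng.randrange(n_topics) for _ in doc]
        for w, z in zip(doc, zs):
            update(d, z, w, weights[w])
        assign.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                update(d, assign[d][i], w, -weights[w])     # remove current token
                probs = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=probs)[0]
                assign[d][i] = z
                update(d, z, w, weights[w])                 # add it back under new topic

    # Normalized document-topic distributions, usable as classifier features.
    theta = []
    for row in doc_topic:
        total = sum(row) + n_topics * alpha
        theta.append([(c + alpha) / total for c in row])
    return theta

# Toy corpus standing in for labeled plus unlabeled documents.
docs = [
    "stock market price trade".split(),
    "market trade bank price".split(),
    "team game score player".split(),
    "player game team win".split(),
]
theta = weighted_lda(docs, n_topics=2, weights=idf_weights(docs))
```

In a semi-supervised setting of the kind the abstract describes, the sampler would be run over labeled and unlabeled documents together, and the rows of theta for the labeled documents could then be passed, alone or alongside bag-of-words features, to any standard classifier.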
Text categorization, Training data, Feature extraction, Training, Data models, Resource management, Vocabulary
S. Lee, J. Kim and S. Myaeng, "An extension of topic models for text classification: A term weighting approach," 2015 International Conference on Big Data and Smart Computing (BigComp), Jeju, South Korea, 2015, pp. 217-224.