Sixth IEEE International Conference on Data Mining (ICDM'06)
Subjectivity Categorization of Weblog with Part-of-Speech Based Smoothing
Hong Kong
December 18-22, 2006
ISBN: 0-7695-2701-9
Shen Huang, Microsoft Research Asia, China
Jian-Tao Sun, Microsoft Research Asia, China
Xuanhui Wang, University of Illinois at Urbana-Champaign, USA
Hua-Jun Zeng, Microsoft Research Asia, China
Zheng Chen, Microsoft Research Asia, China
Experts from different domains mine users' comments on weblogs for different purposes, such as political or commercial analysis. These needs all require automatically distinguishing subjective weblog content from objective content, a task known as subjectivity categorization. Because weblogs cover many topics from different domains, limited training data can hardly cover all of them, and "unseen words" become a serious problem for categorization tasks. In this paper, Part-Of-Speech (POS) based smoothing is proposed to alleviate the "unseen words" problem. In conjunction with a naïve Bayes model constructed from limited training data, the probability of an unseen word in a new domain can be well smoothed by the probability of its POS tag. Empirical studies on five datasets show that our approach consistently outperforms basic naïve Bayes with Laplace smoothing. In a cross-domain experiment, our approach achieves a 22.0% improvement in Macro F1 and a 24.4% improvement in Micro F1 over basic naïve Bayes. These results verify that POS based smoothing can indeed benefit subjectivity categorization, especially in cases with a large number of unseen words.
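The abstract does not give the smoothing formula, so the following Python sketch is only one plausible reading of the idea, not the authors' method. It assumes that a word seen in training is scored with a Laplace-smoothed word likelihood, and that an unseen word backs off to the Laplace-smoothed likelihood of its POS tag in each class. The class name `PosSmoothedNB`, the parameter `alpha`, the toy documents, and the hand-assigned tags are all hypothetical.

```python
import math
from collections import Counter


class PosSmoothedNB:
    """Naive Bayes over (word, pos_tag) tokens with a POS back-off for unseen words.

    Illustrative assumption: the back-off rule below is a guess at the general
    idea, not the formulation used in the paper.
    """

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing constant (assumed value)

    def fit(self, docs, labels):
        # docs: list of documents, each a list of (word, pos_tag) pairs
        self.classes = sorted(set(labels))
        self.class_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        self.pos_counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            for word, tag in doc:
                self.word_counts[label][word] += 1
                self.pos_counts[label][tag] += 1
        self.vocab = set().union(*self.word_counts.values())
        self.tags = set().union(*self.pos_counts.values())
        self.word_totals = {c: sum(self.word_counts[c].values()) for c in self.classes}
        self.pos_totals = {c: sum(self.pos_counts[c].values()) for c in self.classes}
        return self

    def _log_token_prob(self, word, tag, c):
        if word in self.vocab:
            # Seen word: ordinary Laplace-smoothed word likelihood.
            num = self.word_counts[c][word] + self.alpha
            den = self.word_totals[c] + self.alpha * len(self.vocab)
        else:
            # Unseen word: back off to the likelihood of its POS tag.
            # The extra +1 reserves mass for tags never seen in training.
            num = self.pos_counts[c][tag] + self.alpha
            den = self.pos_totals[c] + self.alpha * (len(self.tags) + 1)
        return math.log(num / den)

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        scores = {}
        for c in self.classes:
            score = math.log(self.class_counts[c] / total_docs)
            score += sum(self._log_token_prob(w, t, c) for w, t in doc)
            scores[c] = score
        return max(scores, key=scores.get)


# Toy training data with hand-assigned POS tags (hypothetical).
train_docs = [
    [("i", "PRP"), ("love", "VBP"), ("this", "DT"), ("great", "JJ"), ("phone", "NN")],
    [("what", "WP"), ("an", "DT"), ("awful", "JJ"), ("boring", "JJ"), ("movie", "NN")],
    [("the", "DT"), ("company", "NN"), ("released", "VBD"), ("a", "DT"), ("phone", "NN")],
    [("the", "DT"), ("movie", "NN"), ("opens", "VBZ"), ("on", "IN"), ("friday", "NNP")],
]
train_labels = ["subjective", "subjective", "objective", "objective"]
model = PosSmoothedNB().fit(train_docs, train_labels)

# Both test words are unseen, so only their POS tags drive the decision.
print(model.predict([("fantastic", "JJ"), ("gorgeous", "JJ")]))
```

In this toy setup adjectives occur only in the subjective documents, so a document made entirely of unseen adjectives is still pushed toward the subjective class. Plain Laplace smoothing would give every unseen word the same likelihood within a class, regardless of its tag. Note that mixing word and tag likelihoods this way does not yield a properly normalized distribution; it is meant only to show the back-off mechanism.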
Citation:
Shen Huang, Jian-Tao Sun, Xuanhui Wang, Hua-Jun Zeng, Zheng Chen, "Subjectivity Categorization of Weblog with Part-of-Speech Based Smoothing," Sixth IEEE International Conference on Data Mining (ICDM'06), pp. 285-294, 2006.