International Conference on Information Technology: Coding and Computing (ITCC'04) Volume 2
A New Framework for Uncertainty Sampling: Exploiting Uncertain and Positive-Certain Examples in Similarity-Based Text Classification
Las Vegas, Nevada
April 5-7, 2004
ISBN: 0-7695-2108-8
A major concern with supervised learning approaches to text classification is that they require a large number of labeled examples to achieve a high level of effectiveness. Labeling so many examples places a considerable burden on human experts. Two common approaches to reducing the number of labeled examples required are: (1) selecting informative uncertain examples for human labeling and (2) using plentiful, inexpensive unlabeled data together with a small number of labeled examples. While previous work in text classification has focused on only one of these approaches, we investigate a framework that combines both in similarity-based text classification. By applying our new thresholding strategy (RinSCut) to uncertainty sampling, we propose a framework that automatically selects two kinds of data: informative uncertain data, which are presented to a human expert for labeling, and positive-certain data, which are used directly for learning without human labeling. Experiments were conducted on the Reuters-21578 data set using our similarity-based learning algorithm (KAN). The proposed scheme was compared with random sampling and with conventional uncertainty sampling, using micro- and macro-averaged F1. The results suggest that when both macro- and micro-averaged measures matter, our framework may be the best choice.
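The selection step and the two evaluation measures can be illustrated with a minimal Python sketch. It is not the RinSCut or KAN implementation, whose details the abstract does not give. It assumes each unlabeled document already has a similarity score for one category, and that two hypothetical thresholds, `lower` and `upper`, bound the uncertain range. The function names, thresholds, and sample numbers are all illustrative assumptions.

```python
from typing import Dict, List, Tuple


def partition_by_similarity(
    scores: Dict[str, float], lower: float, upper: float
) -> Tuple[List[str], List[str], List[str]]:
    """Split unlabeled documents into three groups for a single category.

    `lower` and `upper` are hypothetical thresholds (an assumption of this
    sketch, not the paper's RinSCut rule).
    """
    if lower > upper:
        raise ValueError("lower must not exceed upper")
    uncertain, positive_certain, ignored = [], [], []
    for doc_id, score in scores.items():
        if score >= upper:
            # Confidently positive: used for learning without human labeling.
            positive_certain.append(doc_id)
        elif score >= lower:
            # Ambiguous: sent to a human expert for labeling.
            uncertain.append(doc_id)
        else:
            # Neither uncertain nor positive-certain: left out of this round.
            ignored.append(doc_id)
    return uncertain, positive_certain, ignored


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(counts: Dict[str, Tuple[int, int, int]]) -> Tuple[float, float]:
    """Return (micro-F1, macro-F1) from per-category (tp, fp, fn) counts."""
    # Macro: average the per-category F1 values, so every category counts equally.
    macro = sum(f1(*c) for c in counts.values()) / len(counts)
    # Micro: pool the counts first, so frequent categories dominate.
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f1(tp, fp, fn), macro


if __name__ == "__main__":
    sample_scores = {"d1": 0.9, "d2": 0.5, "d3": 0.1}
    print(partition_by_similarity(sample_scores, lower=0.3, upper=0.8))
    print(micro_macro_f1({"frequent": (90, 10, 10), "rare": (1, 1, 3)}))
```

In the made-up counts above, the rare category lowers macro-averaged F1 much more than micro-averaged F1, which is why a method can rank differently under the two measures.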
Citation:
Kang H. Lee, Byeong H. Kang, "A New Framework for Uncertainty Sampling: Exploiting Uncertain and Positive-Certain Examples in Similarity-Based Text Classification," International Conference on Information Technology: Coding and Computing (ITCC'04), vol. 2, p. 474, 2004