Text Classification without Negative Examples Revisit
January 2006 (vol. 18, no. 1)
pp. 6-20
Traditionally, building a classifier requires two sets of examples: positive examples and negative examples. This paper studies the problem of building a text classifier using only positive examples (P) and unlabeled examples (U). The unlabeled examples are a mixture of positive and negative examples. Since no negative example is given explicitly, the task of building a reliable text classifier becomes far more challenging. Simply treating all of the unlabeled examples as negative examples and then building a classifier is a poor approach to this problem. Generally speaking, most studies solve this problem by a two-step heuristic: first, extract negative examples (N) from U; second, build a classifier based on P and N. Surprisingly, most studies did not try to extract positive examples from U. Intuitively, enlarging P by P' (positive examples extracted from U) before building a classifier should enhance its effectiveness. In our study, we find that extracting P' is very difficult: that a document in U exhibits the features found in P does not necessarily make it a positive example, and vice versa. The very large size and high diversity of U also add to the difficulty of extracting P'. In this paper, we propose a labeling heuristic called PNLH to tackle this problem. PNLH aims at extracting high-quality positive and negative examples from U and can be used on top of any existing classifier. Extensive experiments based on several benchmarks are conducted. The results indicate that PNLH is highly feasible, especially when |P| is extremely small.
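The generic two-step heuristic mentioned in the abstract can be sketched as follows. This is a minimal, hypothetical illustration of the idea (score documents in U against a prototype of P, take the least similar ones as reliable negatives, then train on P and N), not an implementation of PNLH itself; the prototype representation, similarity measure, cut-off fraction `frac`, and the nearest-centroid final classifier are all simplifying assumptions.

```python
from collections import Counter
from math import sqrt

def tokenize(doc):
    # Bag-of-words term frequencies (whitespace tokens, lowercased).
    return Counter(doc.lower().split())

def centroid(docs):
    # Average term-frequency vector over a set of documents.
    total = Counter()
    for d in docs:
        total.update(tokenize(d))
    n = len(docs)
    return {t: c / n for t, c in total.items()}

def cosine(vec, doc):
    # Cosine similarity between a centroid vector and a document.
    tf = tokenize(doc)
    dot = sum(vec.get(t, 0.0) * c for t, c in tf.items())
    na = sqrt(sum(v * v for v in vec.values()))
    nb = sqrt(sum(c * c for c in tf.values()))
    return dot / (na * nb) if na and nb else 0.0

def extract_reliable_negatives(P, U, frac=0.5):
    # Step 1: the documents in U least similar to the positive
    # prototype are taken as reliable negatives N.
    proto = centroid(P)
    ranked = sorted(U, key=lambda d: cosine(proto, d))
    k = max(1, int(len(U) * frac))
    return ranked[:k]

def train_and_classify(P, U, doc):
    # Step 2: build a classifier from P and N (here, a simple
    # nearest-centroid rule) and apply it to a new document.
    N = extract_reliable_negatives(P, U)
    pos_proto, neg_proto = centroid(P), centroid(N)
    return cosine(pos_proto, doc) >= cosine(neg_proto, doc)
```

The paper's point is that this second step uses only N; PNLH additionally tries to enlarge P with high-quality positives drawn from U before training.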


Index Terms: Data mining, text categorization, partially supervised learning, labeling unlabeled data.
Gabriel Pui Cheong Fung, Jeffrey X. Yu, Hongjun Lu, Philip S. Yu, "Text Classification without Negative Examples Revisit," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 1, pp. 6-20, Jan. 2006, doi:10.1109/TKDE.2006.16