PEBL: Web Page Classification without Negative Examples
January 2004 (vol. 16 no. 1)
pp. 70-81

Abstract: Web page classification is one of the essential techniques for Web mining because identifying the Web pages of a class of interest is often the first step of mining the Web. However, constructing a classifier for such a class requires laborious preprocessing, such as collecting positive and negative training examples. For instance, to construct a “homepage” classifier, one needs to collect a sample of homepages (positive examples) and a sample of nonhomepages (negative examples). Collecting negative training examples in particular requires arduous work and caution to avoid bias. This paper presents a framework, called Positive Example Based Learning (PEBL), for Web page classification that eliminates the need to manually collect negative training examples in preprocessing. The PEBL framework applies an algorithm, called Mapping-Convergence (M-C), that uses positive and unlabeled data to achieve classification accuracy as high as that of a traditional SVM trained on positive and negative data. M-C runs in two stages: the mapping stage and the convergence stage. In the mapping stage, the algorithm uses a weak classifier to draw an initial approximation of “strong” negative data. Based on this initial approximation, the convergence stage iteratively runs an internal classifier (e.g., SVM) that maximizes margins to progressively improve the approximation of the negative data. Thus, the class boundary eventually converges to the true boundary of the positive class in the feature space. We present the M-C algorithm with supporting theoretical and experimental justifications. Our experiments show that, given the same set of positive examples, the M-C algorithm outperforms one-class SVMs and is almost as accurate as traditional SVMs.
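
The two stages described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a single numeric feature with the positive class lying above some unknown cutoff, replaces the SVM with a midpoint threshold (which is the maximum-margin separator for separable 1-D data), and uses a hypothetical `gap` rule as the weak classifier of the mapping stage. The function names and sample values are likewise assumptions made for illustration.

```python
def mapping_stage(positives, unlabeled, gap):
    """Weak classifier (hypothetical rule): unlabeled points lying more than
    `gap` below the lowest positive are taken as "strong" negatives."""
    lowest_positive = min(positives)
    strong_negatives = [x for x in unlabeled if lowest_positive - x > gap]
    remaining = [x for x in unlabeled if lowest_positive - x <= gap]
    return strong_negatives, remaining


def max_margin_threshold(positives, negatives):
    """Stand-in for the internal SVM: in 1-D, the maximum-margin boundary is
    the midpoint between the closest negative and the closest positive."""
    return (max(negatives) + min(positives)) / 2.0


def mapping_convergence(positives, unlabeled, gap):
    """Grow the negative set until the boundary stops moving."""
    negatives, remaining = mapping_stage(positives, unlabeled, gap)
    if not negatives:
        raise ValueError("mapping stage found no strong negatives")
    while True:
        threshold = max_margin_threshold(positives, negatives)
        # Unlabeled points on the negative side of the current boundary
        # are added to the approximation of the negative data.
        newly_negative = [x for x in remaining if x < threshold]
        if not newly_negative:
            # Converged: nothing left in the unlabeled set is classified negative.
            return threshold, remaining
        negatives.extend(newly_negative)
        remaining = [x for x in remaining if x >= threshold]


# Toy data: labeled positives, plus unlabeled points that mix both classes.
P = [6.0, 7.0, 8.0, 9.0]
U = [0.5, 1.0, 2.0, 3.0, 4.0, 4.5, 6.5, 7.5, 8.5]

threshold, predicted_positive = mapping_convergence(P, U, gap=4.0)
print(threshold, predicted_positive)
```

On this toy data the boundary starts far from the positives, because only the two most extreme unlabeled points are treated as negative at first, and it moves toward the positive class with each pass until no remaining unlabeled point falls on the negative side.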

Index Terms:
Web page classification, Web mining, document classification, single-class classification, Mapping-Convergence (M-C) algorithm, SVM (Support Vector Machine).
Citation:
Hwanjo Yu, Jiawei Han, Kevin Chen-Chuan Chang, "PEBL: Web Page Classification without Negative Examples," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 1, pp. 70-81, Jan. 2004, doi:10.1109/TKDE.2004.1264823