WebGuard: A Web Filtering Engine Combining Textual, Structural, and Visual Content-Based Analysis
February 2006 (vol. 18 no. 2)
pp. 272-284
Liming Chen, IEEE Computer Society
Along with the ever-growing Web comes a proliferation of objectionable content, such as sex, violence, and racism. We need efficient tools for classifying and filtering undesirable Web content. In this paper, we investigate this problem and describe WebGuard, an automatic, machine learning-based system for classifying and filtering pornographic Web sites. Unlike most commercial filtering products, which rely mainly on textual content-based analysis such as indicative-keyword detection or checking against manually collected blacklists, WebGuard relies on several major data mining techniques applied to textual and structural content-based analysis as well as skin-color-related visual content-based analysis. Experiments on a test bed of 400 Web sites, comprising 200 adult sites and 200 nonpornographic ones, demonstrated WebGuard's filtering effectiveness: it reached a 97.4 percent classification accuracy rate when textual and structural content-based analysis was combined with visual content-based analysis. Further experiments on a blacklist of 12,311 adult Web sites, manually collected and classified by the French Ministry of Education, showed that WebGuard achieved a 95.62 percent classification accuracy rate. The basic framework of WebGuard can be applied to other Web site categorization problems that combine, as most do today, textual and visual content.
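
To make the idea of combining a textual cue with a skin-color visual cue concrete, the following minimal Python sketch may help. It is an illustration only and not WebGuard's actual model: the function names, the thresholds, the decision rule, and the simple RGB skin heuristic are all assumptions introduced here, whereas the abstract describes a data mining-based approach whose details are not given on this page.

```python
from typing import Iterable, Sequence, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each in 0..255


def keyword_ratio(words: Sequence[str], indicative: Iterable[str]) -> float:
    """Fraction of page words found in a (hypothetical) indicative-keyword set."""
    vocab = {w.lower() for w in indicative}
    if not words:
        return 0.0
    hits = sum(1 for w in words if w.lower() in vocab)
    return hits / len(words)


def is_skin(pixel: Pixel) -> bool:
    """Rule-of-thumb RGB skin test (an assumption, not a learned skin model)."""
    r, g, b = pixel
    return (
        r > 95 and g > 40 and b > 20
        and max(pixel) - min(pixel) > 15
        and abs(r - g) > 15
        and r > g and r > b
    )


def skin_ratio(pixels: Sequence[Pixel]) -> float:
    """Fraction of pixels that pass the skin test."""
    if not pixels:
        return 0.0
    return sum(1 for p in pixels if is_skin(p)) / len(pixels)


def classify(
    words: Sequence[str],
    pixels: Sequence[Pixel],
    indicative: Iterable[str],
    text_threshold: float = 0.05,   # hypothetical value
    skin_threshold: float = 0.30,   # hypothetical value
) -> bool:
    """Hypothetical rule: flag on a strong textual cue alone, or on a weak
    textual cue that is backed up by a high proportion of skin pixels."""
    kr = keyword_ratio(words, indicative)
    sr = skin_ratio(pixels)
    return kr >= text_threshold or (kr > 0 and sr >= skin_threshold)
```

In this sketch the visual cue never flags a page on its own; it only tips the decision when some textual evidence is present. That is one possible way to combine the two signals, chosen here for simplicity.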

Index Terms:
Web classification and categorization, data mining, Web textual and structural content, visual content analysis, skin color model, pornographic Web site filtering.
Mohamed Hammami, Youssef Chahir, Liming Chen, "WebGuard: A Web Filtering Engine Combining Textual, Structural, and Visual Content-Based Analysis," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 2, pp. 272-284, Feb. 2006, doi:10.1109/TKDE.2006.34