THESUS, a Closer View on Web Content Management Enhanced with Link Semantics
June 2004 (vol. 16 no. 6)
pp. 685-700

Abstract: With the unstoppable growth of the World Wide Web and the great success of Web search engines such as Google and AltaVista, users now turn to the Web whenever they look for information. However, many users are neophytes when it comes to computer science, yet they are often specialists in a certain domain. These users would like to add more semantics to guide their search through World Wide Web material, whereas most search features are currently based on raw lexical content. We show in this paper how the incoming links of a page can be used efficiently to classify the page in a concise manner. This enhances the browsing and querying of Web pages. In this article, we focus on the tools needed to manage the links and their semantics. We further process these links using a hierarchy of concepts, akin to an ontology, and a thesaurus. This work is demonstrated by a prototype system, called THESUS, that organizes thematic Web documents into semantic clusters. Our contributions in this paper are the following: 1) a model and language to exploit link semantics information, 2) the THESUS prototype system, 3) its innovative aspects and algorithms, more specifically, the novel similarity measure between Web documents applied to different clustering schemes (DBSCAN and COBWEB), and 4) a thorough experimental evaluation proving the value of our approach.
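As a rough illustration of what a similarity measure between documents described by hierarchy concepts can look like, the sketch below scores two concepts by how deep their closest common ancestor sits (a Wu-Palmer style measure) and then compares two documents as sets of concepts. This is not the measure defined in the paper: the toy hierarchy, the concept names, and the averaging rule in `set_similarity` are all assumptions made for illustration only.

```python
from typing import Dict, List, Optional, Set

# Hypothetical toy concept hierarchy (child -> parent); not taken from the paper.
PARENT: Dict[str, Optional[str]] = {
    "entity": None,
    "science": "entity",
    "computing": "science",
    "database": "computing",
    "clustering": "computing",
    "biology": "science",
}


def path_to_root(concept: str) -> List[str]:
    """Return the concept followed by its ancestors, ending at the root."""
    path = [concept]
    while PARENT[path[-1]] is not None:
        path.append(PARENT[path[-1]])
    return path


def depth(concept: str) -> int:
    """Depth in the hierarchy, counting the root as depth 1."""
    return len(path_to_root(concept))


def wu_palmer(a: str, b: str) -> float:
    """Wu-Palmer style similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    ancestors_a = set(path_to_root(a))
    # The first ancestor of b that is also an ancestor of a is the deepest common one.
    lcs = next(c for c in path_to_root(b) if c in ancestors_a)
    return 2 * depth(lcs) / (depth(a) + depth(b))


def set_similarity(x: Set[str], y: Set[str]) -> float:
    """Assumed set measure: average best match in each direction, then the mean."""
    if not x or not y:
        return 0.0
    x_to_y = sum(max(wu_palmer(a, b) for b in y) for a in x) / len(x)
    y_to_x = sum(max(wu_palmer(a, b) for a in x) for b in y) / len(y)
    return (x_to_y + y_to_x) / 2


# Two hypothetical documents, each described by concepts assigned via its incoming links.
doc_a = {"database", "clustering"}
doc_b = {"clustering", "biology"}
score = set_similarity(doc_a, doc_b)
```

A pairwise score of this kind could be turned into a distance (for example `1 - score`) and passed to a density-based clustering algorithm, though how the paper actually combines its measure with each clustering scheme is not stated in this abstract.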


Index Terms:
World Wide Web, link analysis and management, semantic Web.
Citation:
Iraklis Varlamis, Michalis Vazirgiannis, Maria Halkidi, Benjamin Nguyen, "THESUS, a Closer View on Web Content Management Enhanced with Link Semantics," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 6, pp. 685-700, June 2004, doi:10.1109/TKDE.2004.16