A Unified Probabilistic Framework for Web Page Scoring Systems
January 2004 (vol. 16 no. 1)
pp. 4-16
Marco Gori, IEEE
Marco Maggini, IEEE Computer Society

Abstract: The definition of efficient page ranking algorithms is becoming an important issue in the design of the query interface of Web search engines. Information flooding is a common experience, especially when broad-topic queries are issued. Queries containing only one or two keywords usually match a huge number of documents, while users can only afford to visit the top positions of the returned list, which do not necessarily contain the most appropriate answers. Some successful approaches to page ranking in a hyperlinked environment, like the Web, are based on link analysis. In this paper, we propose a general probabilistic framework for Web Page Scoring Systems (WPSS), which incorporates and extends many of the relevant models proposed in the literature. In particular, we introduce scoring systems for both generic (horizontal) and focused (vertical) search engines. Whereas horizontal scoring algorithms are based only on the topology of the Web graph, vertical scoring algorithms also take the page contents into account and are the basis for focused and user-adapted search interfaces. Experimental results are reported to show the properties of some of the proposed scoring systems, with special emphasis on vertical search.
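
To make the horizontal/vertical distinction concrete, the following is a minimal sketch of a PageRank-style random walk, not the framework defined in the paper. Everything in it is an assumption made for illustration: the function name `pagerank`, the `links` adjacency dictionary, the `damping` value of 0.85, and the `bias` dictionary, which stands in for a content-based relevance score that a vertical system might obtain from a page classifier. With `bias=None` the surfer jumps uniformly and the scores depend only on graph topology (horizontal); with a `bias` the jumps favour on-topic pages (one simple form of focused scoring).

```python
def pagerank(links, damping=0.85, bias=None, iterations=100):
    """Score pages with a damped random walk (illustrative sketch only).

    links: dict mapping each page to the list of pages it links to.
    bias:  optional dict mapping pages to non-negative relevance weights;
           hypothetical stand-in for content-based scores.
    """
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    n = len(pages)

    # Jump distribution: uniform (horizontal) or content-biased (vertical).
    if bias is None:
        jump = {p: 1.0 / n for p in pages}
    else:
        total = sum(bias.get(p, 0.0) for p in pages)
        if total <= 0:
            raise ValueError("bias must give positive weight to at least one page")
        jump = {p: bias.get(p, 0.0) / total for p in pages}

    score = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: 0.0 for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # Follow one of the outgoing links with equal probability.
                share = damping * score[p] / len(targets)
                for q in targets:
                    new[q] += share
            else:
                # Dangling page: spread its mass over the jump distribution.
                for q in pages:
                    new[q] += damping * score[p] * jump[q]
        for p in pages:
            # With probability 1 - damping, jump instead of following a link.
            new[p] += (1.0 - damping) * jump[p]
        score = new
    return score


# Hypothetical four-page graph used only to exercise the sketch.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
horizontal = pagerank(links)
vertical = pagerank(links, bias={"d": 1.0})  # pretend only "d" is on-topic
```

In this toy graph, page "d" has no incoming links, so its horizontal score comes only from random jumps; biasing the jumps toward "d" raises its score without changing the link structure.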

Index Terms:
Web page scoring systems, random walks, HITS, PageRank, focused PageRank.
Citation:
Michelangelo Diligenti, Marco Gori, Marco Maggini, "A Unified Probabilistic Framework for Web Page Scoring Systems," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 1, pp. 4-16, Jan. 2004, doi:10.1109/TKDE.2004.1264818