27th Annual International Computer Software and Applications Conference
A Probabilistic Model for Intelligent Web Crawlers
Dallas, Texas
November 03-November 06
ISBN: 0-7695-2020-0
With the enormous growth of the World Wide Web in recent years, discovering web pages efficiently has become an important challenge for web crawler designers. In this paper, we outline a simple model that predicts the distribution of the search depth a breadth-first search needs to reach the first web pages relevant to a user query. We define this probability as the crawler confidence. Recent studies indicate that, at a large scale, the Web's structure follows power-law distributions in several respects [3][7]. Our work, by contrast, models the microscopic linkage structure of the Web from an intelligent crawler's point of view. With the information provided by crawler confidence, an intelligent crawler can adjust its crawling behavior to achieve a higher harvest rate.
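To make the quantity in the abstract concrete, the following minimal sketch measures the breadth-first depth at which the first relevant page is reached, then tabulates the empirical distribution of that depth over several start pages. It is an illustration only, not the paper's probabilistic model: the toy link graph, the relevance test, and the names `first_relevant_depth` and `depth_distribution` are all assumptions introduced here.

```python
from collections import Counter, deque


def first_relevant_depth(graph, start, is_relevant):
    """Return the BFS depth of the first relevant page reachable from
    `start`, or None if no relevant page is reachable."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        if is_relevant(page):
            return depth
        for neighbor in graph.get(page, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
    return None


def depth_distribution(graph, starts, is_relevant):
    """Empirical distribution of the first-relevant depth over the start
    pages from which a relevant page is reachable."""
    depths = [first_relevant_depth(graph, s, is_relevant) for s in starts]
    found = [d for d in depths if d is not None]
    if not found:
        return {}
    counts = Counter(found)
    return {d: counts[d] / len(found) for d in sorted(counts)}


# Hypothetical toy link graph: page -> pages it links to.
toy_graph = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["r1"],
    "d": ["r2"],
    "e": ["r1"],
    "r1": [],
    "r2": [],
}
relevant_pages = {"r1", "r2"}  # assumed relevance labels for one query


def is_relevant(page):
    return page in relevant_pages


distribution = depth_distribution(toy_graph, ["a", "b", "e", "r1"], is_relevant)
print(distribution)
```

A crawler could use such a distribution as a stopping heuristic, for example by abandoning a branch once the estimated probability of finding a relevant page within the remaining depth budget becomes small. The abstract does not specify how crawler confidence is applied, so that policy is also an assumption.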