Web Intelligence, IEEE / WIC / ACM International Conference on (2006)
Hong Kong, China
Dec. 18, 2006 to Dec. 22, 2006
ISBN: 0-7695-2747-7
pp: 680-686
Carlos Castillo, Università di Roma "La Sapienza", Italy
Alberto Nelli, Università di Roma "La Sapienza", Italy
Alessandro Panconesi, Università di Roma "La Sapienza", Italy
ABSTRACT
Search engines rely on Web crawlers to create an index of the Web. Web crawlers explore the Web by downloading pages and finding links to new pages to be explored. At any given moment, a number of pages are waiting in the crawler queue to be downloaded. We study the growth of this queue of pending pages during a crawl of a large subset of the Web. In a normal breadth-first crawler, the queue quickly grows very large.

We present a strategy for managing the pending queue that reduces its maximum size by 50% while preserving the coverage and quality of the pages visited. This strategy can be applied to general-purpose Web crawlers as well as to topic-specific crawling, peer-to-peer search, on-demand Web crawling, and other environments in which memory usage has to be kept to a minimum.
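
To make the notion of a pending queue concrete, the following is a minimal sketch of a plain breadth-first crawl that records the peak queue size. It is not the authors' strategy, which the abstract does not detail. The link graph `toy_web`, the function name `bfs_crawl`, and the page labels are hypothetical, and an in-memory dictionary stands in for real page downloads.

```python
from collections import deque


def bfs_crawl(link_graph, seed):
    """Visit pages breadth-first.

    Returns the visit order and the largest size the pending queue reached.
    `link_graph` maps a page to the list of pages it links to (an assumed,
    in-memory stand-in for downloading and parsing real pages).
    """
    pending = deque([seed])   # pages discovered but not yet "downloaded"
    seen = {seed}             # pages already queued or visited
    order = []
    peak = len(pending)

    while pending:
        page = pending.popleft()
        order.append(page)
        for target in link_graph.get(page, []):
            if target not in seen:
                seen.add(target)
                pending.append(target)
        # Track the maximum number of pages waiting at any moment.
        peak = max(peak, len(pending))

    return order, peak


# Hypothetical seven-page Web used only for illustration.
toy_web = {
    "a": ["b", "c", "d"],
    "b": ["e", "f"],
    "c": ["f", "g"],
    "d": ["a"],
}

order, peak = bfs_crawl(toy_web, "a")
print(order, peak)
```

On this toy graph the queue peaks at more than half of all pages while the crawl is still near its start, which is the kind of growth the abstract refers to; on a real crawl the peak would depend on the link structure encountered.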
CITATION

A. Panconesi, A. Nelli and C. Castillo, "A Memory-Efficient Strategy for Exploring the Web," 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI), Hong Kong, 2006, pp. 680-686.
doi:10.1109/WI.2006.18