Web Congress, Joint Conference Brazilian Symposium on Multimedia and the Web & Latin America (2004)
Ribeirão Preto-SP, Brazil
Oct. 12, 2004 to Oct. 15, 2004
ISBN: 0-7695-2237-8
pp: 10-17
Ricardo Baeza-Yates, Universidad de Chile
Andrea Rodriguez, Universidad de Concepción
Carlos Castillo, Universidad de Chile
Mauricio Marin, Universidad de Magallanes
ABSTRACT
This article presents a comparative study of strategies for Web crawling. We show that a combination of breadth-first ordering with the largest sites first is a practical alternative: it is fast, simple to implement, and retrieves the best-ranked pages at a rate closer to the optimal than the alternatives. Our study was performed on a large sample of the Chilean Web. We crawled it using simulators, so that all strategies were compared under the same conditions, and ran actual crawls to validate our conclusions. We also explored the effects of large-scale parallelism in the page retrieval task and of multiple-page requests in a single connection for effective amortization of latency times.
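The ordering named in the abstract can be illustrated with a small simulation. The sketch below is one possible reading of "breadth-first ordering with the largest sites first", not the authors' implementation: it assumes that "largest" means the site with the most discovered but not yet fetched pages, and that pages within a site are fetched in discovery (FIFO) order. The names `crawl_order`, `site_of`, and `link_graph`, and the toy URLs, are hypothetical.

```python
from collections import deque


def crawl_order(link_graph, seeds, site_of):
    """Simulate a crawl: largest pending site first, FIFO within each site.

    link_graph: dict mapping a page to the pages it links to (assumed known,
                as in a simulator working on a pre-collected sample).
    seeds:      pages to start from.
    site_of:    function mapping a page to its site name.
    """
    queues = {}   # site -> deque of discovered pages not yet fetched
    seen = set()  # every page discovered so far
    order = []    # pages in the order they are "fetched"

    def enqueue(page):
        if page not in seen:
            seen.add(page)
            queues.setdefault(site_of(page), deque()).append(page)

    for page in seeds:
        enqueue(page)

    while queues:
        # Assumption: "largest" = most pending pages; ties broken by site name.
        site = min(queues, key=lambda s: (-len(queues[s]), s))
        page = queues[site].popleft()
        if not queues[site]:
            del queues[site]
        order.append(page)
        for target in link_graph.get(page, ()):
            enqueue(target)

    return order


# Toy link graph with two hypothetical sites.
link_graph = {
    "a.cl/": ["a.cl/1", "a.cl/2", "b.cl/"],
    "a.cl/1": ["a.cl/3"],
    "b.cl/": ["b.cl/1"],
}
order = crawl_order(link_graph, ["a.cl/"], lambda url: url.split("/", 1)[0])
print(order)
```

In this toy graph the scheduler exhausts the pending pages of `a.cl` before turning to `b.cl`, because `a.cl` has at least as many pending pages at every step. A real crawler would also need politeness delays and per-site connection limits, which this sketch leaves out.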
CITATION
Ricardo Baeza-Yates, Andrea Rodriguez, Carlos Castillo, Mauricio Marin, "Scheduling Algorithms for Web Crawling", Web Congress, Joint Conference Brazilian Symposium on Multimedia and the Web & Latin America, pp. 10-17, 2004, doi:10.1109/WEBMED.2004.1348139