WebMedia & LA-Web 2004 Joint Conference 10th Brazilian Symposium on Multimedia and the Web 2nd Latin American Web Congress
Scheduling Algorithms for Web Crawling
Ribeirão Preto-SP, Brazil
October 12-October 15
ISBN: 0-7695-2237-8
Carlos Castillo, Universidad de Chile
Mauricio Marin, Universidad de Magallanes
Andrea Rodriguez, Universidad de Concepción
Ricardo Baeza-Yates, Universidad de Chile
This article presents a comparative study of strategies for Web crawling. We show that a combination of breadth-first ordering with the largest sites first is a practical alternative since it is fast, simple to implement, and able to retrieve the best ranked pages at a rate that is closer to the optimal than other alternatives. Our study was performed on a large sample of the Chilean Web. We used simulators, so that all strategies were compared under the same conditions, and actual crawls to validate our conclusions. We also explored the effects of large-scale parallelism in the page retrieval task and of multiple-page requests in a single connection for effective amortization of latency times.
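As a rough illustration of the ordering described in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes that "largest site" means the host with the most discovered but unfetched pages, and that pages within a site are fetched in the order they were discovered (breadth-first). The function name `crawl_order`, the in-memory `links` graph, and the alphabetical tie-breaking rule are all hypothetical choices made for the example.

```python
from collections import defaultdict, deque
from urllib.parse import urlparse


def crawl_order(seeds, links, max_pages=None):
    """Return the order in which a toy scheduler would fetch pages.

    `links` maps a URL to the URLs it points to (a hypothetical in-memory
    Web graph standing in for real downloads).
    """
    pending = defaultdict(deque)  # site -> FIFO queue of unfetched URLs
    seen = set()                  # every URL ever enqueued

    def enqueue(url):
        if url not in seen:
            seen.add(url)
            pending[urlparse(url).netloc].append(url)

    for url in seeds:
        enqueue(url)

    order = []
    while pending:
        if max_pages is not None and len(order) >= max_pages:
            break
        # Largest site first: most pending pages wins; ties are broken
        # alphabetically by host name (an assumption, for determinism).
        site = min(pending, key=lambda s: (-len(pending[s]), s))
        # Breadth-first within the site: oldest discovered URL first.
        url = pending[site].popleft()
        if not pending[site]:
            del pending[site]
        order.append(url)
        for target in links.get(url, []):
            enqueue(target)
    return order


if __name__ == "__main__":
    toy_links = {
        "http://a.example/": [
            "http://a.example/1",
            "http://a.example/2",
            "http://b.example/",
        ],
        "http://a.example/1": ["http://a.example/3"],
        "http://b.example/": ["http://b.example/1"],
    }
    for fetched in crawl_order(["http://a.example/"], toy_links):
        print(fetched)
```

In this toy graph the scheduler exhausts the pending pages of `a.example` before moving to `b.example`, because `a.example` always has at least as many pages waiting. A real crawler would also need politeness delays and per-host connection limits, which this sketch leaves out.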
Citation:
Carlos Castillo, Mauricio Marin, Andrea Rodriguez, Ricardo Baeza-Yates, "Scheduling Algorithms for Web Crawling," la-webmedia, pp.10-17, WebMedia & LA-Web 2004 Joint Conference 10th Brazilian Symposium on Multimedia and the Web 2nd Latin American Web Congress, 2004