Issue No. 06 - November/December (2004 vol. 19)
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/MIS.2004.62
Ismail Sengör Altingövde , Bilkent University
Özgür Ulusoy , Bilkent University
A focused crawler is an agent that concentrates on a particular target topic and tries to visit and gather only relevant pages from the Web. A crucial issue for a focused crawler is the underlying heuristic for deciding the page to visit next. The authors propose a rule-based approach to improve a baseline focused crawler's harvest rate and coverage. The baseline focused crawler employs a canonical topic taxonomy to train a naïve Bayesian classifier, which then helps score unseen URLs. The authors explore using simple rules derived from interclass (topic) linkage patterns to decide the crawler's next move. The rule-based approach also enhances the baseline crawler in supporting tunneling. In initial performance results, the rule-based crawler improved the harvest rate and coverage of the baseline crawler.
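The idea of scoring URLs with interclass linkage rules can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`learn_rules`, `rule_score`), the `Frontier` class, the topic labels, and the scoring formula (the parent page's class probabilities weighted by the estimated probability that a page of each class links to the target class) are all assumptions made for illustration. The class probabilities of a page are taken as given here; in the described system they would come from the trained classifier.

```python
import heapq
from collections import defaultdict


def learn_rules(labeled_links):
    """Estimate P(child class | parent class) from (parent_class, child_class) pairs.

    Assumption: each pair is one hyperlink between two already-classified
    training pages. Each resulting entry plays the role of one interclass rule.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for parent_cls, child_cls in labeled_links:
        counts[parent_cls][child_cls] += 1
    rules = {}
    for parent_cls, children in counts.items():
        total = sum(children.values())
        rules[parent_cls] = {c: n / total for c, n in children.items()}
    return rules


def rule_score(page_posterior, rules, target):
    """Score the out-links of a page (hypothetical formula).

    page_posterior maps class name -> probability for the page holding the links.
    The score is the expected probability that a link leads to the target class.
    """
    return sum(
        prob * rules.get(cls, {}).get(target, 0.0)
        for cls, prob in page_posterior.items()
    )


class Frontier:
    """Priority queue of unvisited URLs; the highest score is popped first."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def push(self, url, score):
        if url not in self._seen:  # ignore URLs that were already queued
            self._seen.add(url)
            heapq.heappush(self._heap, (-score, url))  # negate: heapq is a min-heap

    def pop(self):
        neg_score, url = heapq.heappop(self._heap)
        return url, -neg_score


# Toy training links between hypothetical topics.
links = [
    ("sports", "sports"), ("sports", "news"),
    ("news", "sports"), ("news", "news"), ("news", "news"), ("news", "politics"),
]
rules = learn_rules(links)

frontier = Frontier()
# A page classified purely as "news" is off-topic for the target "sports",
# yet its links still receive a nonzero score.
frontier.push("http://example.org/a", rule_score({"news": 1.0}, rules, "sports"))
frontier.push("http://example.org/b",
              rule_score({"sports": 0.5, "news": 0.5}, rules, "sports"))
next_url, next_score = frontier.pop()
```

In this sketch, links found on an off-topic page keep a nonzero priority whenever pages of that class tend to link to the target class, which is one simple way to picture how rules could let a crawler tunnel through irrelevant pages instead of discarding them.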
focused Web crawling, tunneling, rule extraction, Web mining, naïve Bayesian classification
I. S. Altingövde and Ö. Ulusoy, "Exploiting Interclass Rules for Focused Crawling," in IEEE Intelligent Systems, vol. 19, no. 6, pp. 66-73, 2004.