2014 28th International Conference on Advanced Information Networking and Applications Workshops (WAINA) (2014)
May 13, 2014 to May 16, 2014
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/WAINA.2014.56
National web archives, which preserve national knowledge for generations to come, have been built successfully with domain-specific web crawlers for years. However, that kind of crawler still misses many foreign-language web pages that are also related to the nation. In this paper, we propose a new crawling approach to collect nation-related web pages written in a foreign language, specifically English web pages that relate to Thailand. We introduce the notion of a website segment, which groups related web pages that share the same longest directory path. Rather than evaluating individual target web pages, as many traditional focused crawling approaches do, we train an ensemble classifier with several features to predict the relevancy of website segments. The most relevant website segments in the crawling frontier are then enqueued for download. Preliminary experiments on the real web show that this approach yields promising results, with better harvest results than the Breadth-First and Best-First baselines for the Thai-tourism and Thai-estate topics.
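The website-segment idea can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that a segment is identified by the host plus the longest directory path of a URL (everything up to the last slash), and that segments in the frontier are ordered by a relevancy score. The names `segment_key`, `group_by_segment`, and `order_segments` are hypothetical, and the `score` argument is a stand-in for the ensemble classifier, whose features are not given in this abstract.

```python
import heapq
from collections import defaultdict
from urllib.parse import urlsplit


def segment_key(url):
    """Return host + longest directory path (assumed definition of a segment)."""
    parts = urlsplit(url)
    path = parts.path or "/"
    # Keep everything up to and including the last slash; drop the file name.
    directory = path[: path.rfind("/") + 1]
    return parts.netloc.lower() + directory


def group_by_segment(urls):
    """Group URLs that share the same segment key."""
    segments = defaultdict(list)
    for url in urls:
        segments[segment_key(url)].append(url)
    return dict(segments)


def order_segments(segments, score):
    """Yield segment keys from highest to lowest score.

    `score(key, pages)` is a placeholder for a trained relevancy predictor.
    """
    # heapq is a min-heap, so scores are negated to pop the best segment first.
    heap = [(-score(key, pages), key) for key, pages in segments.items()]
    heapq.heapify(heap)
    while heap:
        _, key = heapq.heappop(heap)
        yield key
```

For example, `http://example.com/thailand/hotels/a.html` and `http://example.com/thailand/hotels/b.html` fall into the same segment `example.com/thailand/hotels/`, so a crawler following this sketch would decide once per segment whether to download its pages, not once per page.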
website segment, topic-specific web crawler, language-specific web crawler, focused web crawler
A. Rungsawang, T. Suebchua and B. Manaskasemsak, "Thai Related Foreign Language-Specific Website Segment Crawler," 2014 28th International Conference on Advanced Information Networking and Applications Workshops (WAINA), BC, Canada, 2014, pp. 293-298.