2011 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies
Babouk: Focused Web Crawling for Corpus Compilation and Automatic Terminology Extraction
Lyon, France
August 22-27, 2011
ISBN: 978-0-7695-4513-4
The use of the World Wide Web as a free source for large linguistic resources is a well-established idea. Such resources are keystones to domains such as lexicon-based categorization, information retrieval, machine translation and information extraction. In this paper, we present an industrial focused web crawler for the automatic compilation of specialized corpora from the web. This application, created within the framework of the TTC project, is used daily by several linguists to bootstrap large thematic corpora which are then used to automatically generate bilingual terminologies.
Index Terms:
focused crawling, web-as-corpus, resources bootstrapping
Citation:
Clément de Groc, "Babouk: Focused Web Crawling for Corpus Compilation and Automatic Terminology Extraction," wi-iat, vol. 1, pp. 497-498, 2011 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies, 2011