International Conference on Computational Intelligence for Modelling Control and Automation and International Conference on Intelligent Agents Web Technologies and International Commerce (CIMCA'06)
Identification of Document Language is Not yet a Completely Solved Problem
Sydney, Australia
November 28-December 01
ISBN: 0-7695-2731-0
Joaquim Ferreira da Silva, Universidade Nova de Lisboa, Portugal
Gabriel Pereira Lopes, Universidade Nova de Lisboa, Portugal
Existing Language Identification (LID) approaches reach 100% precision in most common situations, provided the documents are written in a single language and are large enough. However, they do not provide a reliable solution in some situations. One is the need to discriminate the correct variant of the language used in a text, for example European or Brazilian Portuguese, UK or US English, or variants of any other language. Another hard context occurs with small tourist advertisements on the web, which address foreigners but use the local language to name most local entities. In this paper, we present a fully statistics-based LID approach which learns the most discriminant information according to each context, and identifies the correct language or language variant a text is written in. This methodology is shown to be correct for normal texts and maintains its robustness in hard LID contexts.
Citation:
Joaquim Ferreira da Silva, Gabriel Pereira Lopes, "Identification of Document Language is Not yet a Completely Solved Problem," cimca, pp. 212, International Conference on Computational Intelligence for Modelling Control and Automation and International Conference on Intelligent Agents Web Technologies and International Commerce (CIMCA'06), 2006