2008 Second Asia International Conference on Modelling & Simulation
Language Identifications of Arabic Script Web Documents Using Independent Component Analysis
May 13-May 15
ISBN: 978-0-7695-3136-6
We analyze language identification algorithms for Arabic script web documents in languages such as Arabic, Jawi, Persian, and Urdu, using independent component analysis (ICA). We combine an entropy term weighting scheme with class profile based feature (CPBF) vectors as the feature selection method for choosing the best features of Arabic script web documents for web page language identification. The selected features are then used as input, based on the identification of latent semantics of user profiles using singular value decomposition (SVD). The SVD is used to remove noise from the retrieved documents before the ICA is applied for topic extraction. We assume that the topics of the documents are independent of one another. We use the information retrieval measures precision, recall, and F1 to evaluate the effectiveness of the proposed algorithm. In our experiments, the proposed method leads to good Arabic script language identification results, with the ICA giving good separation of the Arabic, Persian, and Urdu languages.
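The evaluation measures named above can be illustrated with a minimal sketch. It assumes the standard per-class definitions of precision, recall, and F1; the function name `per_language_scores` and the toy labels are hypothetical and do not come from the paper's data or results.

```python
from collections import Counter


def per_language_scores(true_labels, predicted_labels):
    """Return {language: (precision, recall, f1)} using standard definitions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(true_labels, predicted_labels):
        if truth == guess:
            tp[truth] += 1          # correctly identified document
        else:
            fp[guess] += 1          # wrongly assigned to `guess`
            fn[truth] += 1          # missed for the true language
    scores = {}
    for lang in set(true_labels) | set(predicted_labels):
        p_den = tp[lang] + fp[lang]
        r_den = tp[lang] + fn[lang]
        precision = tp[lang] / p_den if p_den else 0.0
        recall = tp[lang] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        scores[lang] = (precision, recall, f1)
    return scores


# Hypothetical toy labels, for illustration only (not the paper's data).
truth = ["Arabic", "Arabic", "Persian", "Persian", "Urdu", "Urdu"]
guess = ["Arabic", "Persian", "Persian", "Persian", "Urdu", "Arabic"]
scores = per_language_scores(truth, guess)
for lang in sorted(scores):
    p, r, f = scores[lang]
    print(f"{lang}: precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```

Computing the scores per language, instead of as a single accuracy figure, shows which pairs of languages are confused with each other.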
Index Terms:
ICA, language identifications, web documents, class profile based features
Citation:
Ali Selamat, Zhi-Sam Lee, "Language Identifications of Arabic Script Web Documents Using Independent Component Analysis," ams, pp.427-432, 2008 Second Asia International Conference on Modelling & Simulation, 2008