9th International Database Engineering & Application Symposium (IDEAS'05)
Categorizing and Extracting Information from Multilingual HTML Documents
Montreal, Canada
July 25-July 27
ISBN: 0-7695-2404-4
SeungJin Lim, Utah State University
Yiu-Kai Ng, Brigham Young University
The amount of online information written in different natural languages and the number of non-English-speaking Internet users have increased tremendously during the past decade. To provide high-performance access to multilingual information on the Internet, we have developed a data analysis and querying system (DatAQs) that (i) analyzes, identifies, and categorizes the languages used in HTML documents, (ii) extracts information from HTML documents of interest written in different languages, (iii) allows the user to submit queries for retrieving extracted information in the same natural language provided by the query engine of DatAQs using a menu-driven user interface, and (iv) processes the user's queries (as Boolean expressions) to generate the results. DatAQs extracts information from HTML documents that belong to various data-rich, narrow-in-breadth application domains, such as car ads, house rentals, job ads, stocks, and university catalogs. The average F-measure for correctly identifying HTML documents written in a particular natural language is 89%, whereas the F-measure for categorizing HTML documents belonging to the car-ads application domain is 94%.
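The abstract reports its results as F-measures without defining the metric. As a minimal sketch, assuming the standard balanced F-measure (the harmonic mean of precision and recall), the computation looks like the following; the function name and the count-based inputs are illustrative and are not taken from the paper.

```python
def f_measure(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall.

    Assumption: the paper uses the standard balanced form; the abstract
    does not state the exact definition.
    """
    predicted = true_positives + false_positives  # documents the system labeled positive
    actual = true_positives + false_negatives     # documents that are truly positive
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, a categorizer that labels 10 documents as car ads, 9 of them correctly, while missing 1 real car ad, has precision and recall of 0.9 each, so `f_measure(9, 1, 1)` is also 0.9.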
Citation:
SeungJin Lim, Yiu-Kai Ng, "Categorizing and Extracting Information from Multilingual HTML Documents," ideas, pp.415-422, 9th International Database Engineering & Application Symposium (IDEAS'05), 2005