9th International Database Engineering & Application Symposium (IDEAS'05)
Categorizing and Extracting Information from Multilingual HTML Documents
Montreal, Canada
July 25-July 27
ISBN: 0-7695-2404-4
The amount of online information written in different natural languages and the number of non-English speaking Internet users have increased tremendously over the past decade. In order to provide high-performance access to multilingual information on the Internet, we have developed a data analysis and querying system (DatAQs) that (i) analyzes, identifies, and categorizes languages used in HTML documents, (ii) extracts information from HTML documents of interest written in different languages, (iii) allows the user to submit queries, through a menu-driven user interface, for retrieving extracted information in the same natural language provided by the query engine of DatAQs, and (iv) processes the user's queries (as Boolean expressions) to generate the results. DatAQs extracts information from HTML documents that belong to various data-rich, narrow-in-breadth application domains, such as car ads, house rentals, job ads, stocks, university catalogs, etc. The average F-measure on correctly identifying HTML documents written in a particular natural language is 89%, whereas the F-measure on categorizing HTML documents belonging to the car-ads application domain is 94%.
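For readers unfamiliar with the metric, the following is a minimal sketch of how an F-measure is commonly computed. It assumes the balanced form (the harmonic mean of precision and recall); the abstract does not state which variant the authors used. The function name `f_measure` and the counts in the usage line are hypothetical and are not taken from the paper.

```python
def f_measure(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Balanced F-measure: harmonic mean of precision and recall.

    Assumption: the balanced (F1) variant; the abstract does not specify one.
    """
    if true_positives == 0:
        # No correct positives: precision and recall are 0 (or undefined),
        # so the F-measure is reported as 0 by convention.
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for illustration only (not results from the paper):
# 90 documents correctly labeled, 10 wrongly labeled, 12 missed.
score = f_measure(true_positives=90, false_positives=10, false_negatives=12)
print(f"F-measure: {score:.2%}")
```

A single score of this kind summarizes both how many of the documents assigned to a language or domain were assigned correctly (precision) and how many of the documents that truly belong there were found (recall).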
Citation:
SeungJin Lim, Yiu-Kai Ng, "Categorizing and Extracting Information from Multilingual HTML Documents," 9th International Database Engineering & Application Symposium (IDEAS'05), pp. 415-422, 2005