2008 Eighth IEEE International Conference on Data Mining
xCrawl: A High-Recall Crawling Method for Web Mining
December 15-December 19
ISBN: 978-0-7695-3502-9
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/ICDM.2008.121
Web Mining Systems exploit the redundancy of data published on the Web to automatically extract information from existing web documents. The first step in the Information Extraction process is thus to locate within a limited period of time as many web pages as possible that contain relevant information, a task which is commonly accomplished by applying focused crawling techniques. The performance of such a crawler can be measured by its "recall", i.e., the percentage of existing relevant documents that are found and identified as relevant. A higher recall value implies that more redundant data is available, which in turn leads to better results in the subsequent fact extraction phase. In this paper, we propose xCrawl, a new focused crawling method which outperforms state-of-the-art approaches with respect to recall values achievable within a given period of time. This method is based on a new combination of ideas and techniques used to identify and exploit navigational structures of websites, such as hierarchies, lists or maps. In addition, automatic query generation is applied to rapidly collect web sources containing target documents. The proposed crawling technique was inspired by the requirements of a Web Mining System developed to extract product and service descriptions and was evaluated in different application scenarios. Comparisons with existing focused crawling techniques reveal that the new crawling method leads to a significant increase in recall whilst maintaining precision.
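As an illustration of the recall and precision measures mentioned in the abstract, here is a minimal sketch. It assumes the full set of relevant documents is known in advance, which is rarely the case outside a controlled evaluation. The function names and the sample URLs are hypothetical and do not come from the paper.

```python
def recall(retrieved, relevant):
    """Fraction of the existing relevant documents that the crawler retrieved."""
    relevant = set(relevant)
    if not relevant:
        return 0.0  # convention chosen here for an empty ground truth
    return len(set(retrieved) & relevant) / len(relevant)


def precision(retrieved, relevant):
    """Fraction of the retrieved documents that are actually relevant."""
    retrieved = set(retrieved)
    if not retrieved:
        return 0.0  # convention chosen here for an empty crawl result
    return len(retrieved & set(relevant)) / len(retrieved)


# Hypothetical ground truth: four relevant pages exist on the Web.
relevant_pages = {
    "a.example/p1",
    "a.example/p2",
    "b.example/p3",
    "c.example/p4",
}

# Hypothetical crawl result: two relevant pages and one irrelevant page.
crawled_pages = {"a.example/p1", "b.example/p3", "d.example/other"}

crawl_recall = recall(crawled_pages, relevant_pages)        # 2 of 4 relevant pages found
crawl_precision = precision(crawled_pages, relevant_pages)  # 2 of 3 crawled pages relevant
```

In this toy case the crawler misses half of the relevant pages, so a later extraction step would see less redundant data, which is the situation a high-recall crawler aims to avoid.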
Index Terms:
Web mining, focused crawling, authoritative sources
Citation:
Kostyantyn Shchekotykhin, Dietmar Jannach, Gerhard Friedrich, "xCrawl: A High-Recall Crawling Method for Web Mining," Data Mining, IEEE International Conference on, pp. 550-559, 2008 Eighth IEEE International Conference on Data Mining, 2008.
