Parallel Architectures, Algorithms and Programming, International Symposium on (2011)
Tianjin, China
Dec. 9, 2011 to Dec. 11, 2011
ISBN: 978-0-7695-4575-2
pp: 330-333
ABSTRACT
We present block-level links based content extraction (BLCE), a method that extracts content from web pages by using the link attributes of blocks, namely the number of links and the length of the link text (anchor text). We describe how to divide a web page into blocks and how to merge similar blocks into one, and then how to compute the number of links and the total length of anchor text for each block. We find that extracting content with only the raw number of links and length of anchor text is not effective, because both quantities grow in proportion to the length of the page. Link density is a good way to correct for this, so we use the content links ratio and the content anchor text ratio to describe the link attributes of a block. BLCE performs better than other methods, especially on newer web pages laid out with DIV and CSS, where traditional algorithms do not work well.
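The idea of scoring blocks by link ratios rather than raw link counts can be illustrated with a minimal sketch. The abstract does not define the two ratios or the block segmentation and merging steps, so everything below is an assumption made for illustration: each `<div>` is treated as a block, `anchor_text_ratio` is taken to be anchor-text characters over all text characters in the block, `link_density` is links per text character, and the `max_anchor_ratio` threshold is hypothetical. Block merging is not covered. This is not the authors' implementation.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class Block:
    text_len: int = 0    # characters of visible text directly in the block
    link_count: int = 0  # number of <a> elements opened in the block
    anchor_len: int = 0  # characters of text that sit inside <a> elements

    @property
    def anchor_text_ratio(self) -> float:
        # Assumed definition: share of the block's text that is anchor text.
        return self.anchor_len / self.text_len if self.text_len else 0.0

    @property
    def link_density(self) -> float:
        # Assumed definition: links per character of block text.
        return self.link_count / self.text_len if self.text_len else 0.0


class BlockParser(HTMLParser):
    """Collects per-<div> link statistics; text goes to the innermost <div>."""

    def __init__(self):
        super().__init__()
        self.stack = [Block()]  # root block for text outside any <div>
        self.blocks = []        # finished blocks, in closing order
        self.anchor_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.stack.append(Block())
        elif tag == "a":
            self.anchor_depth += 1
            self.stack[-1].link_count += 1

    def handle_endtag(self, tag):
        if tag == "div" and len(self.stack) > 1:
            block = self.stack.pop()
            if block.text_len:  # keep only blocks that contain text
                self.blocks.append(block)
        elif tag == "a" and self.anchor_depth:
            self.anchor_depth -= 1

    def handle_data(self, data):
        n = len(data.strip())
        if n == 0:
            return
        self.stack[-1].text_len += n
        if self.anchor_depth:
            self.stack[-1].anchor_len += n


def content_blocks(html: str, max_anchor_ratio: float = 0.5) -> list:
    """Return blocks whose anchor text ratio is at most the (hypothetical) threshold."""
    parser = BlockParser()
    parser.feed(html)
    parser.close()
    return [b for b in parser.blocks if b.anchor_text_ratio <= max_anchor_ratio]
```

Under these assumptions, a navigation block made up entirely of links has an anchor text ratio of 1.0 and is filtered out, while an article paragraph with one inline link keeps a low ratio regardless of how long the page is.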
INDEX TERMS
block-level links, merge block, content extraction
CITATION
Shixing Shen, Hui Zhang, "Block-Level Links Based Content Extraction", Parallel Architectures, Algorithms and Programming, International Symposium on, pp. 330-333, 2011, doi:10.1109/PAAP.2011.49