Parallel Architectures, Algorithms and Programming, International Symposium on (2011)
Dec. 9, 2011 to Dec. 11, 2011
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/PAAP.2011.49
We present block-level links based content extraction (BLCE), a method for extracting content from web pages using the link attributes of blocks: the number of links and the length of the link text (anchor text). We describe how to divide a web page into blocks and how to merge similar blocks into one, and then compute the number of links and the total length of anchor text. We find that extracting content using only the raw number of links and the length of anchor text is not effective, because both are proportional to the length of the page. Link density addresses this problem, so we use content link ratios and content anchor text ratios to describe the link attributes of blocks. BLCE performs better than other methods, especially on newer web pages built with DIV and CSS, where traditional algorithms do not work well.
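The abstract does not give the exact ratio definitions, so the following Python sketch is only an illustration of the general idea of normalizing link attributes by block length. It assumes a block is already available as an HTML fragment, and it assumes two hypothetical measures: links per character of text and anchor text length divided by total text length. The names `BlockLinkStats`, `link_ratios`, `looks_like_content`, and the 0.5 threshold are invented for this example and are not taken from the paper.

```python
from html.parser import HTMLParser


class BlockLinkStats(HTMLParser):
    """Collects text length, link count, and anchor-text length for one block."""

    def __init__(self):
        super().__init__()
        self.link_count = 0    # number of <a> start tags seen
        self.anchor_len = 0    # characters of text inside <a>...</a>
        self.text_len = 0      # characters of all text in the block
        self._in_anchor = 0    # nesting depth of open <a> tags

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_count += 1
            self._in_anchor += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._in_anchor:
            self._in_anchor -= 1

    def handle_data(self, data):
        n = len(data.strip())  # ignore surrounding whitespace
        self.text_len += n
        if self._in_anchor:
            self.anchor_len += n


def link_ratios(block_html):
    """Return length-normalized link measures for one block (assumed definitions)."""
    parser = BlockLinkStats()
    parser.feed(block_html)
    parser.close()
    text_len = max(parser.text_len, 1)  # avoid division by zero on empty blocks
    return {
        "links_per_char": parser.link_count / text_len,
        "anchor_text_ratio": parser.anchor_len / text_len,
    }


def looks_like_content(block_html, max_anchor_ratio=0.5):
    """Hypothetical rule: a block is content if little of its text is anchor text."""
    return link_ratios(block_html)["anchor_text_ratio"] <= max_anchor_ratio


nav_block = '<div><a href="/a">Home</a> <a href="/b">News</a></div>'
article_block = "<div><p>" + "word " * 20 + '<a href="/x">more</a></p></div>'

print(link_ratios(nav_block))
print(link_ratios(article_block))
```

In this toy input, the navigation block consists entirely of anchor text while the article block is mostly plain text, so a ratio separates them even though raw link counts alone would depend on how long each block is. Block segmentation and merging, which the paper also covers, are not modeled here.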
block-level links, merge block, content extraction
S. Shen and H. Zhang, "Block-Level Links Based Content Extraction," Parallel Architectures, Algorithms and Programming, International Symposium on (PAAP), Tianjin, China, 2011, pp. 330-333.