Web-Age Information Management, International Conference on (2008)
July 20, 2008 to July 22, 2008
ISBN: 978-0-7695-3185-4
pp: 31-36
With the rapid development of Internet technology, structured data are increasingly prevalent on the Web. Moreover, most web sites organize their data systematically, so relevant data may be spread across different pages that are connected through hyperlinks. However, existing web search engines cannot integrate information from multiple interrelated pages to answer keyword queries meaningfully. Next-generation web search engines require link-awareness or, more generally, the capability to integrate correlated information items that are linked through hyperlinks. In this paper, we study the problem of identifying the "information unit" of relevant pages that contain all the input keywords as the answer. We model a set of closely related web pages as a tree, where the nodes are web pages and the edges are the links between them. We retrieve the "information unit" of the most related and connected subtrees, instead of a single web page, as the answer. To improve search efficiency, we propose an effective algorithm based on the lowest common ancestor (LCA) to identify the subtrees that are most related to the given input keywords. We have conducted extensive experiments on the proposed algorithm. The experimental results show that our method achieves high search performance and significantly outperforms existing alternative methods.
Keyword Search, Information Unit, LCA
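
The idea of returning a connected subtree of pages rather than a single page can be illustrated with a minimal sketch. This is not the algorithm proposed in the paper, whose details the abstract does not give. It is a simplified smallest-LCA-style traversal under stated assumptions: pages form a rooted tree, keyword matching is naive whitespace tokenization, and the names Page, url, text, children, and find_information_units are hypothetical, introduced only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Page:
    """A web page node in the page tree (hypothetical structure)."""
    url: str
    text: str
    children: list[Page] = field(default_factory=list)


def find_information_units(root: Page, keywords: list[str]) -> list[Page]:
    """Return roots of the smallest subtrees whose pages jointly contain every keyword."""
    wanted = {k.lower() for k in keywords}
    answers: list[Page] = []

    def visit(page: Page) -> tuple[set[str], bool]:
        # Keywords matched by this page alone (assumption: whitespace tokens).
        covered = wanted & set(page.text.lower().split())
        answered_below = False
        for child in page.children:
            child_covered, child_answered = visit(child)
            covered |= child_covered
            answered_below = answered_below or child_answered
        # Report this page only if no descendant subtree is already a full answer.
        if wanted and covered == wanted and not answered_below:
            answers.append(page)
            return covered, True
        return covered, answered_below

    visit(root)
    return answers


# A made-up site: the keywords are split between a person page and its linked page.
pubs = Page("/people/smith/pubs", "publications xml keyword search")
smith = Page("/people/smith", "alice smith professor", [pubs])
people = Page("/people", "faculty directory", [smith])
news = Page("/news", "xml workshop announced")
home = Page("/", "university department home", [people, news])

units = find_information_units(home, ["Smith", "XML"])
print([p.url for p in units])
```

In this made-up site no single page contains both keywords, so the sketch reports the subtree rooted at /people/smith, which joins the person page with its linked publications page. Higher ancestors such as /people and / also cover both keywords but are skipped because a smaller answer already lies beneath them.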

