Web Intelligence and Intelligent Agent Technology, IEEE/WIC/ACM International Conference on (2009)
Sept. 15, 2009 to Sept. 18, 2009
The key to Deep Web crawling is to submit promising keywords to a query form and retrieve Deep Web content efficiently. To select keywords, existing methods rely on statistics derived from the term frequency (TF) and document frequency (DF) of keywords in the locally acquired records. They therefore work well only on textual databases that provide full-text search interfaces, and poorly on structured databases with multi-attribute or field-restricted search interfaces. This paper proposes a novel Deep Web crawling method. Each keyword is encoded as a tuple of its linguistic, statistical, and HTML features, so that a harvest rate evaluation model can be learned from the keywords already issued and applied to keywords not yet issued. The method removes the plain-text search assumption made by existing methods. Experimental results show that the method outperforms state-of-the-art methods.
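The idea of encoding keywords as feature tuples and learning a harvest rate model from issued keywords can be illustrated with a minimal sketch. The abstract does not specify the concrete features or the learning algorithm, so everything here is an assumption made for illustration: the function names (`encode`, `predict_harvest_rate`, `select_next`), the stand-in features (word length, TF and DF ratios, an `option`-tag flag), and the use of k-nearest-neighbor regression in place of the paper's actual model.

```python
from math import dist


def encode(keyword, records, html_tags):
    """Encode a keyword as a tuple of linguistic, statistical, and HTML features.

    All four features are hypothetical stand-ins, not the paper's feature set.
    """
    tokenized = [r.lower().split() for r in records]
    tf = sum(tokens.count(keyword) for tokens in tokenized)  # total occurrences
    df = sum(keyword in tokens for tokens in tokenized)      # records containing it
    n = len(records) or 1
    return (
        float(len(keyword)),  # linguistic stand-in: word length
        tf / n,               # statistical: occurrences per acquired record
        df / n,               # statistical: fraction of records containing it
        1.0 if html_tags.get(keyword) == "option" else 0.0,  # HTML stand-in
    )


def predict_harvest_rate(features, issued, k=2):
    """Estimate a harvest rate as the mean rate of the k nearest issued keywords.

    `issued` is a list of (feature_tuple, observed_harvest_rate) pairs.
    k-NN regression is an assumed substitute for the paper's learned model.
    """
    nearest = sorted(issued, key=lambda item: dist(item[0], features))[:k]
    return sum(rate for _, rate in nearest) / len(nearest)


def select_next(candidates, issued, records, html_tags, k=2):
    """Pick the un-issued keyword with the highest predicted harvest rate."""
    return max(
        candidates,
        key=lambda kw: predict_harvest_rate(encode(kw, records, html_tags), issued, k),
    )


if __name__ == "__main__":
    records = ["deep web crawling", "web data mining", "deep learning"]
    issued = [
        (encode("deep", records, {}), 0.8),      # illustrative observed rates
        (encode("learning", records, {}), 0.1),
    ]
    print(select_next(["web", "mining"], issued, records, {}, k=1))
```

In this sketch a crawler would append each newly issued keyword and its observed harvest rate to `issued`, so later predictions draw on more evidence. Because the features do not depend on full-text matching alone, the same loop applies in principle to field-restricted interfaces, which is the setting the abstract targets.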
Hidden Web, Deep Web surfacing, machine learning
Q. Zheng, L. Jiang, J. Liu and Z. Wu, "Learning Deep Web Crawling with Diverse Features," 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), Milan, Italy, 2009, pp. 572-575.