Fourth International Conference on Web Information Systems Engineering (WISE'03)
Crawling for Domain-Specific Hidden Web Resources
Roma, Italy
December 10-December 12
ISBN: 0-7695-1999-7
André Bergholz, Xerox Research Centre Europe
Boris Chidlovskii, Xerox Research Centre Europe
The Hidden Web, the part of the Web that remains unavailable to standard crawlers, has become an important research topic in recent years. Its size is estimated to be 400 to 500 times larger than that of the Publicly Indexable Web (PIW). Furthermore, the information on the Hidden Web is assumed to be more structured, because it is usually stored in databases. In this paper we describe a crawler that, starting from the PIW, finds entry points into the Hidden Web. The crawler is domain-specific and is initialized with pre-classified documents and relevant keywords. We describe our approach to the automatic identification of Hidden Web resources among the HTML forms encountered. We conduct a series of experiments using the top-level categories in the Google Directory and report our analysis of the discovered Hidden Web resources.
Citation:
André Bergholz, Boris Chidlovskii, "Crawling for Domain-Specific Hidden Web Resources," wise, pp.125, Fourth International Conference on Web Information Systems Engineering (WISE'03), 2003