Link Contexts in Classifier-Guided Topical Crawlers
January 2006 (vol. 18 no. 1)
pp. 107-122
The context of a hyperlink, or link context, is defined as the terms that appear in the text around a hyperlink within a Web page. Link contexts have been applied to a variety of Web information retrieval and categorization tasks. Topical or focused Web crawlers have a special reliance on link contexts. These crawlers automatically navigate the hyperlinked structure of the Web while using link contexts to predict the benefit of following the corresponding hyperlinks with respect to some initiating topic or theme. Using topical crawlers that are guided by a Support Vector Machine, we investigate the effects of various definitions of link contexts on the crawling performance. We find that a crawler that exploits words both in the immediate vicinity of a hyperlink and in the entire parent page performs significantly better than a crawler that depends on just one of those cues. Also, we find that a crawler that uses the tag tree hierarchy within Web pages provides effective coverage. We analyze our results along various dimensions such as link context quality, topic difficulty, length of crawl, training data, and topic domain. The study was done using multiple crawls over 100 topics covering millions of pages, allowing us to derive statistically strong results.
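
As an illustration of the definition above, the following is a minimal sketch of one possible link context: a fixed window of words around each hyperlink, returned alongside the words of the whole parent page. It is not the implementation used in the study. The names `link_contexts` and `_LinkTokenizer`, the `window` parameter, and the simple lowercase tokenizer are all assumptions made for this example, and the classifier that would score these contexts is left out.

```python
from html.parser import HTMLParser
import re


class _LinkTokenizer(HTMLParser):
    """Collects page words in document order and records where each <a href> sits.

    Hypothetical helper for this sketch. It also tokenizes script and style
    text, which a real crawler would likely want to skip.
    """

    def __init__(self):
        super().__init__()
        self.tokens = []   # every word seen on the page, in order
        self.links = []    # (href, start_index, end_index) for each hyperlink
        self._open = None  # (href, start_index) of the anchor currently open

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._open = (href, len(self.tokens))

    def handle_endtag(self, tag):
        if tag == "a" and self._open is not None:
            href, start = self._open
            self.links.append((href, start, len(self.tokens)))
            self._open = None

    def handle_data(self, data):
        # Assumed tokenizer: lowercase runs of letters and digits.
        self.tokens.extend(re.findall(r"[a-z0-9]+", data.lower()))


def link_contexts(html, window=5):
    """Return ([(href, context_words), ...], page_words).

    context_words holds the anchor text plus up to `window` words on each
    side of it; page_words holds every word of the parent page.
    """
    parser = _LinkTokenizer()
    parser.feed(html)
    parser.close()
    contexts = []
    for href, start, end in parser.links:
        lo = max(0, start - window)
        hi = min(len(parser.tokens), end + window)
        contexts.append((href, parser.tokens[lo:hi]))
    return contexts, parser.tokens
```

A crawler in the spirit of the abstract could turn both outputs into features, the window words and the parent-page words, and pass them to a trained classifier to prioritize unvisited links. How the two cues are best combined is the question the paper studies and is not settled by this sketch.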

Index Terms: Web search, Web mining, performance evaluation.
Gautam Pant, Padmini Srinivasan, "Link Contexts in Classifier-Guided Topical Crawlers," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 1, pp. 107-122, Jan. 2006, doi:10.1109/TKDE.2006.12