A Parameterized Approach to Spam-Resilient Link Analysis of the Web
October 2009 (vol. 20 no. 10)
pp. 1422-1438
James Caverlee, Texas A&M University, College Station
Steve Webb, Purewire, Atlanta
Ling Liu, Georgia Institute of Technology, Atlanta
William B. Rouse, Georgia Institute of Technology, Atlanta
Link-based analysis of the Web provides the basis for many important applications, such as Web search, Web-based data mining, and Web page categorization, that bring order to the massive amount of distributed Web content. Because these applications are so heavily relied upon, efforts to manipulate (or spam) the link structure of the Web are on the rise. In this manuscript, we present a parameterized framework for link analysis of the Web that promotes spam resilience through a source-centric view of the Web. We provide a rigorous study of the critical parameters that can impact source-centric link analysis and propose the novel notion of influence throttling for countering the influence of link-based manipulation. Through formal analysis and a large-scale experimental study, we show how different parameter settings may impact the time complexity, stability, and spam resilience of Web link analysis. Concretely, we find that the source-centric model supports more effective and robust rankings than existing Web algorithms such as PageRank.
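
The abstract names a source-centric view and influence throttling without defining either, so the following is only an illustrative sketch under stated assumptions, not the authors' algorithm. It assumes that a "source" is a URL's host, that source-level edge weights are page-link counts, and that throttling forces a source to keep at least a fraction kappa of its rank on a self-edge, rescaling its other edges to fit. All function names and parameters (`source_of`, `throttle`, `rank_sources`, `kappa`, `alpha`) are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse


def source_of(url):
    """Assumption: a page's source is the host part of its URL."""
    return urlparse(url).netloc


def throttle(row, src, kappa):
    """Return a copy of a source's transition row whose self-edge is >= kappa.

    `row` maps target source -> probability and sums to 1. When the self-edge
    is below kappa, it is raised to kappa and the remaining edges are scaled
    down proportionally so that the row still sums to 1.
    """
    self_weight = row.get(src, 0.0)
    if self_weight >= kappa:
        return dict(row)
    out_total = sum(w for t, w in row.items() if t != src)
    scaled = {t: w * (1.0 - kappa) / out_total for t, w in row.items() if t != src}
    scaled[src] = kappa
    return scaled


def rank_sources(page_links, kappa=None, alpha=0.85, iters=100):
    """PageRank-style power iteration over sources instead of pages.

    page_links: iterable of (from_url, to_url) pairs.
    kappa: optional dict source -> throttling factor in [0, 1]; default 0.
    """
    kappa = kappa or {}
    counts = defaultdict(lambda: defaultdict(float))
    sources = set()
    for from_url, to_url in page_links:
        s, t = source_of(from_url), source_of(to_url)
        sources.update((s, t))
        counts[s][t] += 1.0  # edge weight = number of page-level links
    sources = sorted(sources)

    rows = {}
    for s in sources:
        total = sum(counts[s].values())
        # Sources without outgoing links get a self-loop (a simplification).
        row = {t: w / total for t, w in counts[s].items()} if total else {s: 1.0}
        rows[s] = throttle(row, s, kappa.get(s, 0.0))

    n = len(sources)
    score = {s: 1.0 / n for s in sources}
    for _ in range(iters):
        nxt = {s: (1.0 - alpha) / n for s in sources}  # uniform teleportation
        for s in sources:
            for t, w in rows[s].items():
                nxt[t] += alpha * score[s] * w
        score = nxt
    return score
```

Under these assumptions, giving a suspected spam source a high `kappa` limits how much rank it can pass on, however many pages or links it creates, because all of its pages collapse into one throttled node.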

Index Terms:
Internet search, information search and retrieval, information storage and retrieval, information technology and systems, distributed systems, systems and software, Web search, general, Web-based services, online information services.
Citation:
James Caverlee, Steve Webb, Ling Liu, William B. Rouse, "A Parameterized Approach to Spam-Resilient Link Analysis of the Web," IEEE Transactions on Parallel and Distributed Systems, vol. 20, no. 10, pp. 1422-1438, Oct. 2009, doi:10.1109/TPDS.2008.227