Issue No.10 - October (2009 vol.20)
pp: 1422-1438
James Caverlee , Texas A&M University, College Station
Steve Webb , Purewire, Atlanta
Ling Liu , Georgia Institute of Technology, Atlanta
William B. Rouse , Georgia Institute of Technology, Atlanta
ABSTRACT
Link-based analysis of the Web provides the basis for many important applications—like Web search, Web-based data mining, and Web page categorization—that bring order to the massive amount of distributed Web content. Due to the overwhelming reliance on these important applications, there is a rise in efforts to manipulate (or spam) the link structure of the Web. In this manuscript, we present a parameterized framework for link analysis of the Web that promotes spam resilience through a source-centric view of the Web. We provide a rigorous study of the set of critical parameters that can impact source-centric link analysis and propose the novel notion of influence throttling for countering the influence of link-based manipulation. Through formal analysis and a large-scale experimental study, we show how different parameter settings may impact the time complexity, stability, and spam resilience of Web link analysis. Concretely, we find that the source-centric model supports more effective and robust rankings in comparison with existing Web algorithms such as PageRank.
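The source-centric idea in the abstract can be illustrated with a small sketch. Everything below is an assumption made for illustration, not the authors' implementation: a "source" is taken to be a URL's host name, source-level edge weights are simple normalized link counts, and influence throttling is modeled as forcing each source to keep at least a per-source fraction `kappa` of its edge weight on a self-edge, with its remaining outgoing weights rescaled. The function names (`source_of`, `build_source_graph`, `throttle`, `rank`), the example URLs, and the parameter values are all hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse


def source_of(url):
    """Map a page URL to its source (assumed here: the host name)."""
    return urlparse(url).netloc


def build_source_graph(page_links):
    """Aggregate page-level links into row-normalized source-level edges."""
    counts = defaultdict(lambda: defaultdict(float))
    sources = set()
    for src_page, dst_page in page_links:
        s, t = source_of(src_page), source_of(dst_page)
        sources.update((s, t))
        counts[s][t] += 1.0
    graph = {}
    for s in sources:
        total = sum(counts[s].values())
        if total == 0:
            graph[s] = {s: 1.0}  # no out-links: keep all weight on the self-edge
        else:
            graph[s] = {t: w / total for t, w in counts[s].items()}
    return graph


def throttle(graph, kappa):
    """Raise each source's self-edge weight to at least kappa[source].

    Sources missing from kappa are left unthrottled (factor 0.0).
    Weights on edges to other sources are rescaled so each row sums to 1.
    """
    throttled = {}
    for s, edges in graph.items():
        k = kappa.get(s, 0.0)
        self_w = edges.get(s, 0.0)
        if self_w >= k:
            throttled[s] = dict(edges)
            continue
        scale = (1.0 - k) / (1.0 - self_w)  # self_w < k <= 1, so no division by zero
        row = {t: w * scale for t, w in edges.items() if t != s}
        row[s] = k
        throttled[s] = row
    return throttled


def rank(graph, alpha=0.85, iters=100):
    """PageRank-style power iteration over the source graph."""
    n = len(graph)
    score = {s: 1.0 / n for s in graph}
    for _ in range(iters):
        nxt = {s: (1.0 - alpha) / n for s in graph}
        for s, edges in graph.items():
            for t, w in edges.items():
                nxt[t] += alpha * score[s] * w
        score = nxt
    return score


# Hypothetical page-level links: two colluding "farm" sources boost a target.
page_links = [
    ("http://good1.example/a", "http://good2.example/a"),
    ("http://good2.example/a", "http://good1.example/a"),
    ("http://farm1.example/a", "http://target.example/"),
    ("http://farm2.example/a", "http://target.example/"),
    ("http://target.example/", "http://farm1.example/a"),
    ("http://target.example/", "http://farm2.example/a"),
]

graph = build_source_graph(page_links)
baseline = rank(graph)
# Assumed throttling factors: suspected sources keep 90% of their influence at home.
throttled = rank(throttle(graph, {"farm1.example": 0.9, "farm2.example": 0.9}))
```

In this toy graph, comparing `baseline["target.example"]` with `throttled["target.example"]` shows the target's score falling once the two farm sources are throttled, because less of their weight can flow outward. How throttling factors would be chosen in practice is outside the scope of this sketch.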
INDEX TERMS
Internet search, information search and retrieval, information storage and retrieval, information technology and systems, distributed systems, systems and software, Web search, general, Web-based services, online information services.
CITATION
James Caverlee, Steve Webb, Ling Liu, William B. Rouse, "A Parameterized Approach to Spam-Resilient Link Analysis of the Web", IEEE Transactions on Parallel & Distributed Systems, vol.20, no. 10, pp. 1422-1438, October 2009, doi:10.1109/TPDS.2008.227