Issue No. 06, June 2008 (vol. 20), pp. 796-808
The World Wide Web has become the most important information source for most of us. Unfortunately, there is no guarantee of the correctness of information on the web. Moreover, different web sites often provide conflicting information on a subject, such as different specifications for the same product. In this paper, we propose a new problem called Veracity, i.e., conformity to truth, which studies how to find true facts from a large amount of conflicting information on many subjects that is provided by various web sites. We design a general framework for the Veracity problem and invent an algorithm called TruthFinder, which utilizes the relationships between web sites and their information: a web site is trustworthy if it provides many pieces of true information, and a piece of information is likely to be true if it is provided by many trustworthy web sites. An iterative method is used to infer the trustworthiness of web sites and the correctness of information from each other. Our experiments show that TruthFinder successfully finds true facts among conflicting information and identifies trustworthy web sites better than popular search engines.
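The mutual reinforcement described in the abstract (sites are trusted when their facts are likely true, and facts are likely true when trusted sites provide them) can be illustrated with a small fixed-point iteration. The following is only a simplified sketch of that idea, not the TruthFinder algorithm itself: the function name `estimate_trust`, the starting trust of 0.9, the independence assumption between sites, the normalization across competing values, and the sample data are all assumptions made for illustration.

```python
def estimate_trust(claims, initial_trust=0.9, max_iters=50, tol=1e-6):
    """Jointly estimate site trust and fact confidence (illustrative only).

    claims maps site -> {object: value asserted by that site}.
    Returns (trust, confidence), where confidence is keyed by (object, value).
    """
    trust = {site: initial_trust for site in claims}

    # Invert the mapping: object -> value -> set of sites asserting that value.
    providers = {}
    for site, facts in claims.items():
        for obj, value in facts.items():
            providers.setdefault(obj, {}).setdefault(value, set()).add(site)

    confidence = {}
    for _ in range(max_iters):
        # Step 1: score each candidate value from the trust of its providers.
        confidence = {}
        for obj, values in providers.items():
            raw = {}
            for value, sites in values.items():
                all_wrong = 1.0
                for site in sites:
                    all_wrong *= 1.0 - trust[site]  # assumes independent sites
                raw[value] = 1.0 - all_wrong
            # Assumed simplification: competing values for one object share
            # a total confidence of 1.
            total = sum(raw.values())
            for value, score in raw.items():
                confidence[(obj, value)] = score / total

        # Step 2: a site's trust is the mean confidence of the facts it asserts.
        new_trust = {
            site: sum(confidence[(o, v)] for o, v in facts.items()) / len(facts)
            for site, facts in claims.items()
        }

        change = max(abs(new_trust[s] - trust[s]) for s in trust)
        trust = new_trust
        if change < tol:
            break

    return trust, confidence


# Made-up example: two sites agree on a value, a third disagrees.
claims = {
    "a.example": {"height": "8848 m", "first_ascent": "1953"},
    "b.example": {"height": "8848 m", "first_ascent": "1953"},
    "c.example": {"height": "8000 m", "first_ascent": "1953"},
}
trust, confidence = estimate_trust(claims)
for (obj, value), score in sorted(confidence.items()):
    print(obj, value, round(score, 3))
```

In this toy run the value backed by two sites ends up with higher confidence than the lone dissenting value, and the dissenting site ends up with lower trust than the two that agree. A fuller treatment would also need to account for partially similar or mutually supporting facts, which this sketch ignores.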
Data mining, Web mining
Xiaoxin Yin, Jiawei Han, Philip S. Yu, "Truth Discovery with Multiple Conflicting Information Providers on the Web", IEEE Transactions on Knowledge & Data Engineering, vol.20, no. 6, pp. 796-808, June 2008, doi:10.1109/TKDE.2007.190745