Issue No.08 - August (2010 vol.22)
pp: 1176-1190
Heasoo Hwang , UC San Diego
Andrey Balmin , IBM Almaden Research Center, San Jose
Berthold Reinwald , IBM Almaden Research Center, San Jose
Erik Nijkamp , Technische Universität Berlin
ABSTRACT
Dynamic authority-based keyword search algorithms, such as ObjectRank and personalized PageRank, leverage semantic link information to provide high-quality, high-recall search in databases and on the Web. Conceptually, these algorithms require a query-time PageRank-style iterative computation over the full graph. For large graphs, this computation is too expensive to be feasible at query time. The alternative, building an index of precomputed results for some or all keywords, involves very expensive preprocessing. We introduce BinRank, a system that approximates ObjectRank results using a hybrid approach inspired by materialized views in traditional query processing. We materialize a number of relatively small subsets of the data graph in such a way that any keyword query can be answered by running ObjectRank on only one of the subgraphs. BinRank generates the subgraphs in three steps: it partitions all the terms in the corpus based on their co-occurrence, executes ObjectRank for each partition using the partition's terms to generate a set of random walk starting points, and keeps only those objects that receive non-negligible scores. The intuition is that a subgraph that contains all objects and links relevant to a set of related terms should have all the information needed to rank objects with respect to any one of these terms. We demonstrate that BinRank can achieve subsecond query execution time on the English Wikipedia data set, while producing high-quality search results that closely approximate the results of ObjectRank on the original graph. The Wikipedia link graph contains about 10^8 edges, which is at least two orders of magnitude larger than what prior state-of-the-art dynamic authority-based search systems have been able to demonstrate. Our experimental evaluation investigates the trade-off between query execution time, quality of the results, and storage requirements of BinRank.
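The materialize-then-query idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`object_rank`, `materialize_subgraph`), the toy graph, the term index, the damping factor, and the score threshold are all assumptions chosen for illustration. The sketch also splits authority uniformly over outgoing links, whereas ObjectRank proper uses per-edge-type authority transfer rates, and it takes the bin of co-occurring terms as given rather than computing the partition.

```python
def object_rank(edges, base_set, damping=0.85, tol=1e-10, max_iter=200):
    """Personalized-PageRank-style power iteration (simplified ObjectRank).

    edges:    dict mapping node -> list of out-neighbours
    base_set: set of nodes containing the query term(s); random jumps
              land uniformly on these nodes (the random walk starting points)
    """
    nodes = set(edges) | {v for outs in edges.values() for v in outs} | set(base_set)
    jump = {n: (1.0 / len(base_set) if n in base_set else 0.0) for n in nodes}
    score = dict(jump)
    for _ in range(max_iter):
        nxt = {n: (1.0 - damping) * jump[n] for n in nodes}
        for u in nodes:
            outs = edges.get(u, [])
            if not outs:
                continue  # simplification: authority of dangling nodes is dropped
            share = damping * score[u] / len(outs)
            for v in outs:
                nxt[v] += share
        delta = sum(abs(nxt[n] - score[n]) for n in nodes)
        score = nxt
        if delta < tol:
            break
    return score


def materialize_subgraph(edges, bin_base_set, threshold):
    """Keep only nodes whose score for the whole term bin is non-negligible."""
    scores = object_rank(edges, bin_base_set)
    keep = {n for n, s in scores.items() if s >= threshold}
    # Simplification: links leaving the kept node set are discarded.
    return {u: [v for v in edges.get(u, []) if v in keep] for u in keep}


# Hypothetical toy data: a five-node graph and a three-term inverted index.
edges = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["e"], "e": ["d"]}
term_index = {"x": {"a"}, "y": {"b"}, "z": {"d"}}

# Assume the terms "x" and "y" were placed in the same bin.
bin_terms = ["x", "y"]
bin_base = set().union(*(term_index[t] for t in bin_terms))
sub = materialize_subgraph(edges, bin_base, threshold=1e-6)

# At query time, rank for "x" on the small subgraph instead of the full graph.
sub_scores = object_rank(sub, term_index["x"])
full_scores = object_rank(edges, term_index["x"])
```

In this toy case the nodes unreachable from the bin's starting points receive a score of zero and are pruned, so the subgraph ranking for "x" matches the full-graph ranking. On a real graph the pruning is lossy, which is the approximation whose quality the paper's evaluation measures.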
INDEX TERMS
Online keyword search, ObjectRank, scalability, approximation algorithms.
CITATION
Heasoo Hwang, Andrey Balmin, Berthold Reinwald, Erik Nijkamp, "BinRank: Scaling Dynamic Authority-Based Search Using Materialized Subgraphs," IEEE Transactions on Knowledge & Data Engineering, vol. 22, no. 8, pp. 1176-1190, August 2010, doi:10.1109/TKDE.2010.85