Issue No.12 - December (2011 vol.23)
pp: 1872-1887
Shaoxu Song, The Hong Kong University of Science and Technology, Hong Kong
Lei Chen, The Hong Kong University of Science and Technology, Hong Kong
Mingxuan Yuan, The Hong Kong University of Science and Technology, Hong Kong
ABSTRACT
Dataspaces consist of large-scale heterogeneous data. A practical dataspace system should provide a query interface for accessing tuples as a fundamental facility. An efficient index has previously been proposed for queries with keyword neighborhood over dataspaces. In this paper, we study the materialization and decomposition of dataspaces in order to improve query efficiency. First, we study views of items, which are materialized so that queries can reuse them. Given a set of materialized views, the problem is to select some of them as the optimal plan with the minimum query cost. We develop efficient algorithms for query planning and view generation. Second, we study partitions of tuples for answering top-k queries. Given a query, we evaluate the score bounds of the tuples in each partition and prune those partitions whose bounds are lower than the scores of the top-k answers. We also provide a theoretical analysis of query cost and prove that query efficiency cannot be improved by increasing the number of partitions. Finally, we conduct an extensive experimental evaluation to illustrate the superior performance of the proposed techniques.
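The partition pruning idea for top-k queries can be illustrated with a small sketch. It is not the paper's algorithm: the scoring function (number of query keywords a tuple contains), the bound (number of query keywords in a partition's vocabulary), and all names such as top_k_with_pruning are assumptions chosen only to show how a score bound lets whole partitions be skipped.

```python
import heapq
from itertools import count


def score(query, tup):
    # Hypothetical score: how many query keywords the tuple contains.
    return len(query & tup)


def top_k_with_pruning(query, partitions, k):
    """Return (answers, scanned): the top-k (score, tuple) pairs and the
    number of partitions that were actually scanned."""
    if k <= 0:
        return [], 0

    # Upper bound per partition: a tuple cannot match more query keywords
    # than appear anywhere in its partition (assumed bound, for illustration).
    bounded = []
    for part in partitions:
        vocab = frozenset().union(*part)
        bounded.append((len(query & vocab), part))
    bounded.sort(key=lambda bp: bp[0], reverse=True)

    heap = []      # min-heap holding the current top-k candidates
    tie = count()  # tie-breaker so tuples themselves are never compared
    scanned = 0
    for bound, part in bounded:
        # Prune: this bound (and every later one) is lower than the
        # score of the current k-th answer.
        if len(heap) == k and bound < heap[0][0]:
            break
        scanned += 1
        for tup in part:
            item = (score(query, tup), next(tie), tup)
            if len(heap) < k:
                heapq.heappush(heap, item)
            elif item[0] > heap[0][0]:
                heapq.heapreplace(heap, item)

    answers = sorted(heap, key=lambda it: it[0], reverse=True)
    return [(s, t) for s, _, t in answers], scanned


# Toy data: each tuple is a set of keywords, grouped into three partitions.
query = frozenset({"hong", "kong", "university"})
partitions = [
    [frozenset({"paris", "museum"})],
    [frozenset({"hong", "kong", "university", "science"}),
     frozenset({"hong", "kong", "airport"})],
    [frozenset({"university", "library"}),
     frozenset({"tokyo", "film"})],
]
answers, scanned = top_k_with_pruning(query, partitions, k=2)
```

In this toy run only the partition with the highest bound is scanned; the other two have bounds below the score of the second-best answer and are skipped. Visiting partitions in decreasing order of bound is a design choice of the sketch, so that the scan can stop at the first prunable partition.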
INDEX TERMS
Dataspaces, materialization, decomposition.
CITATION
Shaoxu Song, Lei Chen, Mingxuan Yuan, "Materialization and Decomposition of Dataspaces for Efficient Search", IEEE Transactions on Knowledge & Data Engineering, vol. 23, no. 12, pp. 1872-1887, December 2011, doi:10.1109/TKDE.2010.213