Meshing Streaming Updates with Persistent Data in an Active Data Warehouse
July 2008 (vol. 20 no. 7)
pp. 976-991
An active warehouse is refreshed on-line and thus achieves higher consistency between the stored information and the latest data updates. The need for on-line warehouse refreshment introduces several challenges in the implementation of data warehouse transformations. In this article, we focus on a frequently encountered operation in this context, namely, the join of a fast stream $S$ of source updates with a disk-based relation $R$, under the constraint of limited memory. This operation lies at the core of several common transformations, such as surrogate key assignment, duplicate detection, and identification of newly inserted tuples. We propose a specialized join algorithm, termed MeshJoin, that compensates for the difference in the access cost of the two join inputs by (a) relying entirely on fast sequential scans of $R$, and (b) sharing the I/O cost of accessing $R$ across multiple tuples of $S$. We detail the MeshJoin algorithm and develop a systematic cost model that enables tuning MeshJoin based on the available memory and the desired throughput. We present an experimental study that validates the performance of MeshJoin on synthetic and real-life data. Our results verify the effectiveness of MeshJoin and demonstrate its advantages over existing join algorithms.
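
The two ideas named in the abstract, sequential scans of $R$ and sharing each page read across many tuples of $S$, can be illustrated with a small in-memory sketch. This is not the authors' implementation: the function name `mesh_join_sketch`, the representation of $R$ as a list of pages, the batching of $S$, and the one-page-per-batch schedule are all assumptions made for illustration. The sketch scans $R$ cyclically, keeps each stream batch in a hash-indexed window for exactly one full pass over $R$, and probes the window with every page that is read.

```python
from collections import defaultdict, deque


def mesh_join_sketch(stream_batches, relation_pages, key_s, key_r):
    """Illustrative equi-join of stream batches with a paged relation.

    Assumptions (not from the abstract): R is given as a list of pages
    (lists of tuples), S arrives as an iterable of batches, and one page
    of R is read per arriving batch.
    """
    num_pages = len(relation_pages)
    if num_pages == 0:
        return []

    window = deque()           # (arrival_step, batch), oldest first
    index = defaultdict(list)  # join key -> stream tuples currently in window
    results = []
    batches = iter(stream_batches)
    step = 0

    while True:
        batch = next(batches, None)
        if batch is not None:
            # Admit the new batch into the window and index it by join key.
            window.append((step, batch))
            for s in batch:
                index[key_s(s)].append(s)
        elif not window:
            break  # stream exhausted and every batch has seen all of R

        # Read the next page of R in scan order (wrapping around) and probe
        # the index, so one page read serves every batch in the window.
        for r in relation_pages[step % num_pages]:
            for s in index.get(key_r(r), ()):
                results.append((s, r))
        step += 1

        # A batch that has been probed by num_pages consecutive pages has
        # met all of R and can be expired.
        while window and step - window[0][0] >= num_pages:
            _, old = window.popleft()
            for s in old:
                bucket = index[key_s(s)]
                bucket.remove(s)
                if not bucket:
                    del index[key_s(s)]

    return results


if __name__ == "__main__":
    R = [[(1, "a"), (2, "b")], [(3, "c")]]          # two pages of R
    S = [[(1, "x")], [(3, "y"), (2, "z")], [(9, "q")]]  # three stream batches
    first = lambda t: t[0]
    for pair in mesh_join_sketch(S, R, first, first):
        print(pair)
```

In this sketch the window holds at most as many batches as $R$ has pages, so the batch size is the knob that trades memory for throughput; the cost model mentioned in the abstract addresses that tuning in the real algorithm.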

Index Terms:
Query processing, Data warehouse and repository
Neoklis Polyzotis, Spiros Skiadopoulos, Panos Vassiliadis, Alkis Simitsis, Nils Frantzell, "Meshing Streaming Updates with Persistent Data in an Active Data Warehouse," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 7, pp. 976-991, July 2008, doi:10.1109/TKDE.2008.27