Issue No. 06 - June (2012 vol. 24)
pp: 1092-1105
Lukasz Golab, University of Waterloo, Waterloo
Theodore Johnson, AT&T Labs-Research, Florham Park
Vladislav Shkapenyuk, AT&T Labs-Research, Florham Park
ABSTRACT
We discuss update scheduling in streaming data warehouses, which combine the features of traditional data warehouses and data stream systems. In our setting, external sources push append-only data streams into the warehouse with a wide range of interarrival times. While traditional data warehouses are typically refreshed during downtimes, streaming warehouses are updated as new data arrive. We model the streaming warehouse update problem as a scheduling problem in which jobs correspond to processes that load new data into tables and the objective is to minimize data staleness over time (at time t, if a table has been updated with information up to some earlier time r, its staleness is t minus r). We then propose a scheduling framework that handles the complications encountered by a stream warehouse: view hierarchies and priorities, data consistency, inability to preempt updates, heterogeneity of update jobs caused by different interarrival times and data volumes among different sources, and transient overload. A novel feature of our framework is that scheduling decisions do not depend on properties of update jobs (such as deadlines), but rather on the effect of update jobs on data staleness. Finally, we present a suite of update scheduling algorithms and extensive simulation experiments to map out the factors that affect their performance.
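The staleness definition above, and the idea of choosing jobs by their effect on staleness rather than by deadlines, can be illustrated with a small sketch. This is not one of the paper's algorithms: the greedy rule, the `Table` and `UpdateJob` classes, the `priority` weight, and the assumption that each job's run time is known in advance are all hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Table:
    name: str
    freshness: float       # r: the table holds data up to this time
    priority: float = 1.0  # hypothetical weight, not taken from the abstract


@dataclass
class UpdateJob:
    table: Table
    new_freshness: float   # value of r once the job has completed
    run_time: float        # assumed known and strictly positive


def staleness(table: Table, now: float) -> float:
    """Staleness at time `now` is now - r, as defined in the abstract."""
    return max(0.0, now - table.freshness)


def staleness_reduction(job: UpdateJob) -> float:
    """Priority-weighted staleness removed if the job runs to completion."""
    gain = max(0.0, job.new_freshness - job.table.freshness)
    return job.table.priority * gain


def pick_next(jobs: list[UpdateJob]) -> UpdateJob:
    """Illustrative greedy rule: largest staleness reduction per unit of run time.

    Updates cannot be preempted, so the chosen job is assumed to run to
    completion before the next choice is made.
    """
    return max(jobs, key=lambda j: staleness_reduction(j) / j.run_time)


now = 100.0
flows = Table("flows", freshness=90.0)
alerts = Table("alerts", freshness=60.0, priority=2.0)
jobs = [
    UpdateJob(flows, new_freshness=100.0, run_time=2.0),
    UpdateJob(alerts, new_freshness=95.0, run_time=10.0),
]
chosen = pick_next(jobs)
```

With these made-up numbers, `flows` has staleness 10 and `alerts` has staleness 40. The sketch picks the `alerts` job because its weighted reduction per unit of run time (70 / 10) exceeds that of the `flows` job (10 / 2), even though the `alerts` job takes longer to run.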
INDEX TERMS
Data warehouse maintenance, online scheduling.
CITATION
Lukasz Golab, Theodore Johnson, Vladislav Shkapenyuk, "Scalable Scheduling of Updates in Streaming Data Warehouses", IEEE Transactions on Knowledge & Data Engineering, vol. 24, no. 6, pp. 1092-1105, June 2012, doi:10.1109/TKDE.2011.45