2011 IEEE 27th International Conference on Data Engineering (2011)
Hannover, Germany
Apr. 11, 2011 to Apr. 16, 2011
ISBN: 978-1-4244-8959-6
pp: 159-170
Mihaela A. Bornea , Athens U. of Econ and Business, Greece
Antonios Deligiannakis , Technical University of Crete, Greece
Yannis Kotidis , Athens U. of Econ and Business, Greece
Vasilis Vassalos , Athens U. of Econ and Business, Greece
ABSTRACT
Active data warehouses have emerged as a new business intelligence paradigm where data in the integrated repository is refreshed in near real-time. This shift in practice achieves higher consistency between the stored information and the latest updates, which in turn crucially influences the output of decision-making processes. In this paper we focus on the changes required in the implementation of Extract-Transform-Load (ETL) operations, which now need to be executed in an online fashion. In particular, ETL transformations frequently include a join between an incoming stream of updates and a disk-resident table of historical data or metadata. In this context we propose a novel Semi-Streaming Index Join (SSIJ) algorithm that maximizes the throughput of the join by buffering stream tuples and then judiciously selecting how to best amortize expensive disk seeks for blocks of the stored relation among a large number of stream tuples. The relation blocks required for joining with the stream are loaded from disk based on an optimal plan. In order to maximize the utilization of the available memory space for performing the join, our technique incorporates a simple but effective cache replacement policy for managing the retrieved blocks of the relation. Moreover, SSIJ is able to adapt to changing characteristics of the stream (i.e., arrival rate and data distribution) by dynamically adjusting the memory allocated between the cached relation blocks and the stream. Our experiments with a variety of synthetic and real data sets demonstrate that SSIJ consistently outperforms the state-of-the-art algorithm in terms of the maximum sustainable throughput of the join, while also being able to accommodate deadlines on stream tuple processing.
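The buffering idea in the abstract can be illustrated with a minimal Python sketch. It is not the SSIJ algorithm itself: it only shows how grouping buffered stream tuples by the relation block they probe lets a single block read serve many tuples. The names `BLOCK_SIZE`, `block_of`, `read_block`, and `batched_index_join` are hypothetical, the greedy "most requested block first" order is an assumption standing in for the paper's optimal plan, and the cache replacement policy and adaptive memory allocation are left out entirely.

```python
from collections import defaultdict

BLOCK_SIZE = 4  # hypothetical: number of relation rows per disk block


def block_of(key):
    """Hypothetical index lookup: map a join key to the block that stores it."""
    return key // BLOCK_SIZE


def batched_index_join(stream_batch, read_block):
    """Join buffered stream tuples with a stored relation, one read per block.

    stream_batch: list of (key, payload) tuples buffered from the stream.
    read_block:   callable(block_id) -> dict {key: relation_row}; stands in
                  for an expensive disk read.
    """
    # Group the buffered stream tuples by the relation block they need.
    pending = defaultdict(list)
    for key, payload in stream_batch:
        pending[block_of(key)].append((key, payload))

    results = []
    # Assumed greedy order: serve the block wanted by the most buffered tuples
    # first, so each read is amortized over as many stream tuples as possible.
    for block_id in sorted(pending, key=lambda b: (-len(pending[b]), b)):
        rows = read_block(block_id)  # one read per distinct block
        for key, payload in pending[block_id]:
            if key in rows:
                results.append((key, payload, rows[key]))
    return results


# Toy stored relation: keys 0..11 laid out in blocks of BLOCK_SIZE rows.
relation = {k: f"row{k}" for k in range(12)}
reads = []


def read_block(block_id):
    reads.append(block_id)  # record each simulated disk access
    lo = block_id * BLOCK_SIZE
    return {k: relation[k] for k in range(lo, lo + BLOCK_SIZE) if k in relation}


batch = [(5, "a"), (1, "b"), (6, "c"), (4, "d"), (9, "e"), (20, "f")]
joined = batched_index_join(batch, read_block)
```

In this toy run, three of the six buffered tuples fall in the same block and share one simulated read, which is the amortization effect the abstract refers to. A per-tuple index probe would instead issue one read per stream tuple.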
CITATION

M. A. Bornea, Y. Kotidis, A. Deligiannakis and V. Vassalos, "Semi-Streamed Index Join for near-real time execution of ETL transformations," 2011 IEEE 27th International Conference on Data Engineering (ICDE), Hannover, Germany, 2011, pp. 159-170.
doi:10.1109/ICDE.2011.5767906