Parallel Star Join + DataIndexes: Efficient Query Processing in Data Warehouses and OLAP
November/December 2002 (vol. 14 no. 6)
pp. 1299-1316

Abstract:
On-Line Analytical Processing (OLAP) refers to the technologies that allow users to efficiently retrieve data from the data warehouse for decision-support purposes. Data warehouses tend to be extremely large: it is quite possible for a data warehouse to be hundreds of gigabytes to terabytes in size. Queries tend to be complex and ad hoc, often requiring computationally expensive operations such as joins and aggregation. Given this, we are interested in developing strategies for improving query processing in data warehouses by exploring the applicability of parallel processing techniques. In particular, we exploit the natural partitionability of a star schema and render it even more efficient by applying DataIndexes, a storage structure that serves as both an index and data and lends itself naturally to vertical partitioning of the data. DataIndexes are derived from the various special-purpose access mechanisms currently supported in commercial OLAP products. Specifically, we propose a declustering strategy that incorporates both task and data partitioning, and we present the Parallel Star Join (PSJ) Algorithm, which provides a means to perform a star join in parallel using efficient operations involving only rowsets and projection columns. We compare the performance of the PSJ Algorithm with two parallel query processing strategies. The first is a parallel join strategy utilizing the Bitmap Join Index (BJI), arguably the state-of-the-art OLAP join structure in use today. For the second strategy, we choose a well-known parallel join algorithm, namely the pipelined hash algorithm. To assist in the performance comparison, we first develop a cost model of the disk access and transmission costs for all three approaches.
Performance comparisons show that the DataIndex-based approach leads to dramatically lower disk access costs than both the BJI and the hybrid hash approaches in speedup and scaleup experiments, while the hash-based approach outperforms the BJI in disk access costs. With regard to transmission overhead, our performance results show that PSJ and BJI outperform the hash-based approach. Overall, our parallel star join algorithm and DataIndexes form a winning combination.
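To make the phrase "operations involving only rowsets and projection columns" more concrete, the following is a minimal single-process sketch of a star join over column-wise storage. It is not the PSJ Algorithm and does not model DataIndexes, declustering, or parallelism as defined in the paper. The table contents, column names, and helper functions (`matching_keys`, `fact_rowset`, `star_join_rows`) are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Set

# Hypothetical toy star schema, stored one list per column.
# Position i in every fact column belongs to the same fact record.
fact_store_key = [0, 1, 0, 2, 1, 2]     # foreign key into the store dimension
fact_product_key = [0, 0, 1, 1, 2, 2]   # foreign key into the product dimension
fact_sales = [10, 20, 30, 40, 50, 60]   # measure (projection) column

# Dimension columns, indexed by dimension key.
store_region = ["east", "west", "east"]
product_category = ["toys", "books", "toys"]


def matching_keys(dim_column: Sequence[str], wanted: str) -> Set[int]:
    """Return the dimension keys whose attribute equals `wanted`."""
    return {key for key, value in enumerate(dim_column) if value == wanted}


def fact_rowset(fk_column: Sequence[int], keys: Set[int]) -> Set[int]:
    """Return the fact row ids whose foreign key is in `keys`."""
    return {row for row, key in enumerate(fk_column) if key in keys}


def star_join_rows() -> List[int]:
    """Intersect one rowset per dimension restriction."""
    east = fact_rowset(fact_store_key, matching_keys(store_region, "east"))
    toys = fact_rowset(fact_product_key, matching_keys(product_category, "toys"))
    return sorted(east & toys)


# Only the needed projection column is read, and only at the surviving rows.
rows = star_join_rows()
total = sum(fact_sales[r] for r in rows)
```

In this sketch, each dimension restriction yields a set of fact row ids, the sets are intersected, and the measure column is read only at the surviving positions. Because the fact data is held as separate columns, each column could in principle be placed on a different node, which is the kind of vertical partitioning the abstract refers to; how the paper actually assigns data and tasks to processors is not shown here.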


Index Terms:
Parallel star join, OLAP, query processing, dataindexes.
Citation:
Anindya Datta, Debra VanderMeer, Krithi Ramamritham, "Parallel Star Join + DataIndexes: Efficient Query Processing in Data Warehouses and OLAP," IEEE Transactions on Knowledge and Data Engineering, vol. 14, no. 6, pp. 1299-1316, Nov.-Dec. 2002, doi:10.1109/TKDE.2002.1047769