Communication and Memory Optimal Parallel Data Cube Construction
December 2005 (vol. 16 no. 12)
pp. 1105-1119

Abstract—Data cube construction is a commonly used operation in data warehouses. Because of the volume of data that is stored and analyzed in a data warehouse and the amount of computation involved in data cube construction, it is natural to consider parallel machines for this operation. This paper addresses a number of algorithmic issues in parallel data cube construction. First, we present an aggregation tree for sequential (and parallel) data cube construction which has minimally bounded memory requirements. An aggregation tree is parameterized by the ordering of dimensions. We present a parallel algorithm based upon the aggregation tree. We analyze the interprocessor communication volume and construct a closed form expression for it. We prove that the same ordering of the dimensions in the aggregation tree minimizes both the computational and communication requirements. We also describe a method for partitioning the initial array and prove that it minimizes the communication volume. Finally, in the cases when memory may be a bottleneck, we describe how tiling can help scale sequential and parallel data cube construction. Experimental results from implementation of our algorithms on a cluster of workstations show the effectiveness of our algorithms and validate our theoretical results.
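For readers unfamiliar with the operation itself, the following is a minimal sequential sketch of data cube construction: given a d-dimensional array, it computes all 2^d group-by aggregates, deriving each one from an already computed parent that has one more dimension. It is not the aggregation tree or the parallel algorithm of the paper, and it makes no claim about memory or communication bounds. The names `aggregate` and `build_cube`, the dictionary representation, the choice of `sum` as the aggregate, and the "smallest parent" heuristic are all assumptions made for illustration.

```python
from itertools import combinations

def aggregate(parent, keep_positions):
    """Sum a group-by table down to the coordinates at keep_positions."""
    child = {}
    for coords, value in parent.items():
        key = tuple(coords[i] for i in keep_positions)
        child[key] = child.get(key, 0) + value
    return child

def build_cube(base, num_dims):
    """Return {retained dimensions: table} for all 2**num_dims group-bys.

    base maps full coordinate tuples to measure values (hypothetical layout).
    """
    all_dims = tuple(range(num_dims))
    cube = {all_dims: dict(base)}
    # Go from d-1 retained dimensions down to 0 so every parent already exists.
    for size in range(num_dims - 1, -1, -1):
        for dims in combinations(all_dims, size):
            # Candidate parents retain exactly one extra dimension.
            parents = [p for p in cube
                       if len(p) == size + 1 and set(dims) <= set(p)]
            # Illustrative heuristic: aggregate from the smallest parent table.
            parent = min(parents, key=lambda p: len(cube[p]))
            keep = [parent.index(d) for d in dims]
            cube[dims] = aggregate(cube[parent], keep)
    return cube

if __name__ == "__main__":
    # A small dense 2 x 3 x 2 array with made-up values.
    base = {(i, j, k): i + 10 * j + 100 * k
            for i in range(2) for j in range(3) for k in range(2)}
    cube = build_cube(base, 3)
    for dims in sorted(cube, key=len, reverse=True):
        print(dims, len(cube[dims]), "cells")
```

Computing each group-by from a parent rather than from the original array is what makes the order of dimensions matter, which is the design question the abstract's aggregation tree addresses.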

Index Terms:
Data warehouses, OLAP, parallel algorithms, communication analysis.
Citation:
Ruoming Jin, Karthikeyan Vaidyanathan, Ge Yang, Gagan Agrawal, "Communication and Memory Optimal Parallel Data Cube Construction," IEEE Transactions on Parallel and Distributed Systems, vol. 16, no. 12, pp. 1105-1119, Dec. 2005, doi:10.1109/TPDS.2005.144