Issue No.10 - Oct. (2012 vol.24)
pp: 1747-1759
Arnab Nandi, The Ohio State University, Columbus
Cong Yu, Google Research, New York
Philip Bohannon, Facebook, Menlo Park
Raghu Ramakrishnan, Microsoft, Redmond
ABSTRACT
Computing interesting measures for data cubes and subsequent mining of interesting cube groups over massive data sets are critical for many important analyses done in the real world. Previous studies have focused on algebraic measures such as SUM that are amenable to parallel computation and can easily benefit from recent advances in parallel computing infrastructure such as MapReduce. Dealing with holistic measures such as TOP-K, however, is nontrivial. In this paper, we detail real-world challenges in cube materialization and mining tasks on web-scale data sets. Specifically, we identify an important subset of holistic measures and introduce MR-Cube, a MapReduce-based framework for efficient cube computation and identification of interesting cube groups on holistic measures. We provide extensive experimental analyses over both real and synthetic data. We demonstrate that, unlike existing techniques, which cannot scale to the 100 million tuple mark for our data sets, MR-Cube successfully and efficiently computes cubes with holistic measures over billion-tuple data sets.
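The contrast the abstract draws between algebraic and holistic measures can be illustrated with a small sketch. This is not MR-Cube or any code from the paper; the function names and the toy partitions below are hypothetical, chosen only to show why SUM can be assembled from per-partition partial results while a TOP-K (most frequent values) measure cannot, in general, be assembled from per-partition top-k lists alone.

```python
from collections import Counter

# --- Algebraic measure: SUM -------------------------------------------
# Each partition (think: one mapper's share of a cube group) emits a
# small partial result, and the partials combine into the exact answer.

def partial_sum(partition):
    return sum(partition)

def combine_sums(partials):
    return sum(partials)

# --- Holistic measure: TOP-K most frequent values ---------------------
# A per-partition top-k list discards counts of values that are not
# locally frequent, so merging those lists can miss the global winner.

def local_top_k(partition, k):
    return Counter(partition).most_common(k)

def merge_top_k(partials, k):
    merged = Counter()
    for partial in partials:
        for item, count in partial:
            merged[item] += count
    return merged.most_common(k)

def exact_top_k(partitions, k):
    full = Counter()
    for partition in partitions:
        full.update(partition)
    return full.most_common(k)

# Hypothetical toy data, split across two partitions.
numbers = [[1, 2, 3], [4, 5]]
values = [["a", "a", "b", "b", "b"], ["a", "a", "c", "c", "c"]]

sum_from_partials = combine_sums([partial_sum(p) for p in numbers])
sum_exact = sum(x for p in numbers for x in p)

top1_from_partials = merge_top_k([local_top_k(p, 1) for p in values], 1)
top1_exact = exact_top_k(values, 1)

print(sum_from_partials == sum_exact)      # partial sums are sufficient
print(top1_from_partials, top1_exact)      # the two top-1 answers differ
```

In the toy data, "a" is the most frequent value overall but is never the most frequent value within a single partition, so the merged per-partition lists cannot recover it. An exact answer needs every value of a group to be available together, which is why large groups are the hard case for holistic measures in a MapReduce setting.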
INDEX TERMS
Lattices, Cities and towns, USA Councils, Data mining, Knowledge engineering, Data engineering, Algorithm design and analysis, holistic measures, Data cube, cube materialization, cube mining, MapReduce
CITATION
Arnab Nandi, Cong Yu, Philip Bohannon, Raghu Ramakrishnan, "Data Cube Materialization and Mining over MapReduce", IEEE Transactions on Knowledge & Data Engineering, vol.24, no. 10, pp. 1747-1759, Oct. 2012, doi:10.1109/TKDE.2011.257