The Community for Technology Leaders
RSS Icon
Subscribe
Issue No.09 - September (2011 vol.23)
pp: 1299-1311
Dawei Jiang , National University of Singapore, Singapore
Anthony K. H. Tung , National University of Singapore, Singapore
Gang Chen , Zhejiang University, Hangzhou
ABSTRACT
Data analysis is an important functionality in cloud computing which allows a huge amount of data to be processed over very large clusters. MapReduce is recognized as a popular way to handle data in the cloud environment due to its excellent scalability and good fault tolerance. However, compared to parallel databases, the performance of MapReduce is slower when it is adopted to perform complex data analysis tasks that require the joining of multiple data sets in order to compute certain aggregates. A common concern is whether MapReduce can be improved to produce a system with both scalability and efficiency. In this paper, we introduce Map-Join-Reduce, a system that extends and improves MapReduce runtime framework to efficiently process complex data analysis tasks on large clusters. We first propose a filtering-join-aggregation programming model, a natural extension of MapReduce's filtering-aggregation programming model. Then, we present a new data processing strategy which performs filtering-join-aggregation tasks in two successive MapReduce jobs. The first job applies filtering logic to all the data sets in parallel, joins the qualified tuples, and pushes the join results to the reducers for partial aggregation. The second job combines all partial aggregation results and produces the final answer. The advantage of our approach is that we join multiple data sets in one go and thus avoid frequent checkpointing and shuffling of intermediate results, a major performance bottleneck in most of the current MapReduce-based systems. We benchmark our system against Hive, a state-of-the-art MapReduce-based data warehouse on a 100-node cluster on Amazon EC2 using TPC-H benchmark. The results show that our approach significantly boosts the performance of complex analysis queries.
INDEX TERMS
Cloud computing, parallel systems, query processing.
CITATION
Dawei Jiang, Anthony K. H. Tung, Gang Chen, "MAP-JOIN-REDUCE: Toward Scalable and Efficient Data Analysis on Large Clusters", IEEE Transactions on Knowledge & Data Engineering, vol.23, no. 9, pp. 1299-1311, September 2011, doi:10.1109/TKDE.2010.248
REFERENCES
[1] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Proc. Operating Systems Design and Implementation (OSDI), pp. 137-150, 2004.
[2] http://developer.yahoo.net/blogs/hadoop/ 200809/, 2011.
[3] H.-C. Yang, A. Dasdan, R.-L. Hsiao, and D.S. Parker, "Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters," Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '07), 2007.
[4] D. DeWitt, E. Paulson, E. Robinson, J. Naughton, J. Royalty, S. Shankar, and A. Krioukov, "Clustera: An Integrated Computation and Data Management System," Proc. VLDB Endowment, vol. 1, no. 1, pp. 28-41, 2008.
[5] A. Thusoo, J.S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wychoff, and R. Murthy, "Hive—A Warehousing Solution over a Map-Reduce Framework," Proc. VLDB Endowment, vol. 2, no. 2, pp. 1626-1629, 2009.
[6] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins, "Pig Latin: A Not-so-Foreign Language for Data Processing," Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '08), 2008.
[7] D. DeWitt and J. Gray, "Parallel Database Systems: The Future of High Performance Database Systems," Comm. ACM, vol. 35, no. 6, pp. 85-98, 1992.
[8] D.J. DeWitt, R.H. Gerber, G. Graefe, M.L. Heytens, K.B. Kumar, and M. Muralikrishna, "Gamma—A High Performance Dataflow Database Machine," Proc. 12th Int'l Conf. Very Large Data Bases, pp. 228-237, 1986.
[9] S. Fushimi, M. Kitsuregawa, and H. Tanaka, "An Overview of the System Software of a Parallel Relational Database Machine Grace," Proc. 12th Int'l Conf. Very Large Data Bases, pp. 209-219, 1986.
[10] A. Pavlo, E. Paulson, A. Rasin, D.J. Abadi, D.J. Dewitt, S. Madden, and M. Stonebraker, "A Comparison of Approaches to Large-Scale Data Analysis," Proc. 35th SIGMOD Int'l Conf. Management of Data (SIGMOD '09), http://database.cs.brown.edu/sigmod09benchmarks-sigmod09.pdf , June 2009.
[11] M. Stonebraker, D. Abadi, D.J. DeWitt, S. Madden, E. Paulson, A. Pavlo, and A. Rasin, "Mapreduce and Parallel DBMSs: Friends or Foes?" Comm. ACM, vol. 53, no. 1, pp. 64-71, 2010.
[12] A. Abouzeid, K. Bajda-Pawlikowski, D.J. Abadi, A. Silberschatz, and A. Rasin, "HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads," Proc. VLDB Endowment, vol. 2, no. 1, pp. 922-933, 2009.
[13] http:/hadoop.apache.org, 2011.
[14] K.S. Beyer, V. Ercegovac, R. Krishnamurthy, S. Raghavan, J. Rao, F. Reiss, E.J. Shekita, D.E. Simmen, S. Tata, S. Vaithyanathan, and H. Zhu, "Towards a Scalable Enterprise Content Analytics Platform," IEEE Data Eng. Bull., vol. 32, no. 1, pp. 28-35, Mar. 2009.
[15] D.A. Schneider and D.J. DeWitt, "Tradeoffs in Processing Complex Join Queries via Hashing in Multiprocessor Database Machines," Proc. 16th Int'l Conf. Very Large Data Bases. pp. 469-480, 1990.
[16] R. Pike, S. Dorward, R. Griesemer, and S. Quinlan, "Interpreting the Data: Parallel Analysis with Sawzall," Scientific Programming, vol. 13, no. 4, pp. 277-298, 2005.
[17] http:/www.cascading.org, 2011.
[18] F.N. Afrati and J.D. Ullman, "Optimizing Joins in a Map-Reduce Environment," Proc. 13th Int'l Conf. Extending Database Technology (EDBT '10), 2010.
[19] http://www.tpc.orgtpch/, 2011.
[20] D.A. Schneider and D.J. DeWitt, "A Performance Evaluation of Four Parallel Join Algorithms in a Shared-Nothing Multiprocessor Environment," ACM SIGMOD Record, vol. 18, no. 2, pp. 110-121, 1989.
[21] M. Stonebraker, D.J. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, E. Lau, A. Lin, S.R. Madden, E.J. O'Neil, P.E. O'Neil, A. Rasin, N. Tran, and S.B. Zdonik, "C-Store: A Column-Oriented DBMS," Proc. 31st Int'l Conf. Very Large Data Bases (VLDB '05), pp. 553-564, 2005.
[22] M. Ziane, M. Zaït, and P. Borla-Salamet, "Parallel Query Processing with Zigzag Trees," The VLDB J.—Int'l J. Very Large Data Bases—Parallelism in Database Systems, vol. 2, no. 3, pp. 277-302, 1993.
[23] A. Weintraub, "Integer Programming in Forestry," Annals of Operations Research, vol. 149, no. 1, pp. 209-216, 2007.
[24] http://www.cloudera.com/blog/2009/05/07what's-new-in- hadoop-core-020 /, 2011.
[25] http://issues.apache.org/jira/browsehive-600 , 2011.
[26] http://issues.apache.org/jira/browsehive-396 , 2011.
16 ms
(Ver 2.0)

Marketing Automation Platform Marketing Automation Tool