This Article 
   
 Share 
   
 Bibliographic References 
   
 Add to: 
 
Digg
Furl
Spurl
Blink
Simpy
Google
Del.icio.us
Y!MyWeb
 
 Search 
   
Shared Memory Parallelization of Data Mining Algorithms: Techniques, Programming Interface, and Performance
January 2005 (vol. 17 no. 1)
pp. 71-89
Gagan Agrawal, IEEE Computer Society
With recent technological advances, shared memory parallel machines have become more scalable, and offer large main memories and high bus bandwidths. They are emerging as good platforms for data warehousing and data mining. In this paper, we focus on shared memory parallelization of data mining algorithms. We have developed a series of techniques for parallelization of data mining algorithms, including full replication, full locking, fixed locking, optimized full locking, and cache-sensitive locking. Unlike previous work on shared memory parallelization of specific data mining algorithms, all of our techniques apply to a large number of popular data mining algorithms. In addition, we propose a reduction-object-based interface for specifying a data mining algorithm. We show how our runtime system can apply any of the techniques we have developed starting from a common specification of the algorithm. We have carried out a detailed evaluation of the parallelization techniques and the programming interface. We have experimented with apriori and fp-tree-based association mining, k-means clustering, k-nearest neighbor classifier, and decision tree construction. The main results from our experiments are as follows: 1) Among full replication, optimized full locking, and cache-sensitive locking, there is no clear winner. Each of these three techniques can outperform others depending upon machine and dataset parameters. These three techniques perform significantly better than the other two techniques. 2) Good parallel efficiency is achieved for each of the four algorithms we experimented with, using our techniques and runtime system. 3) The overhead of the interface is within 10 percent in almost all cases. 4) In the case of decision tree construction, combining different techniques turned out to be crucial for achieving high performance.

[1] R. Agrawal, T. Imielinski, and A. Swami, “Database Mining: A Performance Perspective,” IEEE Trans. Knowledge and Data Eng., vol. 5, no. 6, pp. 914-925, Dec. 1993.
[2] R. Agrawal and J. Shafer, “Parallel Mining of Association Rules,” IEEE Trans. Knowledge and Data Eng., vol. 8, no. 6, pp. 962-969, June 1996.
[3] R. Agrawal and R. Srikant, “Fast Algorithms for Mining Association Rules,” Proc. 1994 Int'l Conf. Very Large Databases (VLDB '94), pp. 487-499, Sept. 1994.
[4] K. Alsabti, S. Ranka, and V. Singh, “Clouds: Classification for Large or Out-of-Core Datasets,” http://www.cise.ufl.edu/rankadm.html, 1998.
[5] P. Becuzzi, M. Coppola, and M. Vanneschi, “Mining of Association Rules in Very Large Databases: A Structured Parallel Approach,” Proc. Europar-99, vol. 1685, pp. 1441-1450, Aug. 1999.
[6] W. Blume, R. Doallo, R. Eigenman, J. Grout, J. Hoelflinger, T. Lawrence, J. Lee, D. Padua, Y. Paek, B. Pottenger, L. Rauchwerger, and P. Tu, “Parallel Programming with Polaris,” Computer, vol. 29, no. 12, pp. 78-82, Dec. 1996.
[7] R.D. Blumofe, C.F. Joerg, et al., “Cilk: An Efficient Multithreaded Runtime System,” , Proc. Fifth ACM Conf. Principles and Practices of Parallel Programming (PPoPP), 1995.
[8] S. Brin, R. Motwani, J. Ullman, and S. Tsur, “Dynamic Itemset Counting and Implication Rules for Market Basket Data,” Proc. ACM SIGMOD Conf. Management of Data, May 1997.
[9] P. Cheeseman and J. Stutz, “Bayesian Classification (Autoclass): Theory and Practice,” Advanced in Knowledge Discovery and Data Mining, pp. 61-83, 1996.
[10] I.S. Dhillon and D.S. Modha, “A Data-Clustering Algorithm on Distributed Memory Multiprocessors,” Proc. Workshop Large-Scale Parallel KDD Systems, in conjunction with the Fifth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '99), pp. 47-56, Aug. 1999.
[11] R. Ferreira, G. Agrawal, and J. Saltz, “Compiling Object-Oriented Data Intensive Computations,” Proc. 2000 Int'l Conf. Supercomputing, May 2000.
[12] G. Forman and B. Zhang, “Distributed Data Clustering Can be Efficient and Exact,” Proc. SIGKDD Explorations, vol. 2, no. 2, Dec. 2000.
[13] J. Gehrke, V. Ganti, R. Ramakrishnan, and W. Loh, “Boat— Optimistic Decision Tree Construction,” Proc. ACM SIGMOD Conf. Management of Data, June 1999.
[14] J. Gehrke, R. Ramakrishnan, and V. Ganti, “Rainforest— A Framework for Fast Decision Tree Construction of Large Datasets,” Proc. Conf. Very Large Databases (VLDB), 1998.
[15] S. Goil and A. Choudhary, “Efficient Parallel Classification Using Dimensional Aggregates,” Proc. Workshop Large-Scala Parallel KDD Systems, with ACM SIGKDD-99, Aug. 1999.
[16] S. Goil and A. Choudhary, “PARSIMONY: An Infrastructure for Parallel Multidimensional Analysis and Data Mining,” J. Parallel and Distributed Computing, vol. 61, no. 3, pp. 285-321, Mar. 2001.
[17] E. Gutierrez, O. Plata, and E.L. Zapata, “A Compiler Method for the Parallel Execution of Irregular Reductions in Scalable Shared Memory Multiprocessors,” Proc. Int'l Conf. Supercomputing (ICS '00), pp. 78-87, May 2000.
[18] M. Hall, S. Amarsinghe, B. Murphy, S. Liao, and M. Lam, “Maximizing Multiprocessor Performance with the SUIF Compiler,” Computer, no. 12, Dec. 1996.
[19] E.-H. Han, G. Karypis, and V. Kumar, “Scalable Parallel Datamining for Association Rules,” Proc. ACM SIGMOD 1997, May 1997.
[20] E-H. Han, G. Karypis, and V. Kumar, “Scalable Parallel Datamining for Association Rules,” IEEE Trans. Data and Knowledge Eng., vol. 12, no. 3, May/June 2000.
[21] J. Han, J. Pei, and Y. Yin, “Mining Frequent Patterns without Candidate Generation,” Proc. ACM SIGMOD Conf. Management of Data, 2000.
[22] J. Han and M. Kamber, Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, 2000.
[23] J.L. Hennessy and D.A. Patterson, Computer Architecture: A Quantitative Approach, 2nd ed. Morgan Kaufmann, Inc., 1996.
[24] IBM. Db2, “Universal Database Goes Parallel with Enterprise and Enterprise-Extended Editions,” http://www-4.ibm.com/software/data/db2/udb 98eeebrochure, 1999.
[25] A.K. Jain and R.C. Dubes, Algorithms for Clustering Data. Prentice Hall, 1988.
[26] R. Jin and G. Agrawal, “An Efficient Implementation of Apriori Association Mining on Cluster of SMPs,” Proc. Workshop High Performance Data Mining (IPDPS 2001), Apr. 2001.
[27] R. Jin and G. Agrawal, “A Middleware for Developing Parallel Data Mining Implementations,” Proc. First SIAM Conf. Data Mining, Apr. 2001.
[28] R. Jin and G. Agrawal, “Performance Prediction for Random Write Reductions: A Case Study in Modeling Shared Memory Programs,” Proc. ACM SIGMETRICS, June 2002.
[29] M.V. Joshi, G. Karypis, and V. Kumar, “Scalparc: A New Scalable and Efficient Parallel Classification Algorithm for Mining Large Datasets,” Proc. Int'l Parallel Processing Symp., 1998.
[30] A. Kagi, “Mechanisms for Efficient Shared-Memory, Lock-Based Sychronization,” PhD thesis, Univ. of Wisconsin, Madison, 1999.
[31] A. Kagi, D. Burger, and J.R. Goodman, “Efficient Synchronization: Let Them Eat QOLB,” Proc. 24th Ann. Int'l Symp. Computer Architecture, pp. 170-180, June 1997.
[32] X. Li, R. Jin, and G. Agrawal, “A Compilation Framework for Distributed Memory Parallelizattion of Data Mining Algorithms,” submitted for publication, 2002.
[33] X. Li, R. Jin, and G. Agrawal, “Compiler and Runtime Support for Shared Memory Parallelization of Data Mining Algorithms,” Proc. Conf. Language and Compilers for Parallel Computing, Aug. 2002.
[34] Y. Lin and D. Padua, “On the Automatic Parallelization of Sparse and Irregular Fortran Programs,” Proc. Workshop Languages, Compilers, and Runtime Systems for Scalable Computers (LCR-98), May 1998.
[35] H. Lu, A.L. Cox, S. Dwarkadas, R. Rajamony, and W. Zwaenepoel, “Compiler and Software Distributed Shared Memory Support for Irregular Applications,” Proc. Sixth ACM SIGPLAN Symp. Principles and Practice of Parallel Programming (PPOPP), pp. 48-56, June 1997.
[36] T. Mahapatra and S. Mishra, Oracle Parallel Processing. O'Reilly Publishers, 2000.
[37] M. Mehta, R. Agrawal, and J. Rissanen, “SLIQ: A Fast Scalable Classifier for Data Mining,” Proc. Fifth Int'l Conf. Extending Database Technology, 1996.
[38] A. Mueller, “Fast Sequential and Parallel Algorithms for Association Rule Mining: A Comparison,” Technical Report CS-TR-3515, Univ. of Maryland, College Park, Aug. 1995.
[39] S.K. Murthy, “Automatic Construction of Decision Trees from Data: A Multi-Disciplinary Survey,” Data Mining and Knowledge Discovery, vol. 2, no. 4, pp. 345-389, 1998.
[40] G.J. Narlikar, “A Parallel, Multithreaded Decision Tree Builder,” Technical Report CMU-CS-98-184, School of Computer Science, Carnegie Mellon Univ., 1998.
[41] C.R. Palmer and C. Faloutsos, “Density Biases Sampling: An Improved Method for Data Mining and Clustering,” Proc. 2000 ACM SIGMOD Int'l Conf. Management of Data, June 2000.
[42] J.S. Park, M. Chen, and P.S. Yu, “An Effecitive Hash Based Algorithm for Mining Association Rules,” Proc. ACM SIGMOD Int'l Conf. Management of Data, May 1995.
[43] S. Parthasarathy, M. Zaki, and W. Li, “Memory Placement Techniques for Parallel Association Mining,” Proc. Fourth Int'l Conf. Knowledge Discovery and Data Mining (KDD), Aug. 1998.
[44] S. Parthasarathy, M. Zaki, M. Ogihara, and W. Li, “Parallel Data Mining for Association Rules on Shared-Memory Systems,” Knowledge and Information Systems, to appear, 2000.
[45] F. Provost and V. Kolluri, “A Survey of Methods for Scaling up Inductive Algorithms,” Knowledge Discovery and Data Mining, vol. 3, 1999.
[46] J.R. Quinlan, C4.5: Programs for Machine Learning. San Mateo, Calif.: Morgan Kaufmann, 1993.
[47] S. Ruggieri, “Efficient C4.5,” Technical Report TR-00-01, Dept. of Information, Univ. of Pisa, Feb. 1999.
[48] J.H. Saltz, R. Mirchandaney, and K. Crowley, “Run-Time Parallelization and Scheduling of Loops,” IEEE Trans. Computers, vol. 40, no. 5, pp. 603-612, May 1991.
[49] A. Savasere, E. Omiecinski, and S. Navathe, “An Efficient Algorithm for Mining Association Rules in Large Databases,” Proc. 21st Conf. Very Large Databases (VLDB), 1995.
[50] J. Shafer, R. Agrawal, and M. Mehta, “SPRINT: A Scalable Parallel Classifier for Data Mining,” Proc. 22nd Int'l Conf. Very Large Databases (VLDB), pp. 544-555, Sept. 1996.
[51] A. Shatdal, “Architectural Considerations for Parallel Query Evaluation Algorithms,” Technical Report CS-TR-1996-1321, Univ. of Wisconsin, 1999.
[52] D.B. Skillicorn, “Strategies for Parallel Data Mining,” IEEE Concurrency, Oct./Dec. 1999.
[53] A. Srivastava, E. Han, V. Kumar, and V. Singh, “Parallel Formulations of Decision-Tree Classification Algorithms,” Proc. 1998 Int'l Conf. Parallel Processing, 1998.
[54] H. Yu and L. Rauchwerger, “Adaptive Reduction Parallelization Techniques,” Proc. 2000 Int'l Conf. Supercomputing, pp. 66-75, May 2000.
[55] M.J. Zaki, C.-T. Ho, and R. Agrawal, “Parallel Classification for Data Mining on Shared-Memory Multiprocessors,” Proc. IEEE Int'l Conf. Data Eng., pp. 198-205, May 1999.
[56] M.J. Zaki, M. Ogihara, S. Parthasarathy, and W. Li, “Parallel Data Mining for Association Rules on Shared Memory Multiprocessors,” Proc. Conf. Supercomputing '96, Nov. 1996.
[57] M.J. Zaki, “Parallel and Distributed Association Mining: A Survey,” IEEE Concurrency, vol. 7, no. 4, pp. 14-25, 1999.

Index Terms:
Shared memory parallelization, programming interfaces, association mining, clustering, decision tree construction.
Citation:
Ruoming Jin, Ge Yang, Gagan Agrawal, "Shared Memory Parallelization of Data Mining Algorithms: Techniques, Programming Interface, and Performance," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 1, pp. 71-89, Jan. 2005, doi:10.1109/TKDE.2005.18
Usage of this product signifies your acceptance of the Terms of Use.