This Article 
   
 Share 
   
 Bibliographic References 
   
 Add to: 
 
Digg
Furl
Spurl
Blink
Simpy
Google
Del.icio.us
Y!MyWeb
 
 Search 
   
Effect of Data Skewness and Workload Balance in Parallel Data Mining
May/June 2002 (vol. 14 no. 3)
pp. 498-514

Abstract—To mine association rules efficiently, we have developed a new parallel mining algorithm FPM on a distributed share-nothing parallel system in which data are partitioned across the processors. FPM is an enhancement of the FDM algorithm, which we previously proposed for distributed mining of association rules. FPM requires fewer rounds of message exchanges than FDM and, hence, has a better response time in a parallel environment. The algorithm has been experimentally found to outperform CD, a representative parallel algorithm for the same goal. The efficiency of FPM is attributed to the incorporation of two powerful candidate sets pruning techniques: distributed and global prunings. The two techniques are sensitive to two data distribution characteristics, data skewness, and workload balance. Metrics based on entropy are proposed for these two characteristics. The prunings are very effective when both the skewness and balance are high. In order to increase the efficiency of FPM, we have developed methods to partition a database so that the resulting partitions have high balance and skewness. Experiments have shown empirically that our partitioning algorithms can achieve these aims very well, in particular, the results are consistently better than a random partitioning. Moreover, the partitioning algorithms incur little overhead. So, using our partitioning algorithms and FPM together, we can mine association rules from a database efficiently.

[1] R. Agrawal, T. Imielinski, and A. Swami, “Mining Association Rules Between Sets of Items in Large Databases,” Proc. 1993 ACM-SIGMOD Int'l Conf. Management of Data, pp. 207-216, May 1993.
[2] R. Agrawal and R. Srikant, “Fast Algorithms for Mining Association Rules,” Proc. 1994 Int'l Conf. Very Large Data Bases, pp. 487-499, Sept. 1994.
[3] R. Agrawal and J.C. Shafer, “Parallel Mining of Association Rules: Design, Implementation and Experience,” Technical Report TJ10004, IBM Research Division, Almaden Research Center, 1996.
[4] S. Brin, R. Motwani, J. Ullman, and S. Tsur, “Dynamic Itemset Counting and Implication Rules for Market Basket Data,” ACM SIGMOD Conf. Management of Data, May 1997.
[5] T.M. Cover and J.A. Thomas, Elements of Information Theory. John Wiley&Sons, 1991.
[6] D. Cheung, J. Han, V. Ng, and C.Y. Wong, Maintenance of Discovered Association Rules in Large Databases: An Incremental Updating Technique Proc. 1996 Int'l Conf. Data Eng., pp. 106-114, Feb. 1996.
[7] D. Cheung, J. Han, V. Ng, A. Fu, and Y. Fu, “A Fast Distributed Algorithm for Mining Association Rules,” Fourth Int'l Conf. Parallel and Distributed Information Systems, Dec. 1996.
[8] D.W. Cheung, V.T. Ng, W. Fu, and Y. Fu, “Efficient Mining Association Rules in Distributed Databases,” IEEE Trans. Knowledge and Data Eng., vol. 8, no. 6, pp. 911-922, Dec. 1996.
[9] S.K. Gupta, Linear Programming and Network Models. New Delhi: Affiliated East-West Press, 1985.
[10] E.-H. Han, G. Karypis, and V. Kumar, “Scalable Parallel Data Mining for Association Rules,” ACM SIGMOD Conf. Management of Data, May 1997.
[11] J. Han and Y. Fu, “Discovery of Multiple-Level Association Rules from Large Databases,” Proc. 1995 Int'l Conf. Very Large Data Bases, pp. 420-431, Sept. 1995.
[12] M. Houtsma and A. Swami, “Set-Oriented Mining of Association Rules in Relational Databases,” 11th Int'l Conf. Data Eng., 1995.
[13] International Business Machines, Scalable POWERparallel Systems, GA23-2475-02 ed. Feb. 1995.
[14] L. Kaufman and P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley and Sons, 1990.
[15] H. Mannila, H. Toivonen, and A.I. Verkamo, “Efficient Algorithms for Discovering Association Rules,” AAAI Workshop Knowledge Discovery in Databases (KDD-94), July 1994.
[16] Message Passing Interface Forum, MPI: A Message-Passing Interface Standard. May 1994.
[17] J.S. Park, M.S. Chen, and P.S. Yu, “An Effective Hash-Based Algorithm for Mining Association Rules,” Proc. 1995 ACM-SIGMOD Int'l Conf. Management of Data, pp. 175-186, May 1995.
[18] J. Park, M. Chen, and P. Yu, Efficient Parallel Data Mining for Association Rules Proc. Fourth Int'l Conf. Information and Knowledge Management, pp. 31-36, 1995.
[19] A. Savasere, E. Omiecinski, and S. Navathe, “An Efficient Algorithm for Mining Association Rules in Large Databases,” Proc. 1995 Int'l Conf. Very Large Data Bases, pp. 432-443, Sept. 1995.
[20] R. Srikant and R. Agrawal, “Mining Generalized Association Rules,” Proc. 1995 Int'l Conf. Very Large Data Bases, pp. 407-419, Sept. 1995.
[21] R. Srikant and R. Agrawal, “Mining Sequential Patterns: Generalizations and Performance Improvements,” Proc. Fifth Int'l Conf. Extending Database Technology (EDBT), pp. 3-17, 1996.
[22] R. Srikant and R. Agrawal, “Mining Quantitative Association Rules in Large Relational Tables,” Proc. 1996 ACM-SIGMOD Int'l Conf. Management of Data, pp. 1-12, June 1996.
[23] T. Shintani and M. Kitsuregawa, Hash Based Parallel Algorithms for Mining Association Rules Proc. Conf. Parallel and Distributed Information Systems, pp. 19-30, 1996.
[24] H. Toivonen, “Sampling Large Databases for Association Rules,” Proc. 1996 Int'l Conf. Very Large Data Bases, pp. 134-145, Sept. 1996.
[25] M.J. Zaki, M. Ogihara, S. Parthasarathy, and W. Li, “Parallel Data Mining for Association Rules on Shared-Memory Multi-Processors,” Technical Report 618, Computer Science Dept., The Univ. of Rochester, May 1996.

Index Terms:
Association rules, data mining, data skewness, workload balance, parallel mining, partitioning
Citation:
D.W. Cheung, S.D. Lee, V. Xiao, "Effect of Data Skewness and Workload Balance in Parallel Data Mining," IEEE Transactions on Knowledge and Data Engineering, vol. 14, no. 3, pp. 498-514, May-June 2002, doi:10.1109/TKDE.2002.1000339
Usage of this product signifies your acceptance of the Terms of Use.