P-AutoClass: Scalable Parallel Clustering for Mining Large Data Sets
May/June 2003 (vol. 15 no. 3)
pp. 629-641
Clara Pizzuti and Domenico Talia, IEEE Computer Society

Abstract—Data clustering is an important task in the area of data mining. Clustering is the unsupervised classification of data items into homogeneous groups called clusters. Clustering methods partition a set of data items into clusters such that items in the same cluster are more similar to each other than to items in different clusters, according to some defined criteria. Clustering algorithms are computationally intensive, particularly when they are used to analyze large amounts of data. A possible approach to reducing the processing time is to implement clustering algorithms on scalable parallel computers. This paper describes the design and implementation of P-AutoClass, a parallel version of the AutoClass system based upon the Bayesian model for determining optimal classes in large data sets. The P-AutoClass implementation divides the clustering task among the processors of a multicomputer so that each processor works on its own partition and exchanges intermediate results with the other processors. The system architecture, its implementation, and experimental performance results for different numbers of processors and data sets are presented and compared with theoretical performance. In particular, the experimental and predicted scalability and efficiency of P-AutoClass versus the sequential AutoClass system are evaluated and compared.
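The parallel scheme the abstract describes can be illustrated with a minimal sketch. This is not the actual P-AutoClass code (a C/MPI system); it is a simulation, in plain Python, of the key idea: each processor computes partial sufficient statistics (counts, sums, sums of squares) on its own data partition, and a reduction step combines them, so the global statistics equal those of a sequential pass over all the data. The function names are illustrative, not taken from the system.

```python
# Hypothetical sketch of the data-parallel scheme used by P-AutoClass:
# partition the data, compute partial sufficient statistics locally,
# then combine them (the "exchange intermediate results" step).

def partial_stats(partition):
    """Sufficient statistics computed by one (simulated) processor."""
    n = len(partition)
    s = sum(partition)
    ss = sum(x * x for x in partition)
    return n, s, ss

def reduce_stats(stats):
    """Combine the per-processor partial statistics into global ones."""
    n = sum(t[0] for t in stats)
    s = sum(t[1] for t in stats)
    ss = sum(t[2] for t in stats)
    return n, s, ss

def mean_var(n, s, ss):
    """Mean and variance from the combined sufficient statistics."""
    mean = s / n
    var = ss / n - mean * mean
    return mean, var

if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
    P = 4  # number of simulated processors
    partitions = [data[i::P] for i in range(P)]  # block-cyclic split
    n, s, ss = reduce_stats([partial_stats(p) for p in partitions])
    print(mean_var(n, s, ss))  # identical to a sequential pass over 'data'
```

Because the statistics are additive, only these small per-cluster summaries need to be exchanged between processors, not the data items themselves, which is what makes the approach scale with data set size.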

[1] M.S. Aldenderfer and R.K. Blashfield, Cluster Analysis. Sage Publications, 1986.
[2] R. Agrawal et al., “Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications,” Proc. ACM SIGMOD Int'l Conf. Management of Data, ACM Press, pp. 94-105, 1998.
[3] K. Alsabti, S. Ranka, and V. Singh, “An Efficient K-Means Clustering Algorithm,” Proc. First Workshop High Performance Data Mining, 1998.
[4] P. Cheeseman, J. Stutz, M. Self, W. Taylor, J. Goebel, K. Volk, and H. Walker, Automatic Classification of Spectra from the Infrared Astronomical Satellite (IRAS), NASA Reference Publication 1217, 1989.
[5] P. Cheeseman and J. Stutz, “Bayesian Classification (AutoClass): Theory and Results,” Advances in Knowledge Discovery and Data Mining, AAAI Press/MIT Press, pp. 61-83, 1996.
[6] A.P. Dempster, N.M. Laird, and D.B. Rubin, “Maximum Likelihood from Incomplete Data Via the EM Algorithm,” J. Royal Statistical Soc., Series B, vol. 39, no. 1, pp. 1-38, 1977.
[7] B.S. Everitt and D.J. Hand, Finite Mixture Distributions. London: Chapman and Hall, 1981.
[8] M. Ester, H. Kriegel, J. Sander, and X. Xu, “A Density Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise,” Proc. Second Int'l Conf. Knowledge Discovery and Data Mining, AAAI Press, 1996.
[9] B. Everitt, Cluster Analysis. London: Heinemann Educational Books Ltd, 1977.
[10] A.A. Freitas and S.H. Lavington, Mining Very Large Databases with Parallel Processing. Kluwer Academic Publishers, 1998.
[11] U.M. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, “From Data Mining to Knowledge Discovery: An Overview,” Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, U.M. Fayyad et al., eds., pp. 1-34, 1996.
[12] G. Folino, G. Spezzano, and D. Talia, “Evaluating and Modeling Communication Overhead of MPI Primitives on the Meiko CS-2,” Recent Advances in Parallel Virtual Machine and Message Passing Interface—Proc. EuroPVM/MPI '98, pp. 27-35, Sept. 1998.
[13] A. Grama, A. Gupta, and V. Kumar, “Isoefficiency Function: A Scalability Metric for Parallel Algorithms and Architectures,” IEEE Trans. Parallel and Distributed Technology, vol. 1, no. 3, pp. 12-21, 1993.
[14] S. Guha, R. Rastogi, and K. Shim, “CURE: An Efficient Clustering Algorithm for Large Databases,” Proc. ACM SIGMOD, pp. 73-84, June 1998.
[15] L. Hunter and D.J. States, “Bayesian Classification of Protein Structure,” IEEE Expert, vol. 7, no. 4, pp. 67-75, 1992.
[16] G. Karypis, E.-H. Han, and V. Kumar, “Chameleon: A Hierarchical Clustering Algorithm Using Dynamic Modeling,” Computer, pp. 68-75, Aug. 1999.
[17] B. Kanefsky, J. Stutz, P. Cheeseman, and W. Taylor, “An Improved Automatic Classification of a Landsat/TM Image from Kansas (FIFE),” Technical Report FIA-94-01, NASA Ames Research Center, May 1994.
[18] L. Kaufman and P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley and Sons, 1990.
[19] A.K. Jain and R.C. Dubes, Algorithms for Clustering Data. Englewood Cliffs, N.J.: Prentice Hall, 1988.
[20] A.K. Jain, M.N. Murty, and P.J. Flynn, “Data Clustering: A Review,” ACM Computing Surveys, vol. 31, no. 3, pp. 264-323, 1999.
[21] D. Judd, P. McKinley, and A. Jain, “Performance Evaluation on Large-Scale Parallel Clustering in NOW Environments,” Proc. Eighth SIAM Conf. Parallel Processing for Scientific Computing, Mar. 1997.
[22] D. Judd, P. McKinley, and A.K. Jain, “Large-Scale Parallel Data Clustering,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 20, no. 8, pp. 871-876, Aug. 1998.
[23] R. Miller and Y. Guo, “Parallelisation of AutoClass,” Proc. Parallel Computing Workshop (PCW '97), 1997.
[24] R.T. Ng and J. Han, “Efficient and Effective Clustering Methods for Spatial Data Mining,” Proc. 20th Int'l Conf. Very Large Databases, Morgan Kaufmann, pp. 144-155, 1994.
[25] C.F. Olson, “Parallel Algorithms for Hierarchical Clustering,” Parallel Computing, vol. 21, pp. 1313-1325, 1995.
[26] J.T. Potts, “Seeking Parallelism in Discovery Programs,” master thesis, Univ. of Texas at Arlington, 1996.
[27] K. Stoffel and A. Belkoniene, “Parallel K-Means Clustering for Large Data Sets,” Proc. EuroPar '99—Parallel Processing, pp. 1451-1454, 1999.
[28] D.B. Skillicorn and D. Talia, “Models and Languages for Parallel Computation,” ACM Computing Surveys, vol. 30, no. 2, pp. 123-169, 1998.
[29] D. Talia, “Esplicitazione del Parallelismo nelle Tecniche di Data Mining” (“Exploiting Parallelism in Data Mining Techniques”), Proc. Sistemi Evoluti per Basi di Dati (SEBD '99), pp. 387-401, June 1999.
[30] D.M. Titterington, A.F.M. Smith, and U.E. Makov, Statistical Analysis of Finite Mixture Distributions. New York: John Wiley and Sons, 1985.
[31] Z. Xu and K. Hwang, “Modeling Communication Overhead: MPI and MPL Performance on the IBM SP2,” IEEE Parallel & Distributed Technology, vol. 4, no. 1, pp. 9-23, Spring 1996.
[32] T. Zhang, R. Ramakrishnan, and M. Livny, “BIRCH: An Efficient Data Clustering Method for Very Large Databases,” Proc. ACM SIGMOD Int'l Conf. Management of Data, ACM Press, pp. 103-114, 1996.

Index Terms:
Data mining, parallel processing, knowledge discovery, data clustering, unsupervised classification, isoefficiency, scalability.
Clara Pizzuti, Domenico Talia, "P-AutoClass: Scalable Parallel Clustering for Mining Large Data Sets," IEEE Transactions on Knowledge and Data Engineering, vol. 15, no. 3, pp. 629-641, May-June 2003, doi:10.1109/TKDE.2003.1198395