Scalable Model-Based Clustering for Large Databases Based on Data Summarization
November 2005 (vol. 27 no. 11)
pp. 1710-1719
The scalability problem in data mining involves the development of methods for handling large databases with limited computational resources such as memory and computation time. In this paper, two scalable clustering algorithms, bEMADS and gEMADS, are presented based on the Gaussian mixture model. Both summarize data into subclusters and then generate Gaussian mixtures from their data summaries. Their core algorithm, EMADS, is defined on data summaries and approximates the aggregate behavior of each subcluster of data under the Gaussian mixture model. EMADS is provably convergent. Experimental results substantiate that both algorithms can run several orders of magnitude faster than expectation-maximization with little loss of accuracy.
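The two-stage idea in the abstract (summarize first, then fit the mixture to the summaries) can be illustrated with a short sketch. This is not the EMADS algorithm from the paper, whose exact update rules are not given here. It is a simplified weighted EM under stated assumptions: each subcluster is summarized by its count, mean, and covariance; the summaries come from a hypothetical fixed-width grid; and each component density is evaluated only at the subcluster mean. The function names `summarize_on_grid`, `log_gaussian`, and `em_on_summaries`, and the parameters `cell_width` and `reg`, are illustrative choices and do not come from the paper.

```python
import numpy as np

def summarize_on_grid(X, cell_width):
    """Assumed summarization step: one (count, mean, covariance) per grid cell."""
    keys = np.floor(X / cell_width).astype(int)
    _, inverse = np.unique(keys, axis=0, return_inverse=True)
    inverse = inverse.ravel()
    counts, means, covs = [], [], []
    for m in range(inverse.max() + 1):
        pts = X[inverse == m]
        mu = pts.mean(axis=0)
        diff = pts - mu
        counts.append(len(pts))
        means.append(mu)
        covs.append(diff.T @ diff / len(pts))  # within-cell covariance
    return np.array(counts, dtype=float), np.array(means), np.array(covs)

def log_gaussian(x, mean, cov):
    """Log density of N(mean, cov) at each row of x."""
    d = x.shape[1]
    diff = x - mean
    maha = np.sum(diff * np.linalg.solve(cov, diff.T).T, axis=1)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

def em_on_summaries(counts, means, covs, k, n_iter=100, reg=1e-6):
    """Simplified EM that sees only subcluster summaries, never raw points."""
    M, d = means.shape
    N = counts.sum()
    # Deterministic farthest-point initialization over subcluster means.
    chosen = [int(np.argmax(counts))]
    while len(chosen) < k:
        d2 = ((means[:, None, :] - means[chosen][None, :, :]) ** 2).sum(-1)
        chosen.append(int(np.argmax(d2.min(axis=1))))
    mu = means[chosen].copy()
    # Spherical initial covariance from the overall variance of the data.
    gm = (counts[:, None] * means).sum(axis=0) / N
    total = counts * (np.trace(covs, axis1=1, axis2=2) + ((means - gm) ** 2).sum(1))
    sigma = np.tile(np.eye(d) * (total.sum() / (N * d)), (k, 1, 1))
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each subcluster,
        # evaluated at the subcluster mean (a simplifying assumption).
        log_r = np.stack(
            [np.log(weights[j]) + log_gaussian(means, mu[j], sigma[j])
             for j in range(k)], axis=1)
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: every subcluster contributes in proportion to its count.
        w = counts[:, None] * r
        nk = w.sum(axis=0)
        weights = nk / N
        mu = (w.T @ means) / nk[:, None]
        for j in range(k):
            diff = means - mu[j]
            # Within-subcluster spread plus spread of the subcluster mean.
            scatter = covs + diff[:, :, None] * diff[:, None, :]
            sigma[j] = (w[:, j, None, None] * scatter).sum(axis=0) / nk[j]
            sigma[j] += reg * np.eye(d)
    return weights, mu, sigma
```

Calling `summarize_on_grid(X, cell_width)` once and then `em_on_summaries(counts, means, covs, k)` keeps the cost of each EM iteration proportional to the number of subclusters rather than the number of records, which is the source of the speedup the abstract describes. Adding each subcluster's own covariance in the M-step keeps the within-subcluster spread from being lost when the raw points are discarded.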

Index Terms: Scalable clustering, Gaussian mixture model, expectation-maximization, data summary, maximum penalized likelihood estimate.
Huidong Jin, Man-Leung Wong, K.-S. Leung, "Scalable Model-Based Clustering for Large Databases Based on Data Summarization," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 11, pp. 1710-1719, Nov. 2005, doi:10.1109/TPAMI.2005.226