Dynamic Cluster Formation Using Level Set Methods
June 2006 (vol. 28 no. 6)
pp. 877-889
Density-based clustering has the advantages of 1) allowing clusters of arbitrary shape and 2) not requiring the number of clusters as input. However, when clusters touch each other, both the cluster centers and the cluster boundaries (the peaks and valleys of the density distribution) become fuzzy and difficult to determine. We introduce the notion of a cluster intensity function (CIF), which captures the important characteristics of clusters. When clusters are well separated, CIFs are similar to density functions; but when clusters come close to each other, CIFs still clearly reveal the cluster centers, the cluster boundaries, and the degree of membership of each data point in the cluster to which it belongs. Clustering through bump hunting and valley seeking based on these functions is more robust than that based on density functions obtained by kernel density estimation, which are often oscillatory or oversmoothed. These problems of kernel density estimation are resolved using level set methods and related techniques. Comparisons with two existing density-based methods, valley seeking and DBSCAN, illustrate the advantages of our approach.
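To make the baseline the abstract argues against concrete, the following is a minimal sketch of density-based clustering via kernel density estimation with bump hunting (local maxima as cluster centers) and valley seeking (local minima as boundaries). This is an illustrative 1D Parzen-window example, not the paper's CIF/level-set method; the function names, the synthetic data, and the bandwidth choice are all my own assumptions.

```python
import numpy as np

def gaussian_kde(data, grid, bandwidth):
    """Parzen-window density estimate with a Gaussian kernel on a 1D grid."""
    # Pairwise (grid point - sample) differences, shape (len(grid), len(data))
    diffs = (grid[:, None] - data[None, :]) / bandwidth
    kernels = np.exp(-0.5 * diffs**2) / np.sqrt(2.0 * np.pi)
    return kernels.sum(axis=1) / (len(data) * bandwidth)

def bumps_and_valleys(density):
    """Indices of local maxima (candidate centers) and minima (boundaries)."""
    peaks, valleys = [], []
    for i in range(1, len(density) - 1):
        if density[i] > density[i - 1] and density[i] > density[i + 1]:
            peaks.append(i)
        elif density[i] < density[i - 1] and density[i] < density[i + 1]:
            valleys.append(i)
    return peaks, valleys

rng = np.random.default_rng(0)
# Two 1D clusters; their tails overlap mildly near the origin
data = np.concatenate([rng.normal(-2.0, 0.5, 200), rng.normal(2.0, 0.5, 200)])
grid = np.linspace(-5.0, 5.0, 400)
density = gaussian_kde(data, grid, bandwidth=0.3)
peaks, valleys = bumps_and_valleys(density)
print("cluster centers near:", grid[peaks])
print("cluster boundaries near:", grid[valleys])
```

Note that a smaller bandwidth makes the estimate oscillatory (spurious bumps around isolated tail samples), while a larger one oversmooths and can merge the two peaks: exactly the trade-off the abstract says CIFs computed via level set methods avoid.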

[1] J. Han and M. Kamber, Data Mining: Concepts and Techniques. Morgan Kaufmann, 2001.
[2] A.K. Jain, M.N. Murty, and P.J. Flynn, “Data Clustering: A Review,” ACM Computing Surveys, vol. 31, no. 3, pp. 264-323, 1999.
[3] L. Kaufman and P. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley and Sons, 1990.
[4] P. Berkhin, “Survey of Clustering Data Mining Techniques,” technical report, Accrue Software, 2002.
[5] P. Hansen and B. Jaumard, “Cluster Analysis and Mathematical Programming,” Math. Programming, vol. 79, pp. 191-215, 1997.
[6] S.P. Lloyd, “Least Squares Quantization in PCM,” IEEE Trans. Information Theory, vol. 28, no. 2, pp. 129-137, 1982.
[7] J. MacQueen, “Some Methods for Classification and Analysis of Multivariate Observations,” Proc. Fifth Berkeley Symp. Math. Statistics and Probability, vol. I: Statistics, pp. 281-297, 1967.
[8] A. Banerjee, S. Merugu, I. Dhillon, and J. Ghosh, “Clustering with Bregman Divergences,” Proc. Fourth SIAM Int'l Conf. Data Mining, pp. 234-245, 2004.
[9] R.T. Ng and J. Han, “Efficient and Effective Clustering Methods for Spatial Data Mining,” Proc. 20th Int'l Conf. Very Large Data Bases, pp. 144-155, 1994.
[10] C. Ding and X. He, “Cluster Aggregate Inequality and Multi-Level Hierarchical Clustering,” Proc. Ninth European Conf. Principles of Data Mining and Knowledge Discovery, pp. 71-83, 2005.
[11] G. Karypis, E. Han, and V. Kumar, “CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling,” Computer, vol. 32, pp. 68-75, 1999.
[12] K. Fukunaga, Introduction to Statistical Pattern Recognition, second ed. Boston: Academic Press, 1990.
[13] M. Ester, H. Kriegel, J. Sander, and X. Xu, “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise,” Proc. Int'l Conf. Knowledge Discovery and Data Mining, pp. 226-231, 1996.
[14] J. Sander, M. Ester, H. Kriegel, and X. Xu, “Density-Based Clustering in Spatial Databases: The Algorithm GDBSCAN and Its Applications,” Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 169-194, 1998.
[15] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan, “Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications,” Proc. ACM-SIGMOD Int'l Conf. Management of Data, pp. 94-105, 1998.
[16] A. Hinneburg and D.A. Keim, “An Efficient Approach to Clustering in Large Multimedia Databases with Noise,” Proc. Int'l Conf. Knowledge Discovery and Data Mining, pp. 58-65, 1998.
[17] M. Ankerst, M. Breunig, H.P. Kriegel, and J. Sander, “OPTICS: Ordering Points to Identify the Clustering Structure,” Proc. ACM-SIGMOD Int'l Conf. Management of Data, pp. 49-60, 1999.
[18] W. Wang, J. Yang, and R. Muntz, “STING: A Statistical Information Grid Approach to Spatial Data Mining,” Proc. 23rd Very Large Data Bases Conf., pp. 186-195, 1997.
[19] G. Sheikholeslami, S. Chatterjee, and A. Zhang, “WaveCluster: A Wavelet-Based Clustering Approach for Spatial Data in Very Large Databases,” The Very Large Databases J., vol. 8, pp. 289-304, 2000.
[20] C. Ding, X. He, H. Zha, M. Gu, and H. Simon, “Spectral Min-Max Cut for Graph Partitioning and Data Clustering,” Proc. First IEEE Int'l Conf. Data Mining, pp. 107-114, 2001.
[21] L. Ertoz, M. Steinbach, and V. Kumar, “Finding Clusters of Different Sizes, Shapes, and Densities in Noisy, High Dimensional Data,” Proc. Third SIAM Int'l Conf. Data Mining, pp. 47-58, 2003.
[22] R. Sharan and R. Shamir, “CLICK: A Clustering Algorithm with Applications to Gene Expression Analysis,” Proc. Int'l Conf. Intelligent Systems for Molecular Biology, pp. 307-316, 2000.
[23] G.J. McLachlan and D. Peel, Finite Mixture Models. Wiley, 2001.
[24] E. Parzen, “On Estimation of a Probability Density Function and Mode,” Ann. Math. Statistics, vol. 33, pp. 1065-1076, 1962.
[25] R. Duda, P. Hart, and D. Stork, Pattern Classification, second ed. Wiley-Interscience, 2001.
[26] S.R. Sain, K.A. Baggerly, and D.W. Scott, “Cross-Validation of Multivariate Densities,” J. Am. Statistical Assoc., vol. 89, pp. 807-817, 1994.
[27] J.A. Sethian, Level Set Methods and Fast Marching Methods, second ed. New York: Cambridge Univ. Press, 1999.
[28] S. Osher and R. Fedkiw, Level Set Methods and Dynamic Implicit Surfaces. New York: Springer-Verlag, 2003.
[29] A.K. Jain, Fundamentals of Digital Image Processing. Prentice Hall, 1988.
[30] S. Osher and J.A. Sethian, “Fronts Propagating with Curvature-Dependent Speed: Algorithms Based on Hamilton-Jacobi Formulations,” J. Computational Physics, vol. 79, pp. 12-49, 1988.
[31] H.K. Zhao, T. Chan, B. Merriman, and S. Osher, “A Variational Level Set Approach to Multiphase Motion,” J. Computational Physics, vol. 127, pp. 179-195, 1996.
[32] V. Caselles, F. Catté, T. Coll, and F. Dibos, “A Geometric Model for Active Contours in Image Processing,” Numerische Mathematik, vol. 66, pp. 1-31, 1993.
[33] A. Ben-Hur, D. Horn, H.T. Siegelmann, and V. Vapnik, “Support Vector Clustering,” J. Machine Learning Research, vol. 2, pp. 125-137, 2001.
[34] P. Spellman, G. Sherlock, M. Zhang, V. Iyer, and K. Anders, “Comprehensive Identification of Cell Cycle-Regulated Genes of the Yeast Saccharomyces Cerevisiae by Microarray Hybridization,” Molecular Biology of the Cell, vol. 9, no. 12, pp. 3273-3297, 1998.
[35] I.T. Jolliffe, Principal Component Analysis, second ed. Springer, 2002.
[36] G. Sapiro, Geometric Partial Differential Equations. New York: Cambridge Univ. Press, 2001.
[37] Y.R. Tsai, L. Cheng, S. Osher, and H. Zhao, “Fast Sweeping Algorithms for a Class of Hamilton-Jacobi Equations,” SIAM J. Numerical Analysis, vol. 41, no. 2, pp. 673-694, 2003.
[38] C. Zenger, “Sparse Grids,” Parallel Algorithms for Partial Differential Equations, vol. 31, 1991.
[39] H. Bungartz and M. Griebel, “Sparse Grids,” Acta Numerica, pp. 147-269, 2004.
[40] J. Garcke and M. Griebel, “Classification with Sparse Grids Using Simplicial Basis Functions,” Proc. Seventh ACM SIGKDD, pp. 87-96, 2001.
[41] J. Garcke, M. Griebel, and M. Thess, “Data Mining with Sparse Grids,” Computing, vol. 67, no. 3, pp. 225-253, Mar. 2001.
[42] J. Garcke, M. Hegland, and O. Nielsen, “Parallelisation of Sparse Grids for Large Scale Data Analysis,” Proc. Int'l Conf. Computational Science, pp. 683-692, 2003.

Index Terms:
Dynamic clustering, level set methods, cluster intensity functions, kernel density estimation, cluster contours, partial differential equations.
Andy M. Yip, Chris Ding, Tony F. Chan, "Dynamic Cluster Formation Using Level Set Methods," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 6, pp. 877-889, June 2006, doi:10.1109/TPAMI.2006.117