Issue No.03 - March (2009 vol.21)

pp: 335-350

Christopher Leckie, The University of Melbourne, Melbourne

Kotagiri Ramamohanarao, The University of Melbourne, Melbourne

James Bezdek, The University of Melbourne, Melbourne

DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TKDE.2008.158

ABSTRACT

Clustering is a popular tool for exploratory data analysis. One of the major problems in cluster analysis is determining the number of clusters in unlabeled data, which is a basic input for most clustering algorithms. In this paper, we investigate a new method called Dark Block Extraction (DBE) for automatically estimating the number of clusters in unlabeled data sets. DBE is based on an existing algorithm for Visual Assessment of cluster Tendency (VAT) of a data set and uses several common image and signal processing techniques. The basic steps are: 1) generating a VAT image of an input dissimilarity matrix; 2) performing image segmentation on the VAT image to obtain a binary image, followed by directional morphological filtering; 3) applying a distance transform to the filtered binary image and projecting the pixel values onto the main diagonal axis of the image to form a projection signal; and 4) smoothing the projection signal, computing its first-order derivative, and then detecting major peaks and valleys in the resulting signal to decide the number of clusters. Our new DBE method is nearly "automatic", depending on just one easy-to-set parameter. Several numerical and real-world examples are presented to illustrate the effectiveness of DBE.
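To make step 1 concrete, the following is a minimal sketch of the VAT reordering as it is commonly described: a Prim-like ordering that starts from an endpoint of the largest dissimilarity and repeatedly appends the unselected object closest to the objects already selected. It is not the authors' implementation. The function names (`pairwise_distances`, `vat_order`, `reorder`) and the toy points are assumptions made for illustration, and steps 2 to 4 (segmentation, morphological filtering, distance transform, projection and peak detection) are not covered.

```python
def pairwise_distances(points):
    """Euclidean dissimilarity matrix for a list of coordinate tuples."""
    return [
        [sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5 for q in points]
        for p in points
    ]


def vat_order(dissim):
    """Return a VAT-style ordering of object indices.

    `dissim` is assumed to be a square, symmetric matrix (list of lists)
    of non-negative dissimilarities.
    """
    n = len(dissim)
    # Start from the row holding the largest dissimilarity in the matrix.
    start = max(range(n), key=lambda i: max(dissim[i]))
    order = [start]
    remaining = set(range(n)) - {start}
    # Smallest dissimilarity from each unselected object to the selected set.
    nearest = {j: dissim[start][j] for j in remaining}
    while remaining:
        # Append the unselected object closest to the selected set
        # (ties are broken by the lower index).
        nxt = min(remaining, key=lambda j: (nearest[j], j))
        order.append(nxt)
        remaining.remove(nxt)
        del nearest[nxt]
        for j in remaining:
            if dissim[nxt][j] < nearest[j]:
                nearest[j] = dissim[nxt][j]
    return order


def reorder(dissim, order):
    """Permute the rows and columns of `dissim` according to `order`."""
    return [[dissim[i][j] for j in order] for i in order]


# Toy data (an assumption for illustration): two compact groups, interleaved.
points = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]
dissim = pairwise_distances(points)
order = vat_order(dissim)
vat_matrix = reorder(dissim, order)
print(order)
for row in vat_matrix:
    print(" ".join(f"{value:5.1f}" for value in row))
```

In the reordered matrix, small dissimilarities gather in square blocks along the main diagonal. Displayed as a grayscale image, these are the dark blocks that the later steps of DBE are designed to count.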

INDEX TERMS

Clustering, Cluster Tendency, Data and knowledge visualization, Database Applications, Database Management, Information Technology

CITATION

Christopher Leckie, Kotagiri Ramamohanarao, James Bezdek, "Automatically Determining the Number of Clusters in Unlabeled Data Sets",

*IEEE Transactions on Knowledge & Data Engineering*, vol. 21, no. 3, pp. 335-350, March 2009, doi:10.1109/TKDE.2008.158