Issue No.02 - March-April (2013 vol.10)
pp: 401-414
Ariel E. Baya, French Argentine Int. Center for Inf. & Syst. Sci., UPCAM, France
Pablo M. Granitto, French Argentine Int. Center for Inf. & Syst. Sci., UPCAM, France
Clustering validation indexes are intended to assess the goodness of clustering results. Many methods used to estimate the number of clusters rely on a validation index as a key element to find the correct answer. This paper presents a new validation index based on graph concepts, which has been designed to find arbitrary-shaped clusters by exploiting the spatial layout of the patterns and their clustering labels. This new clustering index is combined with a solid statistical detection framework, the gap statistic. The resulting method is able to find the right number of arbitrary-shaped clusters in diverse situations, as we show with examples where the true number of clusters is known. A comparison with several relevant validation methods is carried out using artificial and gene expression data sets. The results are very encouraging, showing that the underlying structure in the data can be more accurately detected with the new clustering index. Our gene expression data results also indicate that this new index is stable under perturbation of the input data.
Indexes, Clustering algorithms, Shape, Equations, Kernel, Bars, Algorithm design and analysis, genomic data, Validation index, clustering
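The abstract pairs a validation index with the gap statistic to pick the number of clusters. As a rough illustration of that detection framework only, the sketch below implements the gap statistic with two stand-ins that are assumptions, not the paper's method: the usual pooled within-cluster sum of squares replaces the graph-based index (which the abstract does not specify), and a plain Lloyd's k-means produces the labels. The function names `dispersion`, `kmeans`, and `gap_statistic`, the uniform bounding-box reference, and the toy data are all illustrative choices.

```python
import numpy as np

def dispersion(X, labels):
    """Pooled within-cluster sum of squared distances to each cluster mean."""
    return sum(((X[labels == j] - X[labels == j].mean(axis=0)) ** 2).sum()
               for j in np.unique(labels))

def kmeans(X, k, rng, n_init=10, n_iter=50):
    """Plain Lloyd's k-means (a stand-in clusterer); best of n_init restarts."""
    best_w, best_labels = np.inf, None
    for _ in range(n_init):
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = d.argmin(axis=1)
            # Keep the old center if a cluster happens to become empty.
            new_centers = np.array([
                X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                for j in range(k)
            ])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        w = dispersion(X, labels)
        if w < best_w:
            best_w, best_labels = w, labels
    return best_labels

def gap_statistic(X, k_max=6, n_ref=10, seed=0):
    """Return (estimated k, list of Gap(1..k_max)) using a uniform reference."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    gaps, errs = [], []
    for k in range(1, k_max + 1):
        log_w = np.log(dispersion(X, kmeans(X, k, rng)))
        # Reference data: uniform samples over the bounding box of X.
        ref = np.array([
            np.log(dispersion(R, kmeans(R, k, rng)))
            for R in (rng.uniform(lo, hi, size=X.shape) for _ in range(n_ref))
        ])
        gaps.append(ref.mean() - log_w)
        errs.append(ref.std() * np.sqrt(1 + 1 / n_ref))
    # Smallest k with Gap(k) >= Gap(k+1) - s_{k+1}.
    for k in range(1, k_max):
        if gaps[k - 1] >= gaps[k] - errs[k]:
            return k, gaps
    return k_max, gaps

# Toy data (hypothetical): three compact, well-separated Gaussian blobs.
data_rng = np.random.default_rng(42)
blob_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + 0.5 * data_rng.standard_normal((40, 2)) for c in blob_centers])
k_hat, gaps = gap_statistic(X, k_max=5)
print("estimated number of clusters:", k_hat)
```

A sum-of-squares index like this one favors compact, roughly spherical groups, which is the limitation the abstract's graph-based index is meant to address for arbitrary-shaped clusters; swapping `dispersion` for another index would leave the rest of the procedure unchanged.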
Ariel E. Baya, Pablo M. Granitto, "How Many Clusters: A Validation Index for Arbitrary-Shaped Clusters", IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 10, no. 2, pp. 401-414, March-April 2013, doi:10.1109/TCBB.2013.32