Issue No. 01 - January-March 2009 (vol. 6)
pp. 144-157
ABSTRACT
Clustering datasets is a challenging problem that arises in a wide array of applications. Partition-optimization approaches, such as the k-means and expectation-maximization (EM) algorithms, are sub-optimal in that they find solutions only in the vicinity of their initialization. This paper proposes a staged approach to specifying initial values: a large number of local modes is found first, and representatives are then obtained from the most separated of these modes. Results on test experiments are excellent. We also provide a detailed comparative assessment of the suggested algorithm against many commonly used initialization approaches in the literature. Finally, the methodology is applied to two datasets, on diurnal microarray gene expressions and on industrial releases of mercury.
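To make the staged idea in the abstract concrete, the following is a minimal illustrative sketch in Python. It is not the paper's algorithm, only a rough rendering of the general recipe, and every name in it (staged_init, n_probes, probe_iters) is hypothetical: many short k-means probes collect a pool of local modes, and a maximally separated subset of those modes is kept as the initial centers handed to the final partition-optimization run.

# Illustrative sketch only -- not the procedure of Maitra (2009).
import numpy as np

def staged_init(X, n_clusters, n_probes=50, probe_iters=10, seed=None):
    """Return n_clusters well-separated candidate centers for k-means/EM."""
    rng = np.random.default_rng(seed)
    n, _ = X.shape

    # Stage 1: gather a large pool of local modes from short k-means probes,
    # each started from a random subset of the observations.
    modes = []
    for _ in range(n_probes):
        centers = X[rng.choice(n, size=n_clusters, replace=False)].copy()
        for _ in range(probe_iters):  # a few Lloyd iterations
            dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = dist.argmin(axis=1)
            for k in range(n_clusters):
                if np.any(labels == k):
                    centers[k] = X[labels == k].mean(axis=0)
        modes.append(centers)
    modes = np.vstack(modes)

    # Stage 2: keep the most separated representatives in the pool
    # via greedy farthest-point selection.
    chosen = [modes[rng.integers(len(modes))]]
    for _ in range(n_clusters - 1):
        nearest = np.min([((modes - c) ** 2).sum(axis=1) for c in chosen], axis=0)
        chosen.append(modes[nearest.argmax()])
    return np.array(chosen)

# Usage: the result replaces random starting values for k-means or EM.
if __name__ == "__main__":
    X = np.vstack([np.random.randn(100, 2) + mu for mu in ([0, 0], [5, 5], [0, 6])])
    print(staged_init(X, n_clusters=3))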
INDEX TERMS
Clustering, classification, and association rules, Statistical methods, Singular value decomposition, Multivariate statistics
CITATION
Ranjan Maitra, "Initializing Partition-Optimization Algorithms," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 6, no. 1, pp. 144-157, January-March 2009, doi:10.1109/TCBB.2007.70244
REFERENCES
[1] J.D. Banfield and A.E. Raftery, “Model-Based Gaussian and Non-Gaussian Clustering,” Biometrics, vol. 49, pp. 803-821, 1993.
[2] G. Brossier, “Piece-Wise Hierarchical Clustering,” J. Classification, vol. 7, pp. 197-216, 1990.
[3] G. Celeux and G. Govaert, “Gaussian Parsimonious Clustering Models,” Pattern Recognition, vol. 28, pp. 781-793, 1995.
[4] W.F. Eddy, A. Mockus, and S. Oue, “Approximate Single Linkage Cluster Analysis of Large Datasets in High-Dimensional Spaces,” Computational Statistics and Data Analysis, vol. 23, pp. 29-43, 1996.
[5] B.S. Everitt, “A Finite Mixture Model for the Clustering of Mixed-Mode Data,” Statistics and Probability Letters, vol. 6, pp. 305-309, 1988.
[6] I.J. Good, “The Clustering of Random Variables,” J. Statistical Computation and Simulation, vol. 9, pp. 241-248, 1979.
[7] J. Hartigan, “Statistical Theory in Clustering,” J. Classification, vol. 2, pp. 63-76, 1985.
[8] J.R. Kettenring, “The Practice of Cluster Analysis,” J. Classification, vol. 23, pp. 3-30, 2006.
[9] F. Murtagh, Multi-Dimensional Clustering Algorithms. Springer-Verlag, 1985.
[10] D.B. Ramey, “Nonparametric Clustering Techniques,” Encyclopedia of Statistical Science, vol. 6, pp. 318-319, Wiley, 1985.
[11] B.D. Ripley, “Classification and Clustering in Spatial and Image Data,” Analyzing and Modeling Data and Knowledge, pp. 85-91, Springer-Verlag, 1991.
[12] M.J. Symons, “Clustering Criteria and Multivariate Normal Mixtures,” Biometrics, vol. 37, pp. 35-43, 1981.
[13] R.J. Tibshirani and G. Walther, “Cluster Validation by Prediction Strength,” J. Computational and Graphical Statistics, vol. 14, no. 3, pp. 511-528, 2005.
[14] H.P. Friedman and J. Rubin, “On Some Invariant Criteria for Grouping Data,” J. Am. Statistical Assoc., vol. 62, pp. 1159-1178, 1967.
[15] A.J. Scott and M.J. Symons, “Clustering Methods Based on Likelihood Ratio Criteria,” Biometrics, vol. 27, pp. 387-397, 1971.
[16] W.D. Fisher, “On Grouping for Maximum Homogeneity,” J. Am. Statistical Assoc., vol. 53, pp. 789-798, 1958.
[17] L. Kaufman and P.J. Rousseeuw, Finding Groups in Data. John Wiley and Sons, 1990.
[18] D. Steinley, “Local Optima in k-Means Clustering: What You Don't Know May Hurt You,” Psychological Methods, vol. 8, pp. 294-304, 2003.
[19] S.C. Zeeman, A. Tiessen, E. Pilling, K.L. Kato, A.M. Donald, and A.M. Smith, “Starch Synthesis in Arabidopsis. Granule Synthesis, Composition, and Structure,” Plant Physiology, vol. 129, pp. 516-529, 2002.
[20] A.M. Smith, S.C. Zeeman, D. Thorneycroft, and S.M. Smith, “Starch Mobilization in Leaves,” J. Experimental Botany, vol. 54, pp. 577-583, 2003.
[21] S.M. Smith, D.C. Fulton, T. Chia, D. Thorneycroft, A. Chapple, H. Dunstan, C. Hylton, S.C. Zeeman, and A.M. Smith, “Diurnal Changes in the Transcriptome Encoding Enzymes of Starch Metabolism Provide Evidence for Both Transcriptional and Posttranscriptional Regulation of Starch Metabolism in Arabidopsis Leaves,” Plant Physiology, vol. 136, pp. 2687-2699, 2004.
[22] P. Grandjean, P. Weihe, R.F. White, F. Debes, S. Araki, K. Yokoyama, K. Murata, N. Sörensen, R. Dahl, and P.J. Jorgensen, “Cognitive Deficit in 7-Year-Old Children with Prenatal Exposure to Methylmercury,” Neurotoxicology and Teratology, vol. 19, no. 6, pp. 417-428, 1997.
[23] T. Kjellstrom, P. Kennedy, S. Wallis, and C. Mantell, Physical and Mental Development of Children with Prenatal Exposure to Mercury from Fish. Stage 1: Preliminary Tests at Age 4. Swedish Nat'l Environmental Protection Board, 1986.
[24] N. Sörensen, K. Murata, E. Budtz-Jørgensen, P. Weihe, and P. Grandjean, “Prenatal Methylmercury Exposure as a Cardiovascular Risk Factor at Seven Years of Age,” Epidemiology, vol. 10, no. 4, pp. 370-375, 1999.
[25] Toxicological Effects of Methylmercury, Nat'l Academy of Sciences, Nat'l Academy Press, 2000.
[26] America's Children and the Environment: Measures of Contaminants, Body Burdens, and Illnesses. US Environmental Protection Agency, 2003.
[27] W.F. Fitzgerald, D.R. Engstrom, R.P. Mason, and E.A. Nater, “The Case for Atmospheric Mercury Contamination in Remote Areas,” Environmental Science and Technology, vol. 32, pp. 1-7, 1998.
[28] R: A Language and Environment for Statistical Computing, R Development Core Team, R Foundation for Statistical Computing, ISBN 3-900051-07-0, http://www.R-project.org, 2006.
[29] J.A. Lozano, J.M. Peña, and P. Larrañaga, “An Empirical Comparison of Four Initialization Methods for the K-Means Algorithm,” Pattern Recognition Letters, vol. 20, pp. 1027-1040, 1999.
[30] P.S. Bradley and U.M. Fayyad, “Refining Initial Points for K-Means Clustering,” Proc. 15th Int'l Conf. Machine Learning (ICML '98), pp. 91-99, 1998.
[31] U. Fayyad, C. Reina, and P. Bradley, “Initialization of Iterative Refinement Clustering Algorithms,” Proc. Fourth Int'l Conf. Knowledge Discovery and Data Mining (KDD '98), pp. 194-198, 1998.
[32] C. Fraley and A.E. Raftery, “How Many Clusters? Which Cluster Method? Answers via Model-Based Cluster Analysis,” Computer J., vol. 41, pp. 578-588, 1998.
[33] G.C. Tseng and W.H. Wong, “Tight Clustering: A Resampling-Based Approach for Identifying Stable and Tight Patterns in Data,” Biometrics, vol. 61, pp. 10-16, 2005.
[34] M.B. Al-Daoud, “A New Algorithm for Cluster Initialization,” Trans. Eng., Computing, and Technology, vol. V4, pp. 74-76, 2005.
[35] C. Biernacki, G. Celeux, and G. Govaert, “Choosing Starting Values for the EM Algorithm for Getting the Highest Likelihood in Multivariate Gaussian Mixture Models,” Computational Statistics and Data Analysis, vol. 41, pp. 561-575, 2003.
[36] J.W. Demmel, Applied Numerical Linear Algebra. SIAM, 1997.
[37] J. Friedman and N. Fisher, “Bump-Hunting in High-Dimensional Data,” Statistics and Computing, vol. 9, no. 2, pp. 1-20, 1999.
[38] L. Hubert and P. Arabie, “Comparing Partitions,” J. Classification, vol. 2, pp. 193-218, 1985.
[39] W.M. Rand, “Objective Criteria for the Evaluation of Clustering Methods,” J. Am. Statistical Assoc., vol. 66, pp. 846-850, 1971.
[40] D. Steinley and M.J. Brusco, “Initializing K-Means Batch Clustering: A Critical Evaluation of Several Techniques,” J. Classification, vol. 24, pp. 99-121, 2007.
[41] J.A. Hartigan and M.A. Wong, “A K-Means Clustering Algorithm,” Applied Statistics, vol. 28, pp. 100-108, 1979.
[42] F.H. Marriott, “Practical Problems in a Method of Cluster Analysis,” Biometrics, vol. 27, pp. 501-514, 1971.
[43] G. Schwarz, “Estimating the Dimension of a Model,” Annals of Statistics, vol. 6, pp. 461-464, 1978.
[44] S. Dasgupta, “Learning Mixtures of Gaussians,” Proc. IEEE Symp. Foundations of Computer Science (FOCS '99), pp. 633-644, 1999.
[45] D.J. Newman, S. Hettich, C.L. Blake, and C.J. Merz, “UCI Repository of Machine Learning Databases,” http://www.ics.uci.edu/~mlearn/MLRepository.html, 1998.
[46] K. Nakai and M. Kanehisa, “Expert System for Predicting Protein Localization Sites in Gram-Negative Bacteria,” PROTEINS: Structure, Function, and Genetics, vol. 11, pp. 95-110, 1991.
[47] R. Maitra, “A Statistical Perspective to Data Mining,” J. Indian Soc. Probability and Statistics, vol. 6, pp. 28-77, 2002.
[48] P. Horton and K. Nakai, “A Probabilistic Classification System for Predicting the Cellular Localization Sites of Proteins,” Proc. Intelligent Systems for Molecular Biology (ISMB '96), pp. 109-115, 1996.
[49] Y. Benjamini and Y. Hochberg, “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing,” J. Royal Statistical Soc., vol. 57, pp. 289-300, 1995.
[50] R.J. Tibshirani, G. Walther, and T.J. Hastie, “Estimating the Number of Clusters in a Dataset via the Gap Statistic,” J. Royal Statistical Soc., vol. 63, no. 2, pp. 411-423, 2001.
[51] S.H. Lin, “Statistical Behavior of Rain Attenuation,” Bell System Technical J., vol. 52, no. 4, pp. 557-581, 1973.