IEEE/ACM Transactions on Computational Biology and Bioinformatics

Issue No.01 - January-March (2009 vol.6)

pp: 144-157

ABSTRACT

Clustering datasets is a challenging problem that arises in a wide array of applications. Partition-optimization approaches, such as k-means or expectation-maximization (EM) algorithms, are sub-optimal: they find solutions in the vicinity of their initialization. This paper proposes a staged approach to specifying initial values by finding a large number of local modes and then obtaining representatives from the most separated ones. Results on test experiments are excellent. We also provide a detailed comparative assessment of the suggested algorithm against many initialization approaches commonly used in the literature. Finally, the methodology is applied to two datasets, one on diurnal microarray gene expressions and one on industrial releases of mercury.
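The staged idea in the abstract (find many local modes, then keep representatives of the most separated ones) can be illustrated with a minimal Python sketch. This is not the paper's procedure, whose details the abstract does not give. It assumes that local modes can be approximated by running k-means from more starting points than the number of clusters wanted, and that "most separated" can be approximated by a greedy maximin selection. All function names (`lloyd`, `most_separated`, `staged_init`) and the strided choice of starting points are hypothetical choices made for illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def lloyd(data, centers, iters=50):
    """Plain k-means (Lloyd) iterations from the given starting centers."""
    centers = list(centers)
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in data:
            # Assign each point to its nearest current center.
            j = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
            groups[j].append(p)
        # Recompute each center; an empty group keeps its old center.
        new = [mean(g) if g else c for g, c in zip(groups, centers)]
        if new == centers:
            break
        centers = new
    return centers


def most_separated(candidates, k):
    """Greedy maximin: repeatedly add the candidate farthest from those chosen.

    Assumption: the first candidate is an arbitrary starting choice.
    """
    chosen = [candidates[0]]
    while len(chosen) < k:
        best = max(candidates, key=lambda c: min(sq_dist(c, s) for s in chosen))
        chosen.append(best)
    return chosen


def staged_init(data, k, n_modes):
    """Stage 1: find n_modes candidate modes; stage 2: keep k separated ones."""
    # Assumption: evenly strided data points serve as the many starting values.
    step = max(1, len(data) // n_modes)
    seeds = data[::step][:n_modes]
    modes = lloyd(data, seeds)
    return most_separated(modes, k)
```

The centers returned by `staged_init(data, k, n_modes)` would then be passed as the starting values to `lloyd(data, centers)`, or to any other partition-optimization routine that accepts initial centers.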

INDEX TERMS

Clustering, classification, and association rules, Statistical methods, Singular value decomposition, Multivariate statistics

CITATION

Ranjan Maitra, "Initializing Partition-Optimization Algorithms",

*IEEE/ACM Transactions on Computational Biology and Bioinformatics*, vol. 6, no. 1, pp. 144-157, January-March 2009, doi:10.1109/TCBB.2007.70244