Efficiently Mining Gene Expression Data via a Novel Parameterless Clustering Method
October-December 2005 (vol. 2 no. 4)
pp. 355-365

Abstract: Clustering analysis has long been an important research topic in machine learning owing to its wide range of applications. In recent years, it has also become a valuable tool for in-silico analysis of microarray or gene expression data. Although a number of clustering methods have been proposed, they have difficulty meeting the requirements of automation, high quality, and high efficiency at the same time. In this paper, we propose a novel, parameterless, and efficient clustering algorithm, the Correlation Search Technique (CST), which is well suited to the analysis of gene expression data. The unique feature of CST is that it incorporates validation techniques into the clustering process so that high-quality clustering results can be produced on the fly. Experimental evaluation shows that CST substantially outperforms other clustering methods in terms of clustering quality, efficiency, and automation on both synthetic and real data sets.
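
The abstract does not describe the internals of CST, so the following is not the authors' algorithm. It is only a minimal sketch of the general idea of folding validation into the clustering step so that the user supplies no parameter. The assumptions are ours: Pearson correlation as the similarity, connected components of a thresholded similarity graph as the clusters, and a Hubert-style index (the correlation between pairwise similarity and cluster co-membership) as the validity score. All function names and the toy profiles are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def components(n, sim, threshold):
    """Label connected components of the graph linking pairs with sim >= threshold."""
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for (i, j), s in sim.items():
        if s >= threshold:
            parent[find(i)] = find(j)
    return [find(i) for i in range(n)]


def validity(sim, labels):
    """Assumed validity index: correlation between similarity and co-membership."""
    pairs = sorted(sim)
    similarities = [sim[p] for p in pairs]
    together = [1.0 if labels[i] == labels[j] else 0.0 for i, j in pairs]
    return pearson(similarities, together)


def cluster_without_parameters(profiles):
    """Try every observed similarity as a threshold and keep the best-validated result."""
    n = len(profiles)
    sim = {(i, j): pearson(profiles[i], profiles[j])
           for i, j in combinations(range(n), 2)}
    best_labels, best_score = list(range(n)), float("-inf")
    for threshold in sorted(set(sim.values())):
        labels = components(n, sim, threshold)
        score = validity(sim, labels)
        if score > best_score:
            best_labels, best_score = labels, score
    return best_labels, best_score


# Toy expression profiles (made up): three rising genes, three falling genes.
profiles = [
    [1, 2, 3, 4],
    [2, 4, 6, 8.5],
    [1, 3, 4, 6],
    [4, 3, 2, 1],
    [8, 6, 4, 1.5],
    [6, 4, 3, 1],
]
labels, score = cluster_without_parameters(profiles)
print(labels, round(score, 3))
```

Because the candidate thresholds are taken from the data and the validity index decides among them, nothing has to be tuned by hand. The exhaustive scan is chosen for clarity, not efficiency.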

Index Terms:
Machine learning, data mining, clustering, mining methods and algorithms.
Vincent S. Tseng, Ching-Pin Kao, "Efficiently Mining Gene Expression Data via a Novel Parameterless Clustering Method," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 2, no. 4, pp. 355-365, Oct.-Dec. 2005, doi:10.1109/TCBB.2005.56