Cluster Analysis for Gene Expression Data: A Survey
November 2004 (vol. 16 no. 11)
pp. 1370-1386
DNA microarray technology has made it possible to monitor the expression levels of thousands of genes simultaneously, both during important biological processes and across collections of related samples. Elucidating the patterns hidden in gene expression data offers a tremendous opportunity for an enhanced understanding of functional genomics. However, the large number of genes and the complexity of biological networks greatly increase the challenge of comprehending and interpreting the resulting mass of data, which often consists of millions of measurements. A first step toward addressing this challenge is the use of clustering techniques, which is essential in the data mining process to reveal natural structures and identify interesting patterns in the underlying data. Cluster analysis seeks to partition a given data set into groups based on specified features so that the data points within a group are more similar to each other than to the points in different groups. A very rich literature on cluster analysis has developed over the past three decades. Many conventional clustering algorithms have been adapted or directly applied to gene expression data, and new algorithms have recently been proposed specifically for gene expression data. These clustering algorithms have proven useful for identifying biologically relevant groups of genes and samples. In this paper, we first briefly introduce the concepts of microarray technology and discuss the basic elements of clustering on gene expression data. In particular, we divide cluster analysis for gene expression data into three categories. Then, we present specific challenges pertinent to each clustering category and introduce several representative approaches. We also discuss the problem of cluster validation in three aspects and review various methods to assess the quality and reliability of clustering results. Finally, we conclude this paper and suggest promising trends in this field.

Index Terms:
Microarray technology, gene expression data, clustering.
Daxin Jiang, Chun Tang, Aidong Zhang, "Cluster Analysis for Gene Expression Data: A Survey," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 11, pp. 1370-1386, Nov. 2004, doi:10.1109/TKDE.2004.68