Coclustering of Human Cancer Microarrays Using Minimum Sum-Squared Residue Coclustering
July-September 2008 (vol. 5 no. 3)
pp. 385-400
It is widely accepted in microarray analysis that identifying local patterns, characterized by coherent groups of genes and conditions, can shed light on previously undetectable cellular processes of genes as well as on macroscopic phenotypes of related samples. To cluster genes and conditions simultaneously, we previously developed a fast co-clustering algorithm, Minimum Sum-Squared Residue Co-clustering (MSSRCC), which employs an alternating minimization scheme and generates what we call co-clusters in a checkerboard structure. In this paper, we propose specific strategies that enable MSSRCC to escape poor local minima and to resolve the degeneracy problem in partitional clustering algorithms. The strategies are binormalization, deterministic spectral initialization, and incremental local search. We assess the effects of these strategies on both synthetic gene expression datasets and real human cancer microarrays, and provide empirical evidence that MSSRCC with the proposed strategies performs better than existing co-clustering and clustering algorithms. In particular, combining all three strategies leads to the best performance. Furthermore, we illustrate the coherence of the resulting co-clusters in a checkerboard structure, where genes in a co-cluster manifest the phenotype structure of the corresponding samples, and evaluate the enrichment of functional annotations in Gene Ontology (GO).
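
The abstract does not spell out the objective that MSSRCC minimizes, so the following is only a minimal sketch of what a sum-squared residue over a checkerboard partition could look like. It assumes the residue form h_ij = a_ij - a_iJ - a_Ij + a_IJ (entry minus its row mean and column mean within the co-cluster, plus the co-cluster mean), which is the form familiar from biclustering of expression data. The function name `sum_squared_residue` and the label arrays are hypothetical, and the sketch only evaluates the objective for a given partition; it does not perform the alternating minimization or any of the three proposed strategies.

```python
import numpy as np

def sum_squared_residue(A, row_labels, col_labels):
    """Sum of squared residues over all co-clusters of a checkerboard partition.

    Assumed residue (not stated in the abstract):
        h_ij = a_ij - a_iJ - a_Ij + a_IJ
    where a_iJ and a_Ij are the row and column means inside the co-cluster
    and a_IJ is the co-cluster mean.
    """
    A = np.asarray(A, dtype=float)
    row_labels = np.asarray(row_labels)
    col_labels = np.asarray(col_labels)
    total = 0.0
    for r in np.unique(row_labels):
        rows = np.where(row_labels == r)[0]
        for c in np.unique(col_labels):
            cols = np.where(col_labels == c)[0]
            # Extract one block of the checkerboard (one co-cluster).
            block = A[np.ix_(rows, cols)]
            row_means = block.mean(axis=1, keepdims=True)
            col_means = block.mean(axis=0, keepdims=True)
            residue = block - row_means - col_means + block.mean()
            total += float((residue ** 2).sum())
    return total

# A purely additive matrix (a_ij = r_i + c_j) has zero residue in every block.
additive = np.add.outer([0.0, 1.0, 5.0, 6.0], [0.0, 2.0, 10.0, 13.0])
perfect = sum_squared_residue(additive, [0, 0, 1, 1], [0, 0, 1, 1])

# A 2x2 identity matrix treated as a single co-cluster is not additive.
imperfect = sum_squared_residue(np.eye(2), [0, 0], [0, 0])
```

Under this assumed definition, a lower value indicates more coherent co-clusters, so candidate row and column assignments can be compared by evaluating the function on each.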

Index Terms:
microarray analysis, co-clustering, binormalization, deterministic spectral initialization, local search, Gene Ontology
Hyuk Cho, Inderjit S. Dhillon, "Coclustering of Human Cancer Microarrays Using Minimum Sum-Squared Residue Coclustering," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 5, no. 3, pp. 385-400, July-Sept. 2008, doi:10.1109/TCBB.2007.70268