Biclustering Algorithms for Biological Data Analysis: A Survey
January-March 2004 (vol. 1 no. 1)
pp. 24-45

Abstract: A large number of clustering approaches have been proposed for the analysis of gene expression data obtained from microarray experiments. However, the results obtained by applying standard clustering methods to genes are limited. This limitation is imposed by the existence of a number of experimental conditions where the activity of genes is uncorrelated. A similar limitation exists when clustering of conditions is performed. For this reason, a number of algorithms that perform simultaneous clustering on the row and column dimensions of the data matrix have been proposed. The goal is to find submatrices, that is, subgroups of genes and subgroups of conditions, where the genes exhibit highly correlated activities for every condition. In this paper, we refer to this class of algorithms as biclustering. Biclustering is also referred to in the literature as coclustering and direct clustering, among other names, and has also been used in fields such as information retrieval and data mining. In this comprehensive survey, we analyze a large number of existing approaches to biclustering and classify them according to the type of biclusters they can find, the patterns of biclusters that are discovered, the methods used to perform the search, the approaches used to evaluate the solution, and the target applications.
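
The motivation in the abstract can be illustrated with a minimal Python sketch. It is not taken from the survey: the toy expression matrix, the choice of Pearson correlation as the coherence measure, and the helper names `pearson` and `min_pairwise_correlation` are all assumptions made for illustration. The sketch compares how well three hypothetical genes agree across all conditions with how well they agree on a subset of conditions.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def min_pairwise_correlation(matrix, rows, cols):
    """Smallest Pearson correlation between any two selected rows,
    computed only over the selected columns (illustrative helper)."""
    profiles = [[matrix[r][c] for c in cols] for r in rows]
    return min(pearson(p, q) for p, q in combinations(profiles, 2))


# Hypothetical expression matrix: rows are genes, columns are conditions.
expression = [
    [1.0, 2.0, 3.0, 5.0, 1.0, 4.0],  # gene 0
    [2.0, 4.0, 6.0, 1.0, 5.0, 2.0],  # gene 1
    [1.5, 2.5, 3.5, 2.0, 6.0, 1.0],  # gene 2
    [4.0, 1.0, 3.0, 2.0, 2.0, 5.0],  # gene 3 (outside the candidate bicluster)
]

all_conditions = range(6)
bicluster_genes = [0, 1, 2]
bicluster_conditions = [0, 1, 2]

# Coherence of the same genes over every condition vs. over the subset.
r_all = min_pairwise_correlation(expression, bicluster_genes, all_conditions)
r_sub = min_pairwise_correlation(expression, bicluster_genes, bicluster_conditions)
print(f"all conditions:       {r_all:.2f}")
print(f"bicluster conditions: {r_sub:.2f}")
```

In this toy data the three genes move together on the first three conditions but not on the remaining ones, so a method that compares genes across every condition would likely fail to group them, while the submatrix defined by `bicluster_genes` and `bicluster_conditions` is coherent. A real biclustering algorithm has to search for such row and column subsets rather than being handed them.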

Index Terms:
Biclustering, simultaneous clustering, coclustering, subspace clustering, bidimensional clustering, direct clustering, block clustering, two-way clustering, two-mode clustering, two-sided clustering, microarray data analysis, biological data analysis, gene expression data.
Citation:
Sara C. Madeira, Arlindo L. Oliveira, "Biclustering Algorithms for Biological Data Analysis: A Survey," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 1, no. 1, pp. 24-45, Jan.-March 2004, doi:10.1109/TCBB.2004.2