IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 7, no. 1, January-March 2010

pp: 153-165

Sara C. Madeira, Universidade da Beira Interior, Covilhã, KDBIO Group, INESC-ID, Lisbon, and Lisbon Technical University, Lisboa

Miguel C. Teixeira, Lisbon Technical University, Lisboa

Isabel Sá-Correia, Lisbon Technical University, Lisboa

Arlindo L. Oliveira, KDBIO Group, INESC-ID, Lisbon, and Lisbon Technical University, Lisboa

DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TCBB.2008.34

ABSTRACT

Although most biclustering formulations are NP-hard, in time series expression data analysis it is reasonable to restrict the problem to the identification of maximal biclusters with contiguous columns, which correspond to coherent expression patterns shared by a group of genes in consecutive time points. This restriction leads to a tractable problem. We propose an algorithm, CCC-Biclustering, that finds and reports all maximal contiguous column coherent biclusters in time linear in the size of the expression matrix. The linear time complexity of CCC-Biclustering relies on the use of a discretized matrix and efficient string processing techniques based on suffix trees. We also propose a method for ranking biclusters based on their statistical significance and a methodology for filtering highly overlapping and, therefore, redundant biclusters. We report results on synthetic and real data showing the effectiveness of the approach and its relevance in the discovery of regulatory modules. Results obtained using the transcriptomic expression patterns occurring in Saccharomyces cerevisiae in response to heat stress show not only the ability of the proposed methodology to extract relevant information compatible with documented biological knowledge but also the utility of using this algorithm in the study of other environmental stresses and of regulatory modules in general.
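
To make the notion of a maximal bicluster with contiguous columns concrete, the following Python sketch enumerates such biclusters by brute force on a tiny discretized matrix. It is an illustration only and not the linear-time suffix-tree algorithm described in the abstract: the three-symbol alphabet (U, D, N), the `threshold` parameter, the `min_rows` cutoff, and the function names `discretize` and `maximal_ccc_biclusters` are all assumptions made for this example.

```python
from collections import defaultdict


def discretize(row, threshold=1.0):
    """Encode changes between consecutive time points as symbols.

    Assumed scheme for this sketch: U (up), D (down), N (no change).
    """
    symbols = []
    for prev, curr in zip(row, row[1:]):
        delta = curr - prev
        if delta > threshold:
            symbols.append("U")
        elif delta < -threshold:
            symbols.append("D")
        else:
            symbols.append("N")
    return "".join(symbols)


def maximal_ccc_biclusters(patterns, min_rows=2):
    """Brute-force enumeration (NOT linear time) of maximal biclusters
    with contiguous columns.

    patterns: list of equal-length strings, one discretized row per gene.
    Returns tuples (genes, start, end, pattern) with columns [start, end).
    """
    n_cols = len(patterns[0]) if patterns else 0
    found = []
    for start in range(n_cols):
        for end in range(start + 1, n_cols + 1):
            # Group genes sharing the same pattern in this column window;
            # each group is row-maximal for the window by construction.
            groups = defaultdict(list)
            for gene, pattern in enumerate(patterns):
                groups[pattern[start:end]].append(gene)
            for pattern, genes in groups.items():
                if len(genes) < min_rows:
                    continue
                # The window can grow if all genes agree on the adjacent column.
                grows_left = start > 0 and len(
                    {patterns[g][start - 1] for g in genes}) == 1
                grows_right = end < n_cols and len(
                    {patterns[g][end] for g in genes}) == 1
                if not grows_left and not grows_right:
                    found.append((tuple(genes), start, end, pattern))
    return found


# Hypothetical expression values for three genes over four time points.
expression = [
    [0, 2, 0, 0],
    [1, 3, 1, 3],
    [5, 5, 3, 3],
]
patterns = [discretize(row) for row in expression]
biclusters = maximal_ccc_biclusters(patterns)
```

For this toy input the discretized rows are `UDN`, `UDU`, and `NDN`, and the sketch reports three maximal biclusters: genes 0 and 1 sharing `UD` in the first two columns, all three genes sharing `D` in the middle column, and genes 0 and 2 sharing `DN` in the last two columns. The cost grows quadratically with the number of columns, which is the overhead that a suffix-tree formulation is designed to avoid.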

INDEX TERMS

Biclustering, time series gene expression data, expression patterns, regulatory modules.

CITATION

Sara C. Madeira, Miguel C. Teixeira, Isabel Sá-Correia, Arlindo L. Oliveira, "Identification of Regulatory Modules in Time Series Gene Expression Data Using a Linear Time Biclustering Algorithm",

*IEEE/ACM Transactions on Computational Biology and Bioinformatics*, vol. 7, no. 1, pp. 153-165, January-March 2010, doi:10.1109/TCBB.2008.34