Issue No.03 - May/June (2011 vol.8)
pp: 808-818
Banu Dost, University of California, San Diego, La Jolla
Chunlei Wu, The Genomics Institute of the Novartis Research Foundation, La Jolla
Andrew Su, The Genomics Institute of the Novartis Research Foundation, La Jolla
Vineet Bafna, University of California, San Diego, La Jolla
Genes with a common function are often hypothesized to have correlated expression levels in mRNA expression data, motivating the development of clustering algorithms for gene expression data sets. We observe that existing approaches do not scale well to large data sets, and indeed did not converge on the data set considered here. We present a novel clustering method, TCLUST, that exploits coconnectedness to efficiently cluster large, sparse expression data. We compare our approach with two existing clustering methods, CAST and K-means, which have previously been applied to clustering of gene-expression data with good performance. On a number of metrics, TCLUST is shown to be superior to or at least competitive with the other methods, while being much faster. We have applied this clustering algorithm to a genome-scale gene-expression data set and used gene set enrichment analysis to discover highly significant biological clusters. (Source code for TCLUST is downloadable at
Microarray expression, clustering, graph algorithms, coconnectedness.
Banu Dost, Chunlei Wu, Andrew Su, Vineet Bafna, "TCLUST: A Fast Method for Clustering Genome-Scale Expression Data", IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol.8, no. 3, pp. 808-818, May/June 2011, doi:10.1109/TCBB.2010.34
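The abstract names coconnectedness without defining it on this page. As a rough illustration only, and not the authors' TCLUST algorithm, the sketch below assumes one plausible reading: genes are joined in a sparse graph when their expression profiles correlate strongly, and two adjacent genes are grouped when they share enough neighbors. The function names, the thresholds `min_corr` and `min_shared`, and the toy profiles are all hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a flat profile carries no correlation signal
    return cov / sqrt(vx * vy)


def similarity_graph(profiles, min_corr):
    """Sparse graph: connect genes whose correlation is at least min_corr."""
    neighbors = {g: set() for g in profiles}
    for g, h in combinations(profiles, 2):
        if pearson(profiles[g], profiles[h]) >= min_corr:
            neighbors[g].add(h)
            neighbors[h].add(g)
    return neighbors


def shared_neighbor_clusters(neighbors, min_shared):
    """Merge adjacent genes whose closed neighborhoods overlap enough.

    ASSUMPTION: "coconnectedness" is approximated here by the size of the
    shared closed neighborhood; the paper's actual definition may differ.
    """
    parent = {g: g for g in neighbors}

    def find(g):
        # Union-find lookup with path halving.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for g in neighbors:
        for h in neighbors[g]:
            shared = (neighbors[g] | {g}) & (neighbors[h] | {h})
            if len(shared) >= min_shared:
                parent[find(g)] = find(h)

    clusters = {}
    for g in neighbors:
        clusters.setdefault(find(g), set()).add(g)
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: c[0])


# Hypothetical toy data: three co-rising genes, two co-falling genes, one outlier.
profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.5],
    "g3": [1.1, 2.0, 3.2, 3.9],
    "g4": [4.0, 3.0, 2.0, 1.0],
    "g5": [8.0, 6.1, 4.0, 2.0],
    "g6": [1.0, 5.0, 1.0, 5.0],
}
graph = similarity_graph(profiles, min_corr=0.9)
loose = shared_neighbor_clusters(graph, min_shared=2)
strict = shared_neighbor_clusters(graph, min_shared=3)
```

With the looser threshold the pair `g4`, `g5` forms its own cluster; with the stricter one only the three mutually connected genes stay together. The all-pairs correlation loop is quadratic in the number of genes, so a genome-scale data set would need a faster way to build the sparse graph than this sketch uses.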