Issue No.04 - July-Aug. (2013 vol.10)
pp: 1080-1085
Tianwei Yu , Dept. of Biostat. & Bioinf., Emory Univ., Atlanta, GA, USA
Hesen Peng , Dept. of Biostat. & Bioinf., Emory Univ., Atlanta, GA, USA
ABSTRACT
High-throughput expression technologies, such as gene expression arrays and liquid chromatography-mass spectrometry (LC-MS), measure thousands of features (genes or metabolites) on a continuous scale. In such data, both linear and nonlinear relations exist between features. Nonlinear relations can reflect critical regulation patterns in the biological system, but traditional clustering methods based on linear associations neither identify nor use them. Clustering based on general dependences, i.e., both linear and nonlinear relations, is hampered by the high dimensionality and high noise level of the data. We developed a sensitive nonparametric measure of general dependence between (groups of) random variables in high dimensions. Based on this dependence measure, we developed a hierarchical clustering method. In simulation studies, the method outperformed correlation- and mutual information (MI)-based hierarchical clustering methods in clustering features with nonlinear dependences. We applied the method to a microarray data set measuring gene expression in a cell-cycle time series to show that it generates biologically relevant results. The R code is available at http://userwww.service.emory.edu/~tyu8/GDHC.
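To illustrate why clustering on linear association alone can miss nonlinear relations, the following minimal Python sketch compares Pearson correlation with a general dependence measure on a purely quadratic relation. It is not the authors' method: distance correlation is used here only as a well-known stand-in for a general dependence measure, and the function names (`pearson`, `distance_correlation`) and the toy data are assumptions made for this illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient: captures linear association only."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def _centered_distances(v):
    """Pairwise absolute distances, double-centered by row, column, and grand mean."""
    n = len(v)
    d = [[abs(a - b) for b in v] for a in v]
    row_mean = [sum(row) / n for row in d]  # equals the column means (d is symmetric)
    grand = sum(row_mean) / n
    return [[d[i][j] - row_mean[i] - row_mean[j] + grand for j in range(n)]
            for i in range(n)]


def distance_correlation(x, y):
    """Sample distance correlation (biased V-statistic form), a value in [0, 1]."""
    n = len(x)
    a, b = _centered_distances(x), _centered_distances(y)

    def mean_prod(p, q):
        return sum(p[i][j] * q[i][j] for i in range(n) for j in range(n)) / (n * n)

    dcov2 = max(mean_prod(a, b), 0.0)  # guard against tiny negative rounding error
    dvar = math.sqrt(mean_prod(a, a) * mean_prod(b, b))
    return math.sqrt(dcov2 / dvar)


# Toy data (assumed for illustration): y depends on x exactly, but not linearly.
x = [(i - 20) / 20 for i in range(41)]  # symmetric grid on [-1, 1]
y = [v * v for v in x]

r = pearson(x, y)
d = distance_correlation(x, y)
print(f"Pearson correlation:  {r:.3f}")
print(f"Distance correlation: {d:.3f}")
```

Because the grid is symmetric about zero, the Pearson correlation is essentially zero even though y is a deterministic function of x, while the distance correlation is clearly positive. A dissimilarity such as 1 minus the dependence value could then be passed to a standard agglomerative routine (for example `hclust` in R or `scipy.cluster.hierarchy.linkage` in Python); the measure and clustering procedure described in the abstract differ from this sketch.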
INDEX TERMS
statistical analysis, bioinformatics, cellular biophysics, genetics, lab-on-a-chip, cell-cycle time series, high-throughput expression data, general dependence sensitive nonparametric measure, high-throughput expression technology, gene expression array, liquid chromatography-mass spectrometry, LC-MS method, metabolite, feature linear relation, feature nonlinear relation, biological system critical regulation pattern, linear association, data high dimensionality effect, data high noise level effect, high dimension random variable, simulation study, correlation-based hierarchical clustering method, mutual information-based hierarchical clustering method, feature nonlinear dependence clustering, microarray data set, gene expression measurement, Noise, Vectors, Couplings, Clustering methods, Random variables, Standards, Bioinformatics, similarity measures, Algorithms, clustering
CITATION
Tianwei Yu, Hesen Peng, "Hierarchical Clustering of High-Throughput Expression Data Based on General Dependences", IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 10, no. 4, pp. 1080-1085, July-Aug. 2013, doi:10.1109/TCBB.2013.99