Issue No.05 - September/October (2011 vol.8)
pp: 1153-1169
Dongxiao Zhu , University of New Orleans, New Orleans, Children's Hospital, New Orleans, and Tulane University Cancer Center, New Orleans
Lipi Acharya , University of New Orleans, New Orleans
Hui Zhang , Novartis Pharmaceutical Corporation, East Hanover
ABSTRACT
Estimation of pairwise correlation from incomplete and replicated molecular profiling data is a ubiquitous problem in pattern discovery analysis, such as clustering and networking. However, existing methods solve this problem by ad hoc data imputation, followed by average correlation coefficient type approaches, which might annihilate important patterns present in the molecular profiling data. Moreover, these approaches do not consider or exploit the underlying experimental design information that specifies the replication mechanisms. We develop an Expectation-Maximization (EM) type algorithm to estimate the correlation structure using incomplete and replicated molecular profiling data with an a priori known replication mechanism. The approach is sufficiently general to be applicable to any known replication mechanism. When the replication mechanism is unknown, it reduces to the parsimonious model introduced previously. The efficacy of our approach was first evaluated by comprehensively comparing it with various bivariate and multivariate imputation approaches using simulation studies. Results from real-world data analysis further confirmed the superior performance of the proposed approach over the commonly used approaches, where we assessed the robustness of the method using data sets with up to 30 percent missing values.
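To make the general idea of EM-based correlation estimation from incomplete data concrete, the following is a minimal sketch of the textbook EM iteration for a bivariate normal with values missing at random. It is not the algorithm proposed in this paper, which additionally models replicated measurements and the replication mechanism; the function names `em_correlation` and `simulate_pairs`, the use of `None` for a missing value, and all parameter defaults are assumptions made for illustration.

```python
import math
import random


def em_correlation(pairs, n_iter=500, tol=1e-12):
    """EM estimate of the correlation of a bivariate normal.

    `pairs` is a list of (x, y); either entry may be None (missing).
    Illustrative sketch only: no replicates, missingness assumed random.
    """
    # Rows with both entries missing carry no information here.
    rows = [(x, y) for x, y in pairs if x is not None or y is not None]
    n = len(rows)
    xs = [x for x, _ in rows if x is not None]
    ys = [y for _, y in rows if y is not None]
    both = [(x, y) for x, y in rows if x is not None and y is not None]

    # Initialise from the available values.
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    vx = sum((x - mx) ** 2 for x in xs) / len(xs)
    vy = sum((y - my) ** 2 for y in ys) / len(ys)
    cxy = sum((x - mx) * (y - my) for x, y in both) / len(both) if both else 0.0

    for _ in range(n_iter):
        # E-step: expected sufficient statistics given current parameters.
        sx = sy = sxx = syy = sxy = 0.0
        for x, y in rows:
            if x is not None and y is not None:
                ex, ey, exx, eyy, exy = x, y, x * x, y * y, x * y
            elif y is None:
                # y | x is normal: conditional mean and variance.
                ey = my + cxy / vx * (x - mx)
                cond_var = max(vy - cxy * cxy / vx, 0.0)
                ex, exx, eyy, exy = x, x * x, ey * ey + cond_var, x * ey
            else:
                # x | y is normal: conditional mean and variance.
                ex = mx + cxy / vy * (y - my)
                cond_var = max(vx - cxy * cxy / vy, 0.0)
                ey, eyy, exx, exy = y, y * y, ex * ex + cond_var, ex * y
            sx += ex
            sy += ey
            sxx += exx
            syy += eyy
            sxy += exy

        # M-step: maximum likelihood updates from the expected statistics.
        new = (sx / n, sy / n,
               sxx / n - (sx / n) ** 2,
               syy / n - (sy / n) ** 2,
               sxy / n - (sx / n) * (sy / n))
        delta = max(abs(a - b) for a, b in zip(new, (mx, my, vx, vy, cxy)))
        mx, my, vx, vy, cxy = new
        if delta < tol:
            break

    return cxy / math.sqrt(vx * vy)


def simulate_pairs(n, rho, missing_rate, seed=0):
    """Draw n standard bivariate normal pairs with correlation rho and
    blank out each entry independently with probability missing_rate."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = rho * x + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        pairs.append((None if rng.random() < missing_rate else x,
                      None if rng.random() < missing_rate else y))
    return pairs
```

For example, `em_correlation(simulate_pairs(2000, 0.8, 0.3))` estimates the correlation from simulated data in which roughly 30 percent of the entries are missing. Unlike imputation followed by a standard correlation coefficient, the E-step carries the conditional variance of each missing value into the second moments, so the uncertainty of the missing entries is not ignored.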
INDEX TERMS
Replicated data, pairwise correlation, pattern recognition, unsupervised learning, missing value.
CITATION
Dongxiao Zhu, Lipi Acharya, Hui Zhang, "A Generalized Multivariate Approach to Pattern Discovery from Replicated and Incomplete Genome-Wide Measurements", IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol.8, no. 5, pp. 1153-1169, September/October 2011, doi:10.1109/TCBB.2010.102