Issue No.05 - Sept.-Oct. (2012 vol.9)
pp: 1504-1514
Jianyong Sun , Centre for Plant Integrative Biol. (CPIB), Univ. of Nottingham, Nottingham, UK
Jonathan M. Garibaldi , Sch. of Comput. Sci., Univ. of Nottingham, Nottingham, UK
Kim Kenobi , Centre for Plant Integrative Biol. (CPIB), Univ. of Nottingham, Nottingham, UK
ABSTRACT
Experimental scientific data sets, especially in biology, usually contain replicated measurements. Replicated measurements of the same object are correlated, and this correlation must be handled carefully in scientific analysis. In this paper, we propose a robust Bayesian mixture model for clustering data sets with replicated measurements. The model aims not only to cluster the data points accurately by taking the replicated measurements into account, but also to identify outliers (i.e., scattered objects) that may warrant further study. A tree-structured variational Bayes (VB) algorithm is developed to carry out model fitting. Experimental studies show that our model compares favorably with the infinite Gaussian mixture model while maintaining computational simplicity. We demonstrate the benefits of including the replicated measurements in the model, in terms of improved outlier detection rates under varying levels of measurement uncertainty. Finally, we apply the approach to clustering biological transcriptomics mRNA expression data sets with replicated measurements.
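The claim that replicates of the same object are correlated can be illustrated with a minimal sketch. It assumes a simple random-effects model that is not taken from the paper: each replicate equals a shared object-level value plus independent measurement noise. The function names and the variances `var_between` and `var_within` are hypothetical and chosen only for illustration.

```python
import numpy as np

def replicate_covariance(n_reps, var_between, var_within):
    """Covariance of n_reps replicates of one object.

    Assumed model (illustrative, not the paper's): x_r = m + e_r, where the
    object-level value m has variance var_between and each independent
    noise term e_r has variance var_within.
    """
    shared = var_between * np.ones((n_reps, n_reps))  # shared object-level part
    noise = var_within * np.eye(n_reps)               # independent replicate noise
    return shared + noise

def replicate_correlation(var_between, var_within):
    """Correlation between any two replicates of the same object."""
    return var_between / (var_between + var_within)

cov = replicate_covariance(3, var_between=4.0, var_within=1.0)
rho = replicate_correlation(var_between=4.0, var_within=1.0)
```

Under this assumed model the off-diagonal entries of `cov` are nonzero whenever `var_between` is positive, so treating replicates as independent data points would ignore that shared variation.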
INDEX TERMS
trees (mathematics), Bayes methods, biology computing, Gaussian processes, genetics, molecular biophysics, RNA, biological transcriptomics mRNA expression data sets, robust Bayesian clustering, replicated gene expression data, experimental scientific data sets, biology data, scientific analysis, robust Bayesian mixture model, tree-structured variational Bayes algorithm, infinite Gaussian mixture model, computational simplicity, Clustering algorithms, Robustness, Biological system modeling, Bayesian methods, Tin, Approximation methods, Data models, gene expression data, Replicated measurement, clustering, robust clustering, outlier detection, variational Bayes
CITATION
Jianyong Sun, Jonathan M. Garibaldi, Kim Kenobi, "Robust Bayesian Clustering for Replicated Gene Expression Data", IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol.9, no. 5, pp. 1504-1514, Sept.-Oct. 2012, doi:10.1109/TCBB.2012.85