Exploiting the Functional and Taxonomic Structure of Genomic Data by Probabilistic Topic Modeling
July-Aug. 2012 (vol. 9 no. 4)
pp. 980-991
T. Y. Lim, Coll. of Comput. & Inf. Eng., Henan Univ., Kaifeng, China
Xiaohua Hu, Coll. of Inf. Sci. & Technol., Drexel Univ., Philadelphia, PA, USA
Xin Chen, Coll. of Inf. Sci. & Technol., Drexel Univ., Philadelphia, PA, USA
Xiajiong Shen, California State Univ. - Chico, Chico, CA, USA
E. K. Park, Dept. of Electr. & Comput. Eng., Drexel Univ., Philadelphia, PA, USA
G. L. Rosen, Dept. of Electr. & Comput. Eng., Drexel Univ., Philadelphia, PA, USA
In this paper, we present a method that enables both the homology-based approach and the composition-based approach to further study the functional core (i.e., the microbial core and the gene core, respectively). In the proposed method, major functional groups are identified by generative topic modeling, which can extract useful information from unlabeled data. We first show that a generative topic model can be used to model the taxon abundance information obtained by the homology-based approach and to study the microbial core. The model treats each sample as a "document" that contains a mixture of functional groups, while each functional group (also known as a "latent topic") is a weighted mixture of species. Estimating the generative topic model for taxon abundance data therefore uncovers the distribution over latent functions (latent topics) in each sample. Second, we show that a generative topic model can also be used to study the genome-level composition of "N-mer" features (DNA subreads obtained by composition-based approaches). The model treats each genome as a mixture of latent genetic patterns (latent topics), while each genetic pattern is a weighted mixture of the "N-mer" features; the existence of core genomes can thus be indicated by a set of common N-mer features. After studying the mutual information between latent topics and gene regions, we provide an explanation of the functional roles of the uncovered latent genetic patterns. The experimental results demonstrate the effectiveness of the proposed method.
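The two "document" views described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes a plain latent Dirichlet allocation model fitted by collapsed Gibbs sampling, and the function names (`nmer_counts`, `fit_lda`), the hyperparameters (`alpha`, `beta`, `iters`), and the toy abundance counts are all hypothetical choices made for illustration.

```python
import random
from collections import Counter


def nmer_counts(sequence, n=3):
    """Count overlapping N-mers in a DNA string (the 'words' of a genome 'document')."""
    seq = sequence.upper()
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def fit_lda(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Fit LDA by collapsed Gibbs sampling (an assumed inference method).

    docs: list of documents, each a list of integer word ids.
    Returns (theta, phi): per-document topic mixtures and per-topic word mixtures.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []
    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_d.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_d)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample its topic from the collapsed conditional distribution.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    theta = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(c + beta) / (topic_total[k] + vocab_size * beta) for c in topic_word[k]]
        for k in range(num_topics)
    ]
    return theta, phi


# Hypothetical toy taxon abundance counts (invented for illustration, not from the paper).
taxa = ["Bacteroides", "Prevotella", "Escherichia", "Lactobacillus"]
samples = [
    {"Bacteroides": 8, "Prevotella": 6},
    {"Bacteroides": 7, "Prevotella": 5, "Escherichia": 1},
    {"Escherichia": 9, "Lactobacillus": 6},
]
taxon_id = {name: i for i, name in enumerate(taxa)}
# Each sample becomes a "document" whose "words" are taxon ids repeated by abundance.
docs = [[taxon_id[t] for t, c in s.items() for _ in range(c)] for s in samples]
theta, phi = fit_lda(docs, num_topics=2, vocab_size=len(taxa))
```

Here `theta[d]` plays the role of the distribution over latent functional groups in sample `d`, and `phi[k]` the weighted mixture of species for group `k`. The genome-level view follows the same pattern: map each key of `nmer_counts(genome)` to an integer id and pass the resulting token lists to `fit_lda`.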


Index Terms:
probability, bioinformatics, data mining, DNA, genetics, genomics, molecular biophysics, functional structure, taxonomic structure, genomic data, probabilistic topic modeling, homology-based approach, composition-based approach, microbial core, gene core, generative topic modeling, taxon abundance information, genome-level composition, DNA subreads, latent genetic patterns, strain, databases, data models, integrated circuit modeling, metagenomics, bioinformatics (genome or protein) databases, language models
T. Y. Lim, Xiaohua Hu, Xin Chen, Xiajiong Shen, E. K. Park, G. L. Rosen, "Exploiting the Functional and Taxonomic Structure of Genomic Data by Probabilistic Topic Modeling," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 9, no. 4, pp. 980-991, July-Aug. 2012, doi:10.1109/TCBB.2011.113