Predicting Novel Human Gene Ontology Annotations Using Semantic Analysis
January-March 2010 (vol. 7 no. 1)
pp. 91-99
Bogdan Done, Wayne State University, Detroit
Purvesh Khatri, Wayne State University, Detroit
Arina Done, Wayne State University, Detroit
Sorin Drăghici, Wayne State University, Detroit
The correct interpretation of many molecular biology experiments depends critically on the accuracy and consistency of the existing annotation databases. Such databases are meant to act as repositories for our biological knowledge as we acquire and refine it. Hence, by definition, they are incomplete at any given time. In this paper, we describe a technique that improves our previous method for predicting novel Gene Ontology (GO) annotations by extracting implicit semantic relationships between genes and functions. In this work, we use a vector space model and a number of weighting schemes in addition to our previous latent semantic indexing approach. The technique described here takes into consideration the hierarchical structure of GO and can assign different weights to GO terms situated at different depths. The prediction abilities of 15 different weighting schemes are compared and evaluated. Nine of these schemes were previously used in other problem domains, while six are introduced in this paper. The best weighting scheme was a novel scheme, n2tn. Of the top 50 functional annotations predicted using this weighting scheme, we found support in the literature for 84 percent, while 6 percent were contradicted by the existing literature. For the remaining 10 percent, we did not find any relevant publications to confirm or contradict the predictions. The n2tn weighting scheme also outperformed the simple binary scheme used in our previous approach.
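The general idea of combining a weighted vector space model with latent semantic indexing can be illustrated with a minimal sketch. This is not the authors' pipeline: the gene and term names, the toy matrix, and the rank `k` are made up, and the inverse-document-frequency weighting is only a generic stand-in, since the abstract does not define n2tn or the other 14 schemes. The sketch also ignores the GO hierarchy and term depth.

```python
import numpy as np

# Hypothetical toy data: rows are genes, columns are GO terms,
# 1 means "gene is annotated with term". Names are placeholders.
genes = ["geneA", "geneB", "geneC", "geneD"]
terms = ["GO:t1", "GO:t2", "GO:t3", "GO:t4"]
A = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
], dtype=float)


def idf_weight(matrix):
    """Generic IDF-style weighting (an assumption, not the paper's n2tn):
    scale each term column by log(n_genes / n_genes_with_term)."""
    n_genes = matrix.shape[0]
    df = np.count_nonzero(matrix, axis=0)
    idf = np.log(n_genes / np.maximum(df, 1))
    return matrix * idf


def low_rank(matrix, k):
    """Rank-k approximation via truncated SVD, the core step of LSI."""
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]


def rank_candidates(original, weighted, k, gene_names, term_names):
    """Score every gene-term pair that is NOT currently annotated by its
    value in the rank-k reconstruction; higher means more strongly implied."""
    approx = low_rank(weighted, k)
    candidates = [
        (gene_names[i], term_names[j], float(approx[i, j]))
        for i in range(original.shape[0])
        for j in range(original.shape[1])
        if original[i, j] == 0
    ]
    return sorted(candidates, key=lambda c: c[2], reverse=True)


predictions = rank_candidates(A, idf_weight(A), 2, genes, terms)
for gene, term, score in predictions:
    print(f"{gene}\t{term}\t{score:.3f}")
```

Passing `A` itself as the `weighted` argument gives the simple binary scheme, so the two rankings can be compared side by side. Note that with this particular stand-in weighting, a term annotated to every gene would receive weight zero.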

Index Terms:
Gene function prediction, gene annotation, Gene Ontology, vector space model, latent semantic indexing, weighting schemes.
Bogdan Done, Purvesh Khatri, Arina Done, Sorin Drăghici, "Predicting Novel Human Gene Ontology Annotations Using Semantic Analysis," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 7, no. 1, pp. 91-99, Jan.-March 2010, doi:10.1109/TCBB.2008.29