Issue No.03 - July-September (2010 vol.7)
pp: 412-420
Hong-Jie Dai , National Tsing-Hua University, Hsinchu, Taiwan
Po-Ting Lai , Yuan Ze University, Ching-Li, Taiwan
Richard Tzong-Han Tsai , Yuan Ze University, Ching-Li, Taiwan
ABSTRACT
The interactor normalization task (INT) is to identify genes that play the interactor role in protein-protein interactions (PPIs), to map these genes to unique IDs, and to rank them according to their normalized confidence. INT has two subtasks: gene normalization (GN) and interactor ranking. The main difficulties of INT GN are identifying genes across species and using full papers instead of abstracts. To tackle these problems, we developed a multistage GN algorithm and a ranking method, which exploit information in different parts of a paper. Our system achieved a promising AUC of 0.43471. Using the multistage GN algorithm, we have been able to improve system performance (AUC) by 1.719 percent compared to a one-stage GN algorithm. Our experimental results also show that with full text, versus abstract only, INT AUC performance was 22.6 percent higher.
INDEX TERMS
Data mining, feature evaluation and selection, mining methods and algorithms, text mining, scientific databases.
CITATION
Hong-Jie Dai, Po-Ting Lai, Richard Tzong-Han Tsai, "Multistage Gene Normalization and SVM-Based Ranking for Protein Interactor Extraction in Full-Text Articles," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 7, no. 3, pp. 412-420, July-September 2010, doi:10.1109/TCBB.2010.45