Multistage Gene Normalization and SVM-Based Ranking for Protein Interactor Extraction in Full-Text Articles
July-September 2010 (vol. 7 no. 3)
pp. 412-420
Hong-Jie Dai, National Tsing-Hua University, Hsinchu, Taiwan
Po-Ting Lai, Yuan Ze University, Chung-Li, Taiwan
Richard Tzong-Han Tsai, Yuan Ze University, Chung-Li, Taiwan
The interactor normalization task (INT) is to identify genes that play the interactor role in protein-protein interactions (PPIs), to map these genes to unique IDs, and to rank them according to their normalized confidence. INT has two subtasks: gene normalization (GN) and interactor ranking. The main difficulties of INT GN are identifying genes across species and using full papers instead of abstracts. To tackle these problems, we developed a multistage GN algorithm and a ranking method, which exploit information in different parts of a paper. Our system achieved a promising AUC of 0.43471. Using the multistage GN algorithm, we have been able to improve system performance (AUC) by 1.719 percent compared to a one-stage GN algorithm. Our experimental results also show that with full text, versus abstract only, INT AUC performance was 22.6 percent higher.
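
The ranking-and-evaluation idea in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the section weights stand in for a trained SVM's decision function, the gene IDs and mention counts are invented, and the AUC computed is a ROC-style pairwise AUC, which may differ from the AUC variant used in the paper's evaluation.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical per-section weights. The paper ranks with an SVM; this fixed
# linear scorer is only a stand-in so the example runs without training data.
SECTION_WEIGHTS: Dict[str, float] = {"title": 3.0, "abstract": 2.0, "body": 1.0}


def score(mentions: Dict[str, int]) -> float:
    """Linear confidence score from per-section mention counts."""
    return sum(SECTION_WEIGHTS.get(sec, 0.0) * n for sec, n in mentions.items())


def rank(candidates: Dict[str, Dict[str, int]]) -> List[Tuple[str, float]]:
    """Return (gene_id, score) pairs sorted by descending score."""
    scored = [(gid, score(m)) for gid, m in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def pairwise_auc(ranked: List[Tuple[str, float]], gold: Set[str]) -> float:
    """ROC-style AUC: fraction of (interactor, non-interactor) pairs in which
    the interactor has the higher score; ties count as half a win."""
    pos = [s for gid, s in ranked if gid in gold]
    neg = [s for gid, s in ranked if gid not in gold]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative candidate")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Invented candidates: normalized gene IDs with mention counts per section.
candidates = {
    "GENE_A": {"title": 1, "abstract": 2, "body": 5},
    "GENE_B": {"body": 3},
    "GENE_C": {"abstract": 1, "body": 4},
    "GENE_D": {"body": 8},
}
gold = {"GENE_A", "GENE_C"}  # invented set of true interactors

ranked = rank(candidates)
print([gid for gid, _ in ranked])   # ['GENE_A', 'GENE_D', 'GENE_C', 'GENE_B']
print(pairwise_auc(ranked, gold))   # 0.75
```

Here GENE_D outranks the true interactor GENE_C on body mentions alone, which is why the AUC falls short of 1.0. In this toy data, GENE_C is mentioned in the abstract and GENE_D is not, so a larger abstract weight could move it above GENE_D.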

Index Terms:
Data mining, feature evaluation and selection, mining methods and algorithms, text mining, scientific databases.
Hong-Jie Dai, Po-Ting Lai, Richard Tzong-Han Tsai, "Multistage Gene Normalization and SVM-Based Ranking for Protein Interactor Extraction in Full-Text Articles," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 7, no. 3, pp. 412-420, July-Sept. 2010, doi:10.1109/TCBB.2010.45