Multistage Gene Normalization and SVM-Based Ranking for Protein Interactor Extraction in Full-Text Articles
July-September 2010 (vol. 7 no. 3)
pp. 412-420
Hong-Jie Dai, National Tsing-Hua University, Hsinchu, Taiwan
Po-Ting Lai, Yuan Ze University, Chung-Li, Taiwan
Richard Tzong-Han Tsai, Yuan Ze University, Chung-Li, Taiwan
The interactor normalization task (INT) is to identify genes that play the interactor role in protein-protein interactions (PPIs), to map these genes to unique IDs, and to rank them according to their normalized confidence. INT has two subtasks: gene normalization (GN) and interactor ranking. The main difficulties of INT GN are identifying genes across species and working with full papers instead of abstracts. To tackle these problems, we developed a multistage GN algorithm and a ranking method, both of which exploit information in different parts of a paper. Our system achieved a promising AUC of 0.43471. The multistage GN algorithm improved system performance (AUC) by 1.719 percent compared to a one-stage GN algorithm. Our experimental results also show that using full text, rather than the abstract only, raised INT AUC performance by 22.6 percent.
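
To make the shape of the INT output concrete, the following is a minimal sketch of ranking normalized gene IDs by a normalized confidence score. It is not the authors' method: the paper uses an SVM-based ranker, whereas this toy example swaps in a simple weighted count of mentions per section. The section names, the weights, the function `rank_interactors`, and the IDs `GENE_A`, `GENE_B`, `GENE_C` are all hypothetical placeholders introduced for illustration.

```python
from collections import defaultdict

# Hypothetical section weights (an assumption, not taken from the paper):
# mentions in more prominent parts of an article count for more.
SECTION_WEIGHTS = {"title": 3.0, "abstract": 2.0, "results": 1.5, "body": 1.0}


def rank_interactors(mentions):
    """Rank gene IDs by a confidence score normalized to the range (0, 1].

    mentions: iterable of (gene_id, section) pairs, i.e. the output of a
    gene normalization step that has already mapped mentions to unique IDs.
    Returns a list of (gene_id, confidence), highest confidence first.
    """
    scores = defaultdict(float)
    for gene_id, section in mentions:
        # Unknown sections fall back to the lowest weight.
        scores[gene_id] += SECTION_WEIGHTS.get(section, 1.0)
    if not scores:
        return []
    top = max(scores.values())
    # Sort by descending score; break ties by ID for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(gene_id, score / top) for gene_id, score in ranked]


# Placeholder IDs standing in for the unique database identifiers.
mentions = [
    ("GENE_A", "title"),
    ("GENE_A", "results"),
    ("GENE_B", "abstract"),
    ("GENE_C", "body"),
]
ranking = rank_interactors(mentions)
for gene_id, confidence in ranking:
    print(f"{gene_id}\t{confidence:.3f}")
```

A ranked list of this form is what an AUC-style evaluation would score: the earlier the true interactors appear, the higher the result.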

Index Terms:
Data mining, feature evaluation and selection, mining methods and algorithms, text mining, scientific databases.
Hong-Jie Dai, Po-Ting Lai, Richard Tzong-Han Tsai, "Multistage Gene Normalization and SVM-Based Ranking for Protein Interactor Extraction in Full-Text Articles," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 7, no. 3, pp. 412-420, July-Sept. 2010, doi:10.1109/TCBB.2010.45
