A Framework for Semisupervised Feature Generation and Its Applications in Biomedical Literature Mining
March/April 2011 (vol. 8 no. 2)
pp. 294-307
Yanpeng Li, Dalian University of Technology, Dalian and Drexel University, Philadelphia
Xiaohua Hu, Drexel University, Philadelphia
Hongfei Lin, Dalian University of Technology, Dalian
Zhihao Yang, Dalian University of Technology, Dalian
Feature representation is essential to machine learning and text mining. In this paper, we present a feature coupling generalization (FCG) framework for generating new features from unlabeled data. It selects two special types of features, example-distinguishing features (EDFs) and class-distinguishing features (CDFs), from the original feature set, and then generalizes EDFs into higher-level features based on their coupling degrees with CDFs in unlabeled data. The advantage is that EDFs that are extremely sparse in labeled data can be enriched by their co-occurrences with CDFs in unlabeled data, so that the performance of these low-frequency features can be greatly boosted and new information from unlabeled data can be incorporated. We apply this approach to three tasks in biomedical literature mining: gene named entity recognition (NER), protein-protein interaction extraction (PPIE), and text classification (TC) for gene ontology (GO) annotation. New features are generated from over 20 GB of unlabeled PubMed abstracts. The experimental results on BioCreative 2, the AIMED corpus, and the TREC 2005 Genomics Track show that 1) FCG can make good use of the sparse features ignored by supervised learning; 2) it improves the performance of supervised baselines by 7.8 percent, 5.0 percent, and 5.8 percent, respectively, on the three tasks; and 3) our methods achieve F-scores of 89.1 and 64.5 and a normalized utility of 60.1 on the three benchmark data sets.
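
The core idea of the abstract, turning a sparse EDF into a dense vector of coupling degrees with CDFs measured on unlabeled text, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the function name `coupling_features`, the use of document-level co-occurrence, pointwise mutual information (PMI) as the coupling measure, and the choice to score zero co-occurrence as 0.0.

```python
import math
from collections import Counter
from itertools import product


def coupling_features(unlabeled_docs, edfs, cdfs):
    """Map each EDF to a vector of coupling scores, one per CDF.

    Assumptions (illustrative only): features are plain tokens,
    co-occurrence is counted per document, and the coupling degree
    is PMI = log(P(e, c) / (P(e) * P(c))).
    """
    n_docs = len(unlabeled_docs)
    edf_count, cdf_count, joint_count = Counter(), Counter(), Counter()

    for doc in unlabeled_docs:
        tokens = set(doc)
        present_edfs = tokens & edfs
        present_cdfs = tokens & cdfs
        edf_count.update(present_edfs)
        cdf_count.update(present_cdfs)
        # Count every (EDF, CDF) pair that appears in the same document.
        joint_count.update(product(present_edfs, present_cdfs))

    cdf_order = sorted(cdfs)  # fixed column order for the new features
    features = {}
    for e in edfs:
        vector = []
        for c in cdf_order:
            joint = joint_count[(e, c)]
            if joint == 0:
                # Simplification: no evidence of coupling is scored as 0.0.
                vector.append(0.0)
            else:
                vector.append(
                    math.log(joint * n_docs / (edf_count[e] * cdf_count[c]))
                )
        features[e] = vector
    return cdf_order, features


# Toy unlabeled corpus: each document is a list of tokens.
docs = [
    ["brca1", "gene", "expression"],
    ["brca1", "gene"],
    ["paris", "city"],
    ["tp53", "gene"],
]
cdf_order, features = coupling_features(
    docs, edfs={"brca1", "paris"}, cdfs={"gene", "city"}
)
```

In this toy corpus, `brca1` never appears in labeled data but couples with the CDF `gene` in unlabeled text, so its generated vector carries a positive score in the `gene` column. A supervised learner could then use such vectors in place of, or alongside, the sparse original feature.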

Index Terms:
Feature coupling generalization, biomedical literature mining, semisupervised learning, named entity recognition, protein-protein interaction extraction, text classification.
Citation:
Yanpeng Li, Xiaohua Hu, Hongfei Lin, Zhihao Yang, "A Framework for Semisupervised Feature Generation and Its Applications in Biomedical Literature Mining," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 8, no. 2, pp. 294-307, March-April 2011, doi:10.1109/TCBB.2010.99