Chemical Name Extraction Based on Automatic Training Data Generation and Rich Feature Set
Sept.-Oct. 2013 (vol. 10 no. 5)
pp. 1218-1233
Su Yan, IBM, San Jose
W. Scott Spangler, IBM, San Jose
Ying Chen, IBM, San Jose
Automating the extraction of chemical names from text has significant value for biomedical and life science research. A major barrier in this task is the difficulty of obtaining a sizable, good-quality data set to train a reliable entity extraction model. Another difficulty is selecting informative features of chemical names, since this requires comprehensive domain knowledge of chemical nomenclature. Leveraging random text generation techniques, we explore the idea of automatically creating training sets for the task of chemical name extraction. Assuming the availability of an incomplete list of chemical names, called a dictionary, we are able to generate well-controlled, random, yet realistic chemical-like training documents. We statistically analyze the construction of chemical names based on the incomplete dictionary, and propose a series of new features without relying on any domain knowledge. Compared to state-of-the-art models learned from manually labeled data and domain knowledge, our solution shows better or comparable results in annotating real-world data with less human effort. Moreover, we report an interesting observation about the language of chemical names: both the structural and semantic components of chemical names follow a Zipfian distribution, as is the case in many natural languages.
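As a rough illustration of the general idea of building labeled training text from a name dictionary, the following minimal Python sketch places one dictionary entry between randomly chosen filler words and emits matching token-level tags. It is not the authors' generation procedure, which the abstract does not detail; the function name make_example, the BIO tag labels, the toy dictionary, and the filler vocabulary are all assumptions made for illustration.

```python
import random


def make_example(dictionary, fillers, rng, n_context=4):
    """Return (tokens, tags) for one synthetic sentence containing one name.

    Hypothetical helper: surrounds a randomly chosen dictionary entry with
    random filler words and labels the entry with B-CHEM / I-CHEM tags.
    """
    name_tokens = rng.choice(dictionary).split()
    left = [rng.choice(fillers) for _ in range(n_context)]
    right = [rng.choice(fillers) for _ in range(n_context)]
    tokens = left + name_tokens + right
    tags = (
        ["O"] * len(left)
        + ["B-CHEM"]
        + ["I-CHEM"] * (len(name_tokens) - 1)
        + ["O"] * len(right)
    )
    return tokens, tags


# Toy inputs (assumed for illustration only).
rng = random.Random(0)
dictionary = ["acetic acid", "2-methylpropan-1-ol", "sodium chloride"]
fillers = ["the", "compound", "was", "dissolved", "in", "with", "and", "mixture"]

tokens, tags = make_example(dictionary, fillers, rng)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Because the labels are produced together with the text, no manual annotation step is needed; a sequence model would then be trained on many such (tokens, tags) pairs.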
Index Terms:
Chemical extraction, Hidden Markov models, Feature extraction, Training, Data mining, Systematics, Data models, conditional random fields, Chemical name extraction, formal grammar, feature design, IUPAC names, patent analysis, drug research
Citation:
Su Yan, W. Scott Spangler, Ying Chen, "Chemical Name Extraction Based on Automatic Training Data Generation and Rich Feature Set," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 10, no. 5, pp. 1218-1233, Sept.-Oct. 2013, doi:10.1109/TCBB.2013.101