Issue No.05 - Sept.-Oct. (2013 vol.10)
pp: 1218-1233
Su Yan, IBM Almaden Res. Center, San Jose, CA, USA
W. Scott Spangler, IBM Almaden Res. Center, San Jose, CA, USA
Ying Chen, IBM Almaden Res. Center, San Jose, CA, USA
ABSTRACT
Automating the extraction of chemical names from text has significant value for biomedical and life science research. A major barrier in this task is the difficulty of obtaining a sizable, good-quality data set to train a reliable entity extraction model. Another difficulty is selecting informative features of chemical names, since this normally requires comprehensive domain knowledge of chemical nomenclature. Leveraging random text generation techniques, we explore the idea of automatically creating training sets for the task of chemical name extraction. Assuming the availability of an incomplete list of chemical names, called a dictionary, we are able to generate well-controlled, random, yet realistic chemical-like training documents. We statistically analyze the construction of chemical names based on the incomplete dictionary and propose a series of new features without relying on any domain knowledge. Compared to state-of-the-art models learned from manually labeled data and domain knowledge, our solution shows better or comparable results in annotating real-world data, with less human effort. Moreover, we report an interesting observation about the language of chemical names: both the structural and semantic components of chemical names follow a Zipfian distribution, which resembles many natural languages.
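The idea of building labeled training text from a dictionary can be illustrated with a minimal sketch. The abstract does not describe the actual generator, so the example below swaps in a much simpler technique, template filling, and every name in it (`DICTIONARY`, `TEMPLATES`, `make_example`, the `B-CHEM`/`I-CHEM`/`O` tags) is a hypothetical stand-in rather than the authors' method.

```python
import random

# Hypothetical stand-ins: the paper's dictionary and text generator are not
# given in the abstract, so these names and templates are illustrative only.
DICTIONARY = ["2-methylpropan-1-ol", "acetic acid", "benzene"]
TEMPLATES = [
    "The mixture was treated with {chem} at room temperature .",
    "A solution of {chem} was added dropwise .",
]


def make_example(rng):
    """Return (tokens, labels) for one synthetic sentence with BIO tags."""
    template = rng.choice(TEMPLATES)
    name = rng.choice(DICTIONARY)
    tokens, labels = [], []
    for word in template.split():
        if word == "{chem}":
            # Whitespace-split the chemical name; the first token is tagged
            # B-CHEM and any remaining tokens are tagged I-CHEM.
            parts = name.split()
            tokens.extend(parts)
            labels.extend(["B-CHEM"] + ["I-CHEM"] * (len(parts) - 1))
        else:
            tokens.append(word)
            labels.append("O")
    return tokens, labels


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    for _ in range(3):
        tokens, labels = make_example(rng)
        print(list(zip(tokens, labels)))
```

Because the labels are produced together with the text, no manual annotation is needed; the resulting token and label sequences are in the shape a sequence labeler such as a conditional random field typically expects.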
INDEX TERMS
Chemical extraction, Hidden Markov models, Feature extraction, Training, Data mining, Systematics, Data models, conditional random fields, Chemical name extraction, formal grammar, feature design, IUPAC names, patent analysis, drug research
CITATION
Su Yan, W. Scott Spangler, Ying Chen, "Chemical Name Extraction Based on Automatic Training Data Generation and Rich Feature Set", IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 10, no. 5, pp. 1218-1233, Sept.-Oct. 2013, doi:10.1109/TCBB.2013.101