Issue No.02 - February (2012 vol.24)
pp: 236-250
Chien-Chih Chen, National Taiwan University, Taipei
Kai-Hsiang Yang, National Taipei University of Education, Taipei
Chuen-Liang Chen, National Taiwan University, Taipei
Jan-Ming Ho, Academia Sinica, Taipei
The dramatic increase in the number of academic publications has led to a growing demand for efficient organization of these resources to meet researchers' needs. As a result, a number of network services have compiled databases from public resources scattered over the Internet. However, different conferences and journals adopt different citation styles, so accurately extracting metadata from a citation string formatted in one of thousands of styles is an interesting problem, and it has attracted a great deal of research attention in recent years. In this paper, we present BibPro, a citation parser based on the notion of sequence alignment that extracts the components of a citation string. To demonstrate the efficacy of BibPro, we conducted experiments on three benchmark data sets. The results show that BibPro achieved over 90 percent accuracy on each benchmark. Even when trained on citations and associated metadata retrieved from the web, BibPro still achieves reasonable performance.
Data integration, digital libraries, information extraction, sequence alignment.
Chien-Chih Chen, Kai-Hsiang Yang, Chuen-Liang Chen, Jan-Ming Ho, "BibPro: A Citation Parser Based on Sequence Alignment", IEEE Transactions on Knowledge & Data Engineering, vol. 24, no. 2, pp. 236-250, February 2012, doi:10.1109/TKDE.2010.231
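The abstract describes BibPro as built on the notion of sequence alignment but does not spell out the encoding or scoring scheme it uses. As a rough illustration of the general idea only, the sketch below encodes citation strings as sequences of coarse token symbols and aligns two such sequences with the classic Needleman-Wunsch global alignment. The names `encode`, `align`, and `GAP`, the symbol alphabet, and the match/mismatch/gap scores are all assumptions made for this example, not details taken from BibPro.

```python
import re

GAP = "<gap>"  # hypothetical gap marker used only in this sketch


def encode(citation):
    """Map each token of a citation string to a coarse symbol (assumed alphabet)."""
    symbols = []
    for tok in re.findall(r"\d+|[A-Za-z]+|[^\sA-Za-z\d]", citation):
        if tok.isdigit():
            # Four-digit numbers are treated as year-like, others as plain numbers.
            symbols.append("Y" if len(tok) == 4 else "N")
        elif tok.isalpha():
            # Single letters are treated as initials, longer runs as words.
            symbols.append("I" if len(tok) == 1 else "W")
        else:
            symbols.append(tok)  # punctuation is kept as its own symbol
    return symbols


def align(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment of two symbol sequences."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append(GAP)
            i -= 1
        else:
            out_a.append(GAP)
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], out_a[::-1], out_b[::-1]


if __name__ == "__main__":
    # Two made-up citation fragments in different styles.
    s1 = encode("J. Smith, Parsing Citations, 2012")
    s2 = encode("Smith J. Parsing Citations. 2012")
    total, row1, row2 = align(s1, s2)
    print(total)
    print(row1)
    print(row2)
```

In a sketch like this, positions where an unlabeled citation lines up with an already labeled one could be used to carry field labels across, which is one plausible way an alignment-based parser might work; whether BibPro does this is not stated in the abstract.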