Issue No.04 - April (2013 vol.35)
pp: 797-812
Yansong Feng , Inst. of Comput. Sci. & Technol., Peking Univ., Beijing, China
M. Lapata , Inst. for Language, Cognition & Comput., Univ. of Edinburgh, Edinburgh, UK
This paper is concerned with the task of automatically generating captions for images, which is important for many image-related applications. Examples include video and image retrieval as well as the development of tools that help visually impaired individuals access pictorial information. Our approach leverages the vast resource of pictures available on the web and the fact that many of them are captioned and colocated with thematically related documents. Our model learns to create captions from a database of news articles, the pictures embedded in them, and their captions, and consists of two stages. Content selection identifies what the image and accompanying article are about, whereas surface realization determines how to verbalize the chosen content. We approximate content selection with a probabilistic image annotation model that suggests keywords for an image. The model postulates that images and their textual descriptions are generated by a shared set of latent variables (topics) and is trained on a weakly labeled dataset (which treats the captions and associated news articles as image labels). Inspired by recent work in summarization, we propose extractive and abstractive surface realization models. Experimental results show that it is viable to generate captions that are pertinent to the specific content of an image and its associated article, while permitting creativity in the description. Indeed, the output of our abstractive model compares favorably to handwritten captions and is often superior to extractive methods.
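The two-stage pipeline described above can be illustrated with a minimal sketch of the extractive case, written under stated assumptions. The abstract does not specify how sentences are scored, so the cosine similarity used here is an assumption, not the authors' method. The `keywords` dictionary is a hypothetical stand-in for the output of the annotation model, and the function names and the example article are invented for illustration.

```python
import re
from collections import Counter
from math import sqrt


def tokenize(text):
    """Lowercase a string and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def extractive_caption(keyword_probs, sentences):
    """Return the article sentence most similar to the suggested keywords.

    keyword_probs: dict mapping keyword -> probability (assumed to come
    from a content selection stage; hypothetical values in this sketch).
    sentences: list of sentences from the accompanying article.
    """
    if not sentences:
        raise ValueError("need at least one sentence")

    def score(sentence):
        # Represent the sentence as a bag of word counts.
        return cosine(keyword_probs, Counter(tokenize(sentence)))

    return max(sentences, key=score)


# Hypothetical keyword distribution standing in for the annotation model.
keywords = {"flood": 0.4, "river": 0.3, "rescue": 0.2, "boat": 0.1}

# Hypothetical article sentences.
article = [
    "The government announced new funding on Monday.",
    "Rescue teams used a boat to reach families stranded by the flood.",
    "Officials said the river had reached its highest level in decades.",
]

caption = extractive_caption(keywords, article)
print(caption)
```

In this toy example the second sentence is selected because it shares three of the four keywords. An abstractive model, by contrast, would assemble a new word sequence rather than return an existing sentence; that case is not sketched here.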
Visualization, Humans, Databases, Vocabulary, Probabilistic logic, Data models, Noise measurement, topic models, Caption generation, image annotation, summarization
Yansong Feng, M. Lapata, "Automatic Caption Generation for News Images", IEEE Transactions on Pattern Analysis & Machine Intelligence, vol.35, no. 4, pp. 797-812, April 2013, doi:10.1109/TPAMI.2012.118