Issue No. 12, Dec. 2013 (vol. 35), pp. 2891-2903
Girish Kulkarni , Comput. Sci. Dept., Stony Brook Univ., Stony Brook, NY, USA
Visruth Premraj , Comput. Sci. Dept., Stony Brook Univ., Stony Brook, NY, USA
Vicente Ordonez , Comput. Sci. Dept., Stony Brook Univ., Stony Brook, NY, USA
Sagnik Dhar , Comput. Sci. Dept., Stony Brook Univ., Stony Brook, NY, USA
Siming Li , Comput. Sci. Dept., Stony Brook Univ., Stony Brook, NY, USA
Yejin Choi , Comput. Sci. Dept., Stony Brook Univ., Stony Brook, NY, USA
Alexander C. Berg , Comput. Sci. Dept., Stony Brook Univ., Stony Brook, NY, USA
Tamara L. Berg , Comput. Sci. Dept., Stony Brook Univ., Stony Brook, NY, USA
ABSTRACT
We present a system that automatically generates natural language descriptions of images. The system consists of two parts. The first part, content planning, smooths the output of computer-vision-based detection and recognition algorithms with statistics mined from large pools of visually descriptive text to determine the best content words for describing an image. The second part, surface realization, chooses words to construct natural language sentences based on the predicted content and general statistics from natural language. We present multiple approaches to the surface realization step and evaluate each using automatic measures of similarity to human-generated reference descriptions. We also collect forced-choice human evaluations comparing descriptions from the proposed generation system with descriptions from competing approaches. The proposed system is very effective at producing relevant sentences for images, and it generates descriptions that are notably more faithful to the specific image content than previous work.
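To make the two-stage pipeline concrete, here is a minimal Python sketch under stated assumptions: the detector confidences, attribute scores, text-mined co-occurrence table, and the single output template are all hypothetical stand-ins for illustration, not the paper's models or data. Content planning rescores (attribute, object) pairs by combining vision confidence with corpus statistics; surface realization fills a template with the winning content words.

```python
# Minimal sketch of the two-stage pipeline described in the abstract.
# All numbers and tables below are invented for illustration; the actual
# system uses trained detectors and corpus-scale text statistics.

detections = {            # hypothetical object detector confidences
    "dog": 0.9,
    "sofa": 0.7,
}
attribute_scores = {      # hypothetical attribute classifier scores
    ("brown", "dog"): 0.60, ("green", "dog"): 0.50,
    ("brown", "sofa"): 0.40, ("wooden", "sofa"): 0.45,
}
text_stats = {            # hypothetical co-occurrences mined from descriptive text
    ("brown", "dog"): 0.30, ("green", "dog"): 0.01,
    ("brown", "sofa"): 0.10, ("wooden", "sofa"): 0.20,
}

def plan_content():
    """Content planning: for each detected object, pick the attribute whose
    vision score, smoothed by the text statistics, is highest."""
    plan = []
    for obj in detections:
        candidates = [key for key in attribute_scores if key[1] == obj]
        attr, _ = max(
            candidates,
            key=lambda key: attribute_scores[key] * text_stats.get(key, 1e-6),
        )
        plan.append((attr, obj))
    return plan

def realize(plan):
    """Surface realization: fill a fixed sentence template with the planned
    content words (the simplest of the strategies the paper compares)."""
    phrases = [f"a {attr} {obj}" for attr, obj in plan]
    return "There is " + " and ".join(phrases) + "."

print(realize(plan_content()))   # -> There is a brown dog and a wooden sofa.
```

The sketch mirrors only the division of labor. In the paper itself, content planning is a joint inference over objects, attributes, and spatial relations rather than an independent per-object choice, and several surface realizers (including template-based and n-gram-based strategies) are compared, as the abstract notes.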
INDEX TERMS
Computer vision, Image segmentation, Natural language processing, Context awareness, Information analysis, Image description generation
CITATION
Girish Kulkarni, Visruth Premraj, Vicente Ordonez, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C. Berg, Tamara L. Berg, "BabyTalk: Understanding and Generating Simple Image Descriptions," IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 35, no. 12, pp. 2891-2903, Dec. 2013, doi:10.1109/TPAMI.2012.162