Issue No.09 - September (2011 vol.23)
pp: 1373-1387
Dmitri V. Kalashnikov, University of California, Irvine
Sharad Mehrotra, University of California, Irvine
Jie Xu, University of California, Irvine
Nalini Venkatasubramanian, University of California, Irvine
ABSTRACT
Associating textual annotations (tags) with multimedia content is among the most effective ways to organize digital images and multimedia databases and to support search over them. Despite advances in multimedia analysis, effective tagging remains largely a manual process in which users add descriptive tags by hand, usually when uploading or browsing the collection, long after the pictures have been taken. This approach, however, is not convenient in all situations or for many applications, e.g., when users would like to publish and share pictures with others in real time. An alternative approach is to use a speech interface through which users specify image tags, which are then transcribed into textual annotations by automated speech recognizers. Such a speech-based approach has all the benefits of human tagging without the cumbersomeness and impracticality typically associated with human tagging in real time. The key challenge in such an approach is the potentially low recognition quality of state-of-the-art recognizers, especially in noisy environments. In this paper, we explore how semantic knowledge in the form of co-occurrence between image tags can be exploited to boost the quality of speech recognition. We formulate the problem of speech annotation as that of disambiguating among multiple alternatives offered by the recognizer. An empirical evaluation has been conducted over both a real speech recognizer's output and synthetic data sets. The results demonstrate significant advantages of the proposed approach compared to the recognizer's output under varying conditions.
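To make the disambiguation idea concrete, the following is a minimal sketch, not the method of the paper. It assumes each spoken tag comes back from the recognizer as a short list of alternatives, scores a combination of choices by summing a Jaccard-style co-occurrence score over tag pairs, and picks the best combination by plain exhaustive search. The function names, the Jaccard score, the toy corpus, and the exhaustive search are all illustrative assumptions; the index terms suggest the paper itself relies on maximum entropy and correlation-based models with a branch and bound algorithm.

```python
from itertools import combinations, product


def cooccurrence_counts(tagged_images):
    """Count how many images contain each tag and each unordered tag pair."""
    single, pair = {}, {}
    for tags in tagged_images:
        unique = sorted(set(tags))
        for tag in unique:
            single[tag] = single.get(tag, 0) + 1
        for a, b in combinations(unique, 2):  # a < b because unique is sorted
            pair[(a, b)] = pair.get((a, b), 0) + 1
    return single, pair


def jaccard(a, b, single, pair):
    """Jaccard score of two tags: images with both / images with either."""
    if a == b:
        return 0.0
    key = (a, b) if a < b else (b, a)
    both = pair.get(key, 0)
    either = single.get(a, 0) + single.get(b, 0) - both
    return both / either if either else 0.0


def disambiguate(n_best_lists, single, pair):
    """Choose one alternative per utterance, maximizing total pairwise score.

    Exhaustive search over all combinations (an assumption made for brevity;
    it is exponential in the number of utterances).
    """
    best, best_score = (), -1.0
    for combo in product(*n_best_lists):
        score = sum(jaccard(a, b, single, pair)
                    for a, b in combinations(combo, 2))
        if score > best_score:
            best, best_score = combo, score
    return list(best)


# Hypothetical corpus of previously tagged images.
corpus = [
    ["beach", "sunset", "ocean"],
    ["beach", "ocean", "sand"],
    ["peach", "fruit", "market"],
    ["sunset", "ocean"],
]
single, pair = cooccurrence_counts(corpus)

# Hypothetical recognizer output: alternatives for three spoken tags.
n_best = [["peach", "beach"], ["notion", "ocean"], ["sunset", "subset"]]
result = disambiguate(n_best, single, pair)
print(result)  # ['beach', 'ocean', 'sunset']
```

In this toy example the recognizer's first choices ("peach", "notion") lose out because they never co-occur with the other candidate tags in the corpus, whereas "beach", "ocean", and "sunset" frequently appear together.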
INDEX TERMS
Using speech for tagging and annotation, using semantics to improve ASR, maximum entropy approach, correlation-based approach, branch and bound algorithm.
CITATION
Dmitri V. Kalashnikov, Sharad Mehrotra, Jie Xu, Nalini Venkatasubramanian, "A Semantics-Based Approach for Speech Annotation of Images", IEEE Transactions on Knowledge & Data Engineering, vol.23, no. 9, pp. 1373-1387, September 2011, doi:10.1109/TKDE.2010.185