Issue No.04 - April (2009 vol.31)
pp: 591-606
Josef Sivic, INRIA, WILLOW Project-Team, CNRS/ENS/INRIA UMR, France
Andrew Zisserman, University of Oxford, Oxford
ABSTRACT
We describe an approach to object retrieval that searches for and localizes all occurrences of an object in a video, given a query image of the object. The object is represented by a set of viewpoint-invariant region descriptors, so that recognition can proceed successfully despite changes in viewpoint, illumination, and partial occlusion. The temporal continuity of the video within a shot is used to track the regions in order to reject those that are unstable. Efficient retrieval is achieved by employing methods from statistical text retrieval, including inverted file systems and text and document frequency weightings. This requires a visual analogue of a word, which is provided here by vector quantizing the region descriptors. The final ranking also depends on the spatial layout of the regions. The result is that retrieval is immediate, returning a ranked list of shots in the manner of Google. We report results for object retrieval on the full-length feature films 'Groundhog Day', 'Casablanca', and 'Run Lola Run', including searches from within the movie and searches specified by external images downloaded from the Internet. We investigate retrieval performance with respect to different quantizations of region descriptors and compare the performance of several ranking measures.
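The text-retrieval analogy in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the cluster centers (`centroids`) are already given, uses plain nearest-centroid assignment as the vector quantizer, applies a standard tf-idf weighting with cosine scoring, and leaves out region tracking and the spatial-layout re-ranking entirely. All function and variable names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def quantize(descriptor, centroids):
    """Map a descriptor to its nearest centroid index (its 'visual word')."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((d - c) ** 2 for d, c in zip(descriptor, centroids[i])),
    )


def build_index(shots, centroids):
    """Build an inverted file from {shot_id: [descriptor, ...]}.

    Returns (inverted, idf, norms), where inverted[word][shot_id] is the
    tf-idf weight of that visual word in that shot.
    """
    counts = {
        sid: Counter(quantize(d, centroids) for d in descs)
        for sid, descs in shots.items()
    }
    n_shots = len(counts)
    # Document frequency: number of shots containing each visual word.
    df = Counter(w for c in counts.values() for w in c)
    idf = {w: math.log(n_shots / df[w]) for w in df}

    inverted = defaultdict(dict)
    norms = {}
    for sid, c in counts.items():
        total = sum(c.values())
        weights = {w: (n / total) * idf[w] for w, n in c.items()}
        norms[sid] = math.sqrt(sum(v * v for v in weights.values())) or 1.0
        for w, v in weights.items():
            if v > 0:
                inverted[w][sid] = v
    return inverted, idf, norms


def search(query_descs, centroids, inverted, idf, norms):
    """Rank shots by cosine similarity of tf-idf vectors, best first."""
    c = Counter(quantize(d, centroids) for d in query_descs)
    total = sum(c.values())
    q = {w: (n / total) * idf.get(w, 0.0) for w, n in c.items()}
    qnorm = math.sqrt(sum(v * v for v in q.values())) or 1.0

    scores = defaultdict(float)
    # Only shots that share a visual word with the query are ever touched.
    for w, qv in q.items():
        for sid, dv in inverted.get(w, {}).items():
            scores[sid] += qv * dv
    return sorted(
        ((s / (qnorm * norms[sid]), sid) for sid, s in scores.items()),
        reverse=True,
    )


# Toy data: 2-D "descriptors" and three made-up cluster centers.
centroids = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]]
shots = {
    "a": [[0.1, 0.0], [9.9, 0.2]],
    "b": [[0.0, 9.8], [0.2, 0.1]],
    "c": [[0.0, 10.1], [0.1, 9.9]],
}
inverted, idf, norms = build_index(shots, centroids)
ranking = search([[10.0, 0.1]], centroids, inverted, idf, norms)
```

Because the query is scored only against the posting lists of the visual words it contains, the cost depends on the number of matching entries rather than on the total number of shots, which is the property the inverted file is meant to provide.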
INDEX TERMS
Object recognition, Image/video retrieval
CITATION
Josef Sivic, Andrew Zisserman, "Efficient Visual Search of Videos Cast as Text Retrieval", IEEE Transactions on Pattern Analysis & Machine Intelligence, vol.31, no. 4, pp. 591-606, April 2009, doi:10.1109/TPAMI.2008.111