Pattern Recognition, International Conference on (2010)
Aug. 23, 2010 to Aug. 26, 2010
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/ICPR.2010.792
A system called NewsStand is introduced that automatically extracts images from news articles. The system takes RSS feeds of news articles and applies an online clustering algorithm so that articles belonging to the same news topic are associated with the same cluster. The feature vector associated with each cluster is then used to extract images from the news articles that form the cluster. First, the caption text associated with each image embedded in a news article is determined by analyzing the structure of the article's HTML page. If the caption and the feature vector of the cluster are found to contain keywords in common, the image is added to an image repository. Additional meta-information is then associated with each image, such as its caption, the cluster features, and the names of people in the news article. A very large repository containing more than 983k images from 12 million news articles was built using this approach. This repository also contains more than 86.8 million keywords associated with the images. The key contribution of this work is that it combines clustering and natural language processing tasks to automatically create a large corpus of news images with good-quality tags or meta-information, so that interesting vision tasks can be performed on it.
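The pipeline described above (cluster incoming articles online, then keep an image only if its caption shares keywords with the cluster's feature vector) can be illustrated with a minimal sketch. This is not the authors' implementation: the raw term-count feature vectors, the cosine-similarity threshold, the tiny stopword list, and the names `OnlineClusterer`, `shared_keywords`, and `select_images` are all assumptions made for illustration, and caption extraction from the HTML page is taken as already done.

```python
import math
import re
from collections import Counter

# Illustrative stopword list (an assumption, not from the paper).
STOPWORDS = {"the", "a", "an", "of", "in", "on", "at", "to", "and",
             "for", "is", "with", "after", "as"}

def keywords(text):
    """Lowercase alphanumeric tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower())
            if w not in STOPWORDS]

def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

class OnlineClusterer:
    """Hypothetical single-pass clusterer: each article joins the most
    similar existing cluster, or starts a new one if none is close enough."""

    def __init__(self, threshold=0.3):
        self.threshold = threshold  # assumed similarity cutoff
        self.clusters = []          # one Counter (feature vector) per cluster

    def add(self, article_text):
        vec = Counter(keywords(article_text))
        best, best_sim = None, 0.0
        for i, cluster_vec in enumerate(self.clusters):
            sim = cosine(vec, cluster_vec)
            if sim > best_sim:
                best, best_sim = i, sim
        if best is not None and best_sim >= self.threshold:
            self.clusters[best].update(vec)  # fold article into the cluster
            return best
        self.clusters.append(vec)
        return len(self.clusters) - 1

def shared_keywords(caption, cluster_vec, top_n=20):
    """Keywords common to the caption and the cluster's top_n terms."""
    top_terms = {t for t, _ in cluster_vec.most_common(top_n)}
    return sorted(top_terms & set(keywords(caption)))

def select_images(images, cluster_vec):
    """Keep (url, caption) pairs whose caption overlaps the cluster features,
    recording the matched keywords as meta-information."""
    repository = []
    for url, caption in images:
        matched = shared_keywords(caption, cluster_vec)
        if matched:
            repository.append({"url": url, "caption": caption,
                               "matched": matched})
    return repository
```

In this sketch, an image whose caption has no keyword in common with its cluster (for example an advertisement or a site logo) is simply dropped, which mirrors the filtering role the abstract assigns to the caption check. A production system would likely use weighted features rather than raw counts; that choice is left open here.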
News images, online clustering, image tags, news image corpus
H. Samet and J. Sankaranarayanan, "Images in News," 2010 20th International Conference on Pattern Recognition (ICPR 2010), Istanbul, 2010, pp. 3240-3243.