Issue No.01 - Jan. (2014 vol.26)
pp: 221-234
Olivier Van Laere, Ghent University, Belgium
Jonathan Quinn, Cardiff University, UK
Steven Schockaert, Cardiff University, UK
Bart Dhoedt, Ghent University, Belgium
ABSTRACT
The task of assigning geographic coordinates to textual resources plays an increasingly central role in geographic information retrieval. The ability to select those terms from a given collection that are most indicative of geographic location is of key importance in successfully addressing this task. However, this process of selecting spatially relevant terms is at present not well understood, and the majority of current systems are based on standard term selection techniques, such as $\chi^2$ or information gain, and thus fail to exploit the spatial nature of the domain. In this paper, we propose two classes of term selection techniques based on standard geostatistical methods. First, to implement the idea of spatial smoothing of term occurrences, we investigate the use of kernel density estimation (KDE) to model each term as a two-dimensional probability distribution over the surface of the Earth. The second class of term selection methods we consider is based on Ripley's K statistic, which measures the deviation of a point set from spatial homogeneity. We provide experimental results which compare these classes of methods against existing baseline techniques on the tasks of assigning coordinates to Flickr photos and to Wikipedia articles, revealing marked improvements in cases where only a relatively small number of terms can be selected.
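As an illustration of the second idea, the following is a minimal sketch of a naive Ripley's K estimate for the occurrences of a single term. It is not the paper's method: it assumes planar coordinates with Euclidean distance (not distances on the Earth's surface), applies no edge correction, and the function names `ripley_k` and `clustering_excess` are hypothetical. Under complete spatial randomness the expected value of K(r) is roughly pi * r^2, so a large positive excess suggests that a term's occurrences are spatially clustered.

```python
import math
from itertools import combinations


def ripley_k(points, radius, area):
    """Naive Ripley's K estimate for planar points, without edge correction.

    points: list of (x, y) tuples in planar units (an assumption of this sketch)
    radius: distance threshold r, in the same units
    area:   area of the study region, in squared units
    """
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    # Count unordered pairs of points lying within `radius` of each other.
    close_pairs = sum(
        1
        for (x1, y1), (x2, y2) in combinations(points, 2)
        if math.hypot(x1 - x2, y1 - y2) <= radius
    )
    # Each unordered pair contributes twice to the ordered-pair sum.
    return area * 2 * close_pairs / (n * (n - 1))


def clustering_excess(points, radius, area):
    """K(r) minus pi * r^2, the value expected under spatial homogeneity."""
    return ripley_k(points, radius, area) - math.pi * radius ** 2


if __name__ == "__main__":
    # Three occurrences packed into one corner of a unit-area region.
    clustered = [(0.0, 0.0), (0.001, 0.0), (0.0, 0.001)]
    print(clustering_excess(clustered, radius=0.01, area=1.0))
```

Ranking terms by such an excess at one or more radii is one plausible way to turn the statistic into a selection score; the abstract does not state which scoring variant the authors use.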
INDEX TERMS
Standards, Encyclopedias, Electronic publishing, Internet, Estimation, Context, feature extraction, Information search and retrieval, knowledge management, artificial intelligence, text mining, metadata, geographic information retrieval, classification, semi-structured data
CITATION
Olivier Van Laere, Jonathan Quinn, Steven Schockaert, Bart Dhoedt, "Spatially Aware Term Selection for Geotagging", IEEE Transactions on Knowledge & Data Engineering, vol. 26, no. 1, pp. 221-234, Jan. 2014, doi:10.1109/TKDE.2013.42