Issue No. 10 - Oct. (2016 vol. 28)
ISSN: 1041-4347
pp: 2793-2807
Yating Zhang, Graduate School of Informatics, Kyoto University, Kyoto, Japan
Adam Jatowt, Graduate School of Informatics, Kyoto University, Kyoto, Japan
Sourav S. Bhowmick, School of Computer Science and Engineering, Nanyang Technological University, Singapore
Katsumi Tanaka, Graduate School of Informatics, Kyoto University, Kyoto, Japan
ABSTRACT
Numerous archives and collections of past documents have become available recently thanks to mass-scale digitization and preservation efforts. Libraries, national archives, and other memory institutions have started opening up their collections to interested users. Yet searching within such collections usually requires knowledge of appropriate keywords, because the context and language of the past differ from those of the present. Thus, non-professional users may have difficulty formulating suitable queries, as their knowledge of the past is typically limited. In this paper, we propose a novel approach for the temporal correspondence detection task, which requires finding terms in the past that are semantically closest to a given present-day input term. The approach we propose is based on a vector space transformation that maps the distributed word representation of the present to that of the past. The key problem in this approach is obtaining a correct training set that can be used for a variety of diverse document collections and arbitrary time periods. To solve this problem, we propose an effective technique for automatically constructing seed pairs of terms to be used for finding the transformation. We test the performance of the proposed approaches over short as well as long time frames, such as 100 years. Our experiments demonstrate that the proposed methods outperform the best-performing baseline by 113 percent for the New York Times Annotated Corpus and by 28 percent for the Times Archive in MRR on average, when the query has a different literal form from its temporal counterpart.
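The idea of mapping a present-day word vector into the past vector space can be illustrated with a minimal sketch. It assumes the transformation is a single matrix fitted by ordinary least squares on seed pairs, and that candidates are ranked by cosine similarity; the abstract does not state the exact objective or similarity measure, so both are assumptions. The function names, the synthetic vectors, and the labels such as `past_term_5` are hypothetical and stand in for real embeddings. A small MRR helper is included because MRR is the metric the abstract reports.

```python
import numpy as np


def fit_transformation(present_seed, past_seed):
    """Fit M minimizing ||present_seed @ M - past_seed|| (least squares; an assumption)."""
    M, *_ = np.linalg.lstsq(present_seed, past_seed, rcond=None)
    return M


def nearest_past_terms(query_vec, M, past_vocab, past_matrix, k=3):
    """Map a present-day vector into the past space and rank past terms by cosine similarity."""
    mapped = query_vec @ M
    norms = np.linalg.norm(past_matrix, axis=1) * np.linalg.norm(mapped) + 1e-12
    sims = (past_matrix @ mapped) / norms
    order = np.argsort(-sims)[:k]
    return [past_vocab[i] for i in order]


def mean_reciprocal_rank(ranked_lists, gold_answers):
    """Average of 1/rank of the correct answer; 0 when it is missing from the list."""
    total = 0.0
    for ranked, answer in zip(ranked_lists, gold_answers):
        total += 1.0 / (ranked.index(answer) + 1) if answer in ranked else 0.0
    return total / len(gold_answers)


# Synthetic toy data (hypothetical): the "past" space is a rotation of the "present" space.
rng = np.random.default_rng(0)
present = rng.normal(size=(6, 4))                 # 6 terms, 4-dimensional vectors
rotation, _ = np.linalg.qr(rng.normal(size=(4, 4)))
past = present @ rotation
past_vocab = [f"past_term_{i}" for i in range(6)]

# Use the first five terms as seed pairs; hold out the sixth as the query.
M = fit_transformation(present[:5], past[:5])
top_terms = nearest_past_terms(present[5], M, past_vocab, past, k=3)
print(top_terms)
```

In this toy setting the seed pairs determine the rotation exactly, so the held-out query maps onto its own counterpart. With real corpora the fit is only approximate, which is why the choice of seed pairs matters.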
INDEX TERMS
Context, Semantics, Buildings, Training, Informatics, Libraries, Portable media players
CITATION

Y. Zhang, A. Jatowt, S. S. Bhowmick and K. Tanaka, "The Past is Not a Foreign Country: Detecting Semantically Similar Terms across Time," in IEEE Transactions on Knowledge & Data Engineering, vol. 28, no. 10, pp. 2793-2807, 2016.
doi:10.1109/TKDE.2016.2591008