LSIF: A System for Large-Scale Information Flow Detection Based on Topic-Related Semantic Similarity Measurement
2015 IEEE / WIC / ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT) (2015)
Dec. 6, 2015 to Dec. 9, 2015
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/WI-IAT.2015.2
Information flow detection tracks the dynamics and evolution of information as it spreads across the Web over time. The main challenges are choosing a suitable granularity of information to detect and tracking how one piece of information evolves into another. In addition, doing this efficiently on large-scale data remains an open technical problem. In this paper, we propose a system, LSIF, for large-scale topic-related semantic information flow detection. We view the sentence as the basic information unit. We represent a word or a sentence as a continuous high-dimensional vector, built with the help of word embedding and the Fisher kernel, and use it for semantic similarity measurement. To handle large-scale information efficiently, we propose a dimension reduction framework called Random Reference Reduction (3R). Furthermore, we adopt a novel clustering algorithm to extract memes (a meme is a piece of information together with its variants) and analyze how memes evolve. We demonstrate the effectiveness of our approach on two terabyte-level datasets. The first is a dataset used by previous researchers, on which we conducted a series of experiments to evaluate performance. The results show that our approach is more effective and more efficient than the state-of-the-art methods. The second is a 5-terabyte dataset crawled from 20 Chinese news sites. We visualize the information flow detection results and extract 9 million memes from the Chinese dataset, a process that takes about two days.
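The idea of comparing sentences through continuous vectors can be illustrated with a minimal sketch. It is not the method of the paper: sentence vectors here are plain averages of word vectors, a simple stand-in for the Fisher kernel aggregation the abstract mentions, and the 3R reduction step is left out. The toy embedding table and the function names are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical 3-dimensional toy embeddings; real word embeddings are
# learned from a corpus and have far more dimensions.
TOY_EMBEDDINGS = {
    "stocks": [0.9, 0.1, 0.0],
    "shares": [0.8, 0.2, 0.0],
    "fell": [0.1, 0.9, 0.0],
    "dropped": [0.2, 0.8, 0.0],
    "rain": [0.0, 0.1, 0.9],
    "today": [0.1, 0.1, 0.1],
}


def sentence_vector(sentence, embeddings):
    """Average the embeddings of the known words in a sentence.

    Assumption: averaging replaces the Fisher kernel aggregation
    named in the abstract; it is only a simple baseline.
    """
    vectors = [embeddings[w] for w in sentence.lower().split() if w in embeddings]
    if not vectors:
        raise ValueError("sentence contains no known words")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def sentence_similarity(s1, s2, embeddings=TOY_EMBEDDINGS):
    """Semantic similarity of two sentences via their averaged vectors."""
    return cosine_similarity(
        sentence_vector(s1, embeddings), sentence_vector(s2, embeddings)
    )


if __name__ == "__main__":
    print(sentence_similarity("stocks fell today", "shares dropped today"))
    print(sentence_similarity("stocks fell today", "rain today"))
```

With these toy vectors, the two paraphrased sentences score higher than the unrelated pair, which is the property a meme clustering step would rely on when grouping a piece of information with its variants.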
Semantics, Kernel, Training, Context, Data models, Binary codes, Clustering algorithms
M. Zhao, H. Wang, L. Cao, C. Zhang, H. Yin and F. Xu, "LSIF: A System for Large-Scale Information Flow Detection Based on Topic-Related Semantic Similarity Measurement," 2015 IEEE / WIC / ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), Singapore, Singapore, 2015, pp. 417-424.