2016 IEEE Second International Conference on Big Data Computing Service and Applications (BigDataService) (2016)
Oxford, United Kingdom
March 29, 2016 to April 1, 2016
Political event data have been widely used to study international politics. Previously, natural language text processing and event generation required considerable human effort. Today, high-performance computing infrastructure and advanced NLP metadata can relieve much of that tedious effort. TABARI, an open-source, non-distributed event-coding software package, was an early effort to generate events from a large corpus. It uses a shallow parser to identify political actors but ignores semantics and the relations among sentences. PETRARCH, the successor of TABARI, encodes event data into a "who-did-what-to-whom" format. It uses Stanford CoreNLP to parse sentences and a static CAMEO dictionary to encode the data. To build dynamic dictionaries, we need to analyze more metadata from the parsed sentences, such as tokens, Named Entity Recognition (NER) tags, co-reference, and more. Although these tools can code modest amounts of source corpora into event data, they are too slow and suffer from scalability issues when we try to extract metadata from a single document. The situation gets worse for other languages such as Spanish or Arabic. In this paper, we develop a novel distributed framework using Apache Spark, MongoDB, Stanford CoreNLP, and PETRARCH. The framework implements a distributed workflow that uses Stanford CoreNLP to extract all the metadata (parse tree, tokens, lemma, etc.) from the news corpora of the Gigaword dataset and stores it in MongoDB. It then uses PETRARCH to encode events from the metadata. The framework integrates both tools on distributed commodity hardware and reduces text processing time substantially compared with a non-distributed architecture. We have chosen Spark over traditional distributed frameworks such as MapReduce, Storm, and Mahout because Spark offers in-memory computation and lower processing time in both batch and stream processing than these alternatives.
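The two-stage workflow described in the abstract (extract metadata from each document in parallel, store it, then encode "who-did-what-to-whom" events from the stored metadata) can be illustrated with a minimal sketch. This is not the authors' implementation: a thread pool stands in for Apache Spark, a whitespace tokenizer stands in for Stanford CoreNLP, a plain dictionary stands in for MongoDB, and a toy lookup stands in for PETRARCH. The `ACTORS` and `VERBS` tables, the event labels, and all function names are hypothetical and are not real CAMEO entries.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy dictionaries; real actor and verb dictionaries are far larger.
ACTORS = {"france": "FRA", "germany": "DEU", "egypt": "EGY"}
VERBS = {"met": "MEET", "criticized": "CRITICIZE", "attacked": "ATTACK"}


def extract_metadata(doc):
    """Stage 1 stand-in for parsing: whitespace tokenization only."""
    doc_id, text = doc
    tokens = [t.strip(".,").lower() for t in text.split()]
    return {"_id": doc_id, "tokens": [t for t in tokens if t]}


def encode_event(record):
    """Stage 2 stand-in for event coding.

    Takes the first actor as the source, the first known verb after it
    as the action, and the next actor after that verb as the target.
    """
    source = action = target = None
    for tok in record["tokens"]:
        if tok in ACTORS:
            if source is None:
                source = ACTORS[tok]
            elif action is not None and target is None:
                target = ACTORS[tok]
        elif tok in VERBS and source is not None and action is None:
            action = VERBS[tok]
    if source and action and target:
        return (record["_id"], source, action, target)
    return None


def run_pipeline(docs, workers=4):
    """Extract metadata in parallel, store it, then encode events."""
    store = {}  # stand-in for a document store keyed by document id
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for record in pool.map(extract_metadata, docs):
            store[record["_id"]] = record
    events = [encode_event(r) for r in store.values()]
    return [e for e in events if e is not None]


docs = [
    ("d1", "France criticized Germany."),
    ("d2", "Egypt met France on Monday."),
    ("d3", "Markets were quiet."),
]
events = run_pipeline(docs)
print(events)
```

Keeping the stored metadata separate from the encoding step mirrors the abstract's point that the extracted metadata can be reused, for example to build dynamic dictionaries, without parsing the corpus again.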
Sparks, Encoding, Dictionaries, Metadata, Engines, Software, Data mining
M. Solaimani, R. Gopalan, L. Khan, P. T. Brandt and B. Thuraisingham, "Spark-Based Political Event Coding," 2016 IEEE Second International Conference on Big Data Computing Service and Applications (BigDataService), Oxford, United Kingdom, 2016, pp. 14-23.