2015 IEEE International Conference on Data Science and Data Intensive Systems (DSDIS) (2015)
Dec. 11, 2015 to Dec. 13, 2015
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/DSDIS.2015.30
This paper presents a case study of discovering and classifying verbs in large web corpora. Many tasks in natural language processing require corpora containing billions of words, and at such volumes of data co-occurrence extraction becomes one of the performance bottlenecks in the Vector Space Models of computational linguistics. We propose a co-occurrence extraction kernel based on ternary trees as an alternative (or a complementary stage) to the conventional MapReduce-based approach. This kernel achieves an order-of-magnitude improvement in memory footprint and processing speed. Our classifier successfully and efficiently identified verbs in a 1.2-billion-word untagged corpus of Russian fiction and distinguished between their two aspectual classes. The model proved efficient even for low-frequency vocabulary, including nonce verbs and neologisms.
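The abstract does not describe the kernel's internals, so the following is only a minimal sketch of the general idea: a generic ternary search tree used as a counter for (word, context) pairs within a sliding window. The names `TernaryCounter` and `count_cooccurrences`, the tab-joined key, and the symmetric window are illustrative assumptions, not the authors' implementation.

```python
from typing import Iterable, Optional


class _Node:
    """One character of a key, with three children (lower, equal, higher)."""
    __slots__ = ("ch", "lo", "eq", "hi", "count")

    def __init__(self, ch: str) -> None:
        self.ch = ch
        self.lo: Optional["_Node"] = None
        self.eq: Optional["_Node"] = None
        self.hi: Optional["_Node"] = None
        self.count = 0  # number of times a key ending at this node was added


class TernaryCounter:
    """Hypothetical string counter backed by a ternary search tree."""

    def __init__(self) -> None:
        self.root: Optional[_Node] = None

    def add(self, key: str, n: int = 1) -> None:
        if not key:
            raise ValueError("key must be non-empty")
        if self.root is None:
            self.root = _Node(key[0])
        node, i = self.root, 0
        while True:
            ch = key[i]
            if ch < node.ch:
                if node.lo is None:
                    node.lo = _Node(ch)
                node = node.lo
            elif ch > node.ch:
                if node.hi is None:
                    node.hi = _Node(ch)
                node = node.hi
            else:
                i += 1
                if i == len(key):
                    node.count += n
                    return
                if node.eq is None:
                    node.eq = _Node(key[i])
                node = node.eq

    def get(self, key: str) -> int:
        node, i = self.root, 0
        while node is not None and key:
            ch = key[i]
            if ch < node.ch:
                node = node.lo
            elif ch > node.ch:
                node = node.hi
            else:
                i += 1
                if i == len(key):
                    return node.count
                node = node.eq
        return 0


def count_cooccurrences(tokens: Iterable[str], window: int = 2) -> TernaryCounter:
    """Count each (word, context) pair within `window` positions on either side.

    Assumption: tokens contain no tab characters, so "word\\tcontext" is an
    unambiguous key.
    """
    toks = list(tokens)
    counts = TernaryCounter()
    for i, word in enumerate(toks):
        start = max(0, i - window)
        stop = min(len(toks), i + window + 1)
        for j in range(start, stop):
            if j != i:
                counts.add(word + "\t" + toks[j])
    return counts


counts = count_cooccurrences("a b a c".split(), window=1)
print(counts.get("a\tb"))
```

Keys that share a prefix share nodes, which is why storing pairs as `word + "\t" + context` keeps all contexts of one word under a common path. How much memory this saves relative to a hash map depends on the vocabulary and is not something this sketch measures.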
Context, Pragmatics, Semantics, Syntactics, Internet, Electronic mail, Data models
A. Drozd, A. Gladkova and S. Matsuoka, "Discovering Aspectual Classes of Russian Verbs in Untagged Large Corpora," 2015 IEEE International Conference on Data Science and Data Intensive Systems (DSDIS), Sydney, Australia, 2015, pp. 61-68.