2015 IEEE International Conference on Data Science and Data Intensive Systems (DSDIS) (2015)
Sydney, Australia
Dec. 11, 2015 to Dec. 13, 2015
ISBN: 978-1-5090-0214-6
pp: 61-68
ABSTRACT
This paper presents a case study of discovering and classifying verbs in large web corpora. Many tasks in natural language processing require corpora containing billions of words, and at such volumes co-occurrence extraction becomes one of the performance bottlenecks in the Vector Space Models of computational linguistics. We propose a co-occurrence extraction kernel based on ternary trees as an alternative (or a complementary stage) to the conventional map-reduce based approach; this kernel achieves an order-of-magnitude improvement in memory footprint and processing speed. Our classifier successfully and efficiently identified verbs in a 1.2-billion-word untagged corpus of Russian fiction and distinguished between their two aspectual classes. The model proved efficient even for low-frequency vocabulary, including nonce verbs and neologisms.
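The abstract does not describe the internals of the kernel, so the following is only a minimal sketch of the general idea of counting co-occurrences in a ternary search tree. The names `TernaryCounter` and `count_cooccurrences`, the symmetric context window, and the tab-joined "word, context" key are all assumptions made for illustration, not the authors' design.

```python
class Node:
    """One character of a stored key, with three children (lo / eq / hi)."""
    __slots__ = ("ch", "lo", "eq", "hi", "count")

    def __init__(self, ch):
        self.ch = ch
        self.lo = self.eq = self.hi = None
        self.count = 0  # number of times a key ending at this node was added


class TernaryCounter:
    """Hypothetical ternary search tree mapping string keys to counts."""

    def __init__(self):
        self.root = None

    def add(self, key, n=1):
        if not key:
            raise ValueError("key must be non-empty")
        if self.root is None:
            self.root = Node(key[0])
        node, i = self.root, 0
        while True:
            ch = key[i]
            if ch < node.ch:
                if node.lo is None:
                    node.lo = Node(ch)
                node = node.lo
            elif ch > node.ch:
                if node.hi is None:
                    node.hi = Node(ch)
                node = node.hi
            else:
                i += 1
                if i == len(key):
                    node.count += n
                    return
                if node.eq is None:
                    node.eq = Node(key[i])
                node = node.eq

    def get(self, key):
        node, i = self.root, 0
        while node is not None and key:
            ch = key[i]
            if ch < node.ch:
                node = node.lo
            elif ch > node.ch:
                node = node.hi
            else:
                i += 1
                if i == len(key):
                    return node.count
                node = node.eq
        return 0


def count_cooccurrences(tokens, window=2):
    """Count (word, context) pairs within a symmetric window (assumed setup)."""
    counts = TernaryCounter()
    for i, word in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                # Assumed key layout: target word, a tab, then the context word.
                counts.add(word + "\t" + tokens[j])
    return counts
```

In a structure like this, keys that share a prefix share nodes, which is one plausible source of memory savings over a hash map of whole strings. The figures reported in the abstract refer to the authors' own kernel, not to this sketch.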
INDEX TERMS
Context, Pragmatics, Semantics, Syntactics, Internet, Electronic mail, Data models
CITATION

A. Drozd, A. Gladkova and S. Matsuoka, "Discovering Aspectual Classes of Russian Verbs in Untagged Large Corpora," 2015 IEEE International Conference on Data Science and Data Intensive Systems (DSDIS), Sydney, Australia, 2015, pp. 61-68.
doi:10.1109/DSDIS.2015.30