Corpus Linguistics Keeps Up-to-Date with Language
by George Lawton
Language is a living thing that changes regularly. New words are added, and others become archaic. Words take on new meanings, and some old uses fall out of fashion.
This can leave traditional dictionaries years behind in terms of current language usage, said Erin McKean, founder of Wordnik, a corpus-linguistic database.
Creating a traditional dictionary is a huge project, with linguistic experts and others studying the language to determine which additions and changes are appropriate. Thus, new dictionaries can appear only every few years.
For more up-to-date information, a growing number of researchers and organizations are turning to corpus linguistic (CL) techniques to analyze large quantities of text and speech and automatically generate language databases. These databases provide information about words' various meanings, usages, and contexts.
CL applications send bots to automatically search news feeds, blogs, and other sources on the Internet — as well as previously curated and tagged bodies of text — for new instances of language usage.
CL applications also provide more metadata than natural-language-processing (NLP) techniques, as well as information about the sentiments associated with various word usages and the contexts in which words have certain meanings.
Researchers are beginning to use CL tools to improve dictionaries, language learning, and the sorting of large document sets based on various criteria.
In the future, those tools could enable more accurate speech recognition and speech synthesis, said University of Georgia professor William Kretzschmar.
Computer processing and storage advances are enabling systems to analyze and index ever larger datasets, noted University of Pennsylvania linguistics professor Mark Liberman.
Categorizing the data and its associated metadata enables CL systems to find more precise relationships among words, which leads to more accurate document processing and speech recognition.
Linguistic Databases in Detail
CL emerged from British linguistics professor John Firth's research in the first half of the 20th century.
This work led to the release of the one-million-word Brown University Standard Corpus of Present-Day American English in the 1960s.
Interest in CL techniques was minimal for years because there wasn’t enough computational power and storage available to maximize their effectiveness.
Interest increased as processing and storage capabilities grew, leading to the Collins Birmingham University International Language Database, a University of Birmingham research facility established in 1980 and funded by William Collins & Sons publishers. The facility released the English COBUILD Dictionary in 1987.
The BNC Consortium, an industrial/academic organization led by the Oxford University Press, completed the 100-million-word British National Corpus in 1994.
In 2008, Google produced a linguistic dataset from Google Books with 155 billion words but relatively little metadata, noted the University of Georgia's Kretzschmar.
Technology
CL applications use bots that aggregate online material such as news feeds, transcripts of spoken presentations, academic journals, and books.
They also use data-mining tools, such as Oxford University Press's WordSmith Tools or the open source AntConc, that search through curated material, scan textual datasets, and categorize word usage across multiple domains, cultures, and individuals.
A linguist can curate content by deciding which sources to sample, a process that could involve manually collecting material or sending bots to specific websites. Curators must select useful sets of sources to ensure that their CL databases are effective.
After they categorize words, CL programs add relevant metadata and then organize the collection into a linguistic database.
In some cases, the programs generate indexes that produce faster responses to queries but that could also increase the amount of material in a CL database tenfold, noted Brigham Young University (BYU) linguistics professor Mark Davies.
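The trade-off between query speed and database size can be illustrated with a small inverted index. This is only a minimal sketch, not the indexing scheme of any system named in this article: the two sample documents, their IDs, and the function names `tokenize` and `build_index` are invented for illustration.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase the text, split on whitespace, and strip edge punctuation."""
    tokens = []
    for raw in text.lower().split():
        word = raw.strip(".,;:!?\"'()")
        if word:
            tokens.append(word)
    return tokens


def build_index(documents):
    """Map each word to a list of (document id, token position) pairs."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for position, word in enumerate(tokenize(text)):
            index[word].append((doc_id, position))
    return index


# Hypothetical two-document corpus, invented for this example.
documents = {
    "news-1": "The bank raised interest rates.",
    "blog-7": "We walked along the river bank.",
}
index = build_index(documents)

# A lookup is one dictionary access rather than a scan of every document.
print(index["bank"])
```

Because the index stores one entry for every token occurrence, it grows along with the corpus, which gives a rough sense of why indexing can add substantially to a database's size.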
CL techniques determine definitions by analyzing words near, as well as next to, the target words in text. This differs from NLP, which generally analyzes only words adjacent to the targets.
More processing power is thus required to assemble and index a CL database.
However, said the University of Georgia's Kretzschmar, this approach makes it easier to find relevant definitions and improves text processing, speech recognition, and speech generation.
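The difference between looking only at adjacent words and looking at nearby words can be sketched as a context-window count. The example below is an assumption-laden illustration, not a description of any particular CL or NLP product: the sentence, the function name `collocates`, and the window sizes of 1 and 4 are all chosen arbitrarily.

```python
from collections import Counter


def collocates(tokens, target, window):
    """Count the words found within `window` positions of each `target`."""
    counts = Counter()
    for i, word in enumerate(tokens):
        if word != target:
            continue
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # skip the target word itself
                counts[tokens[j]] += 1
    return counts


# Hypothetical sentence, invented for this example.
tokens = "she sat on the grassy bank of the river and watched the water".split()

adjacent = collocates(tokens, "bank", window=1)  # immediate neighbors only
nearby = collocates(tokens, "bank", window=4)    # a wider context window

print(sorted(adjacent))
print(sorted(nearby))
```

In this sentence, only the wider window picks up "river", the word that signals which sense of "bank" is meant. The wider window also examines more tokens per occurrence, which is consistent with the added processing cost described above.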
Wordnik
The Wordnik database combines CL and NLP techniques to generate a dictionary with about 6 million fully indexed and 50 million partially indexed English words.
This compares with about 600,000 words in the Oxford English Dictionary, one of the largest traditional English dictionaries.
To update its database, Wordnik uses its own technology that analyzes multiple news feeds and data gleaned from various news, cultural, and scientific sources.
The technology processes the data into word graphs stored in a MongoDB nonrelational database. These graphs present visualizations of the relationships among collections of words and thus are similar to social-network graphs, Wordnik’s McKean explained. They let users more easily look for and identify associations among words.
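A word graph of this general kind can be approximated with a simple co-occurrence structure. The sketch below is not Wordnik's technology and does not use MongoDB; it assumes, purely for illustration, that two words are linked whenever they appear in the same sentence and that the edge weight is the number of such sentences. The sample sentences and the names `build_word_graph` and `neighbors` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_word_graph(sentences):
    """Link every pair of distinct words that share a sentence.

    The weight of an edge is the number of sentences containing both words.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        words = sorted(set(sentence.lower().split()))
        for a, b in combinations(words, 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return graph


def neighbors(graph, word, top=3):
    """Return up to `top` neighbors, heaviest edge first, ties alphabetical."""
    ranked = sorted(graph[word].items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top]


# Hypothetical sentences, invented for this example.
sentences = [
    "strong coffee",
    "strong tea",
    "hot coffee",
    "strong hot coffee",
]
graph = build_word_graph(sentences)

print(neighbors(graph, "strong"))
```

As in a social-network graph, a word's most heavily weighted neighbors are its closest associates, so a query for "strong" here ranks "coffee" ahead of "hot" and "tea".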
The company provides an API that lets organizations and individuals design programs that access the Wordnik database for various purposes, such as researching word usage and providing websites with a better dictionary for their content.
Corpus of Contemporary American English, 1990–2011
BYU's Mark Davies and his team have amassed the Corpus of Contemporary American English, 1990–2011, the world's largest open source CL database.
COCA has 425 million words and adds 25 million new entries annually. It is not as large as some commercial CL databases, such as one that Ebsco Publishing has developed.
COCA's database consists of five categories with 85 million words each: fiction, academic, popular culture, news, and spoken English.
Metadata about entries' literary genre, context, and era of popularity help identify word-usage trends.
COCA also lets users see words that were located near a new entry in textual sources, to help them identify how the new word is utilized in context.
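Showing a word together with its neighboring words is commonly done with a keyword-in-context (KWIC) listing. The following is a minimal sketch of that general technique, not COCA's implementation or interface: the two records, the `genre` and `text` field names, and the function name `kwic` are assumptions made for illustration.

```python
def kwic(records, target, width=3):
    """Return (genre, left context, keyword, right context) for each hit."""
    lines = []
    for record in records:
        tokens = record["text"].lower().split()
        for i, word in enumerate(tokens):
            if word == target:
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                lines.append((record["genre"], left, word, right))
    return lines


# Hypothetical records with genre metadata, invented for this example.
records = [
    {"genre": "news", "text": "Officials will tweet the results tonight"},
    {"genre": "fiction", "text": "She heard a bird tweet outside the window"},
]

for genre, left, word, right in kwic(records, "tweet", width=2):
    print(f"{genre:8} {left:>16} [{word}] {right}")
```

Carrying the genre label alongside each hit lets a reader compare how the same word is used in different kinds of text.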
According to Davies, COCA is used primarily for teaching English; conducting academic linguistics research; and improving human-computer interaction by, for example, enabling better speech recognition and more realistic text-to-speech synthesis.
Most of COCA's data is available for free at www.wordfrequency.info. A richer set of information with more metadata is also available for commercial licensing at www.wordfrequency.info/purchase.asp.
Talking about the Future
A hurdle to widespread CL-database adoption is that the technology can require considerable manual effort to curate content and tag information with metadata, Kretzschmar said.
Indexing CL-database content is also a challenge. Research is thus focused on developing better algorithms and parallel-processing techniques for indexing.
The University of Pennsylvania's Liberman said better techniques are also needed for determining the sentiment a word reflects and how it is used in different contexts, as well as for word disambiguation and for associating individual words with their multiple meanings in different settings.
The University of Georgia's Kretzschmar predicted that CL will enable many types of speech-recognition and speech-synthesis applications because the metadata the technology provides will allow greater linguistic precision.
He also said companies will use CL techniques to scan and then sort documents for compliance with governmental record-storage, auditing, and reporting requirements.
According to BYU's Davies, CL's Holy Grail lies in analyzing large datasets of informal language used on social-networking sites such as Facebook and in everyday users' day-to-day online communications.
However, he said, it will be at least 15 years before the available computing horsepower, storage capacity, and analysis techniques enable this.