IBM Crowdsources Translation Software
by George Lawton
IBM researchers have created n.Fluent, software that translates text between English and 11 other languages. Now they’re improving it by enrolling IBM's bilingual employees in crowdsourcing, a technique in which a large group of participants independently make small contributions to a larger project.
"We started with this vision we could leverage the IBM multilanguage work force," said David Lubensky, computer scientist at IBM's T.J. Watson Research Labs. "We have 400,000 employees in 170 countries to help us with customizing our technologies."
n.Fluent works as a plug-in or add-on to other applications, such as email or instant messaging. It provides secure, real-time translation of Web pages, electronic documents such as PDF files, and instant message chats. It offers a BlackBerry mobile-translation application and currently works with Arabic, Chinese, English, French, German, Italian, Japanese, Korean, Portuguese, Russian, and Spanish.
A core team of about 100 employees has been developing n.Fluent over the past four years. The software went live for internal IBM use in August 2008. Since then, about 3,000 employee volunteers have collectively contributed more than 36 million words to extend and improve it. To encourage further participation and raise awareness of the project, IBM held its first crowdsourcing event last summer.
No specific plans have been announced for a commercial product or service. "For right now, we're focusing on building and perfecting the tool," said Ari Fishkind, an IBM spokesman.
Salim Roukos, computer science researcher at IBM's T.J. Watson Labs, believes this sort of technology could play a big role in localizing global operations. Companies currently spend about $13 billion a year to translate documentation, which is all done using human labor. With n.Fluent, companies could automate the first translation and then let humans focus on correcting any mistakes.
Statistical Machine Translation
Other automated translation techniques use a rule-based approach, in which a linguist designs a set of rules for translation. With statistical machine translation, the software learns by comparing the same text in different languages. Roukos said IBM pioneered statistical machine translation in the 1990s, and it’s been the main automated approach over the past 3 to 5 years.
The n.Fluent developers used United Nations proceedings as the early training set. The UN proceedings are translated into six languages, so they offer a good basis for building statistical models. The machine-learning process also benefits from human corrections to the translations in this corpus of parallel content.
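To make the idea of learning from parallel text concrete, here is a minimal sketch of word-level statistical training in the style of the classic IBM Model 1 algorithm. It is an illustration only, not n.Fluent's actual method, which the article does not describe. The three-sentence German-English corpus, the function names `train_model1` and `best_target`, and the omission of the usual NULL word are all assumptions made for brevity.

```python
from collections import defaultdict


def train_model1(pairs, iterations=10):
    """Estimate t[(f, e)]: probability of source word f given target word e.

    `pairs` is a list of (source_words, target_words) sentence pairs.
    Uses expectation-maximization as in IBM Model 1, without a NULL word.
    """
    src_vocab = {f for src, _ in pairs for f in src}
    # Start from a uniform table: every pairing is equally likely.
    t = defaultdict(lambda: 1.0 / len(src_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of f aligned to e
        total = defaultdict(float)  # expected count of e aligned to anything
        for src, tgt in pairs:
            for f in src:
                # E-step: share one unit of credit for f across the target words.
                norm = sum(t[(f, e)] for e in tgt)
                for e in tgt:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: re-estimate the table from the expected counts.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t


def best_target(table, f, candidates):
    """Pick the candidate target word that best explains source word f."""
    return max(candidates, key=lambda e: table.get((f, e), 0.0))


# Toy parallel corpus (made up for illustration).
corpus = [
    (["das", "haus"], ["the", "house"]),
    (["das", "buch"], ["the", "book"]),
    (["ein", "buch"], ["a", "book"]),
]
table = train_model1(corpus)
print(best_target(table, "haus", ["the", "house"]))
```

No linguist writes a rule here: the pairing of "haus" with "house" emerges only because "das" and "the" also co-occur in a second sentence, which pulls probability away from the competing alignments.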
Philip Resnik, associate professor of computer science at the University of Maryland, has been researching crowdsourcing techniques for machine translation. Resnik said statistical techniques made a revolutionary leap by turning a labor-intensive, expert-driven development process into a machine-learning problem.
"The biggest problem in translation is not the failure to find a way to translate something," Resnik explained. "It's finding a good way when too many possibilities present themselves. The space of possible translations is combinatorially huge; statistical methods provide ways to navigate that space in order to find your way to good translations."
Crowdsourcing
Last summer, IBM launched its first two-week translation challenge event to recruit bilingual employees to donate their expertise. The company awarded points to individual translators and converted the points to dollars that were donated to one of seven charities on the employee's behalf.
The company recently launched its second challenge, which is approaching 2 million words, compared with 1.3 million in the previous challenge. "The first challenge was experimental," Lubensky said. "We didn't know what to expect. It took a lot of oiling the machinery to get the word out."
In the current challenge, the researchers identified community leaders for each language to help recruit more participants and increase motivation. "These events are one of the key ways to get a community of employees involved," said Lubensky.
One challenge of public crowdsourcing applications is keeping anonymous users from polluting the system, through either carelessness or malevolence. Roukos said his research team managed this problem by having employees sign in using their company credentials. This improves the data quality, although the results must still be checked and monitored to catch ordinary human errors.
Resnik said that IBM's use of crowdsourcing is innovative in its focus not on end-to-end translation for human consumption but on feedback for the translation system. "This is a nice approach," he said, "because instead of just training your system on whatever data happen to be available, you can seek human corrections specifically for the kinds of data that give your system the most difficulty."
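The selection step Resnik describes can be sketched in a few lines. The example below assumes, hypothetically, that the translation system attaches a confidence score between 0 and 1 to each output; the article does not say how IBM identifies difficult data, and the `Candidate` type, the `pick_for_review` function, and the sample sentences and scores are invented for illustration.

```python
import heapq
from typing import NamedTuple


class Candidate(NamedTuple):
    source: str
    machine_translation: str
    confidence: float  # hypothetical score in [0, 1]; higher means more certain


def pick_for_review(candidates, budget):
    """Return the `budget` least-confident translations, least confident first."""
    return heapq.nsmallest(budget, candidates, key=lambda c: c.confidence)


# Made-up queue of machine translations awaiting human review.
queue = [
    Candidate("Guten Morgen", "Good morning", 0.97),
    Candidate("Das ist mir Wurst", "That is sausage to me", 0.31),
    Candidate("Bitte warten", "Please wait", 0.88),
    Candidate("Er hat einen Kater", "He has a tomcat", 0.42),
]
for item in pick_for_review(queue, budget=2):
    print(item.source, "->", item.machine_translation)
```

Routing only the lowest-scoring outputs to volunteers spends limited human effort where the system is weakest, rather than on sentences it already handles well.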
One limitation of crowdsourcing today is that only bilingual participants can make improvements. In many situations, the volunteer base that speaks only one language will be far larger. For example, Wikipedia has fewer than 800 translators, but it has 75,000 volunteers actively contributing to its content. To address this gap, Resnik is working with Ben Bederson, associate professor of computer science at the University of Maryland, to develop a framework for human-machine interaction that pairs volunteers, one of whom knows only the source language and the other, only the target language.
George Lawton is a freelance technology writer based in Monte Rio, California. Contact him at glawton@glawton.com.