In the News


Pages: pp. 5-9

Abstract—Researchers are working on a technique that enables a humanoid robot to learn basic language skills by communicating directly with people. Google and Stanford University scientists have developed a large, cloud-based, deep-learning neural network suitable for complex tasks such as object recognition and machine translation. Scientists have developed an approach designed to let robots work in concert with humans in factories and other settings.

Keywords—robots, robotics, training, MIT, Julie Shah, machine learning, machine vision, networking, deep learning, neural network, object recognition, Google, Stanford University, cloud computing, robot, speech, language, corpus linguistics, University of Hertfordshire, DeeChee, Caroline Lyon, ITALK, iCub, RobotCub Consortium

Researchers Teach Robots to Speak


Scientists are working on a humanoid robot that develops basic language skills by communicating directly with people.

The University of Hertfordshire researchers are using corpus linguistics—the learning of language based on real-world samples—and direct verbal feedback from humans to teach robots to speak. This occurs in the same way a child learns to talk by listening to others.

So far, the DeeChee robot has learned 24 simple spoken words describing concepts such as colors, shapes, and sizes. However, the robot does not understand the meaning of the words yet.

The experiments demonstrate the viability of teaching robots speech via interaction, rather than by feeding them pre-existing language models, as has traditionally been the case, explains Caroline Lyon, a visiting research fellow with the University of Hertfordshire's Adaptive Systems Research Group.

The research could also provide insight into the human language-learning process, she adds.


Lyon has conducted her research as part of the European Union's Integration and Transfer of Action and Language Knowledge in Robots (Italk) project. Italk seeks to develop agents embodied in robots that could use interactive human feedback to learn behavioral, cognitive, and linguistic skills.

DeeChee is based on the open source iCub robot, which the European Commission–funded RobotCub Consortium of academic and industry researchers designed for AI and robotics research.

According to Lyon, humans have one set of brain pathways that relates sounds to meaning and another that relates sounds to sound patterns and word forms. Her work—which focused on preliminary word-form acquisition—implemented the latter sort of pathway in a robotic system. This approach relates to research that has found a connection between an infant hearing sounds and then repeating them as a facet of learning language.

The University of Hertfordshire project could yield numerous benefits in areas such as robotic speech, language learning, and human-robot interaction.

The Experiment

The researchers used simple machine-learning techniques to implement the basic language-acquisition algorithm. The approach was based on learning via positive reinforcement and the way that children pick up and repeat the spoken sounds they hear.

In the experiments (see Figure 1), participants sat in a room with DeeChee for an eight-minute training session. Lyon says the sessions were short to keep the trainers focused.


Figure 1   University of Hertfordshire researchers are teaching their DeeChee robot to speak by having it communicate directly with people. Humans described objects to the robot using a 24-word lexicon.

During the sessions, participants interacted with objects on a table and described them to the robot using a lexicon of 24 words. The robot recorded the audio and converted it into a stream of phonemes (basic units of sound) using an adapted version of Microsoft Speech API 5.4. Using phonemes enabled the system to parse speech into a form that could be processed and recorded into a frequency table.
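As a rough illustration of the frequency-table step, the following Python sketch counts how often each phoneme appears in a recognized stream. The phoneme symbols and the function name are hypothetical, and the article does not describe the researchers' actual data structures, so this is only one plausible reading.

```python
from collections import Counter


def build_frequency_table(phoneme_stream):
    """Count how often each phoneme appears in a recognized stream."""
    return Counter(phoneme_stream)


# Hypothetical phoneme stream for an utterance like "red box, red".
# The symbols are illustrative, not the recognizer's real output.
stream = ["r", "eh", "d", "b", "aa", "k", "s", "r", "eh", "d"]
table = build_frequency_table(stream)

# Show the three most frequent phonemes heard so far.
print(table.most_common(3))
```

A plain counter keeps the sketch simple; a real system might instead count syllables or short phoneme sequences.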

One of the participants would then show DeeChee an object. The robot would babble random phonemes from the ones it had heard back to the human. When the robot made one of the sounds used in the word that describes the object, the participant responded with approving comments such as "well done," "good," or "clever." The researchers programmed the robot to give higher statistical weight to those sounds in the learning process.
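The babble-and-reward loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the researchers' algorithm: the `Babbler` class, the multiplicative `reward` factor, and the phoneme symbols are all hypothetical.

```python
import random


class Babbler:
    """Babbles phonemes it has heard and favors those a trainer approves."""

    def __init__(self, heard_counts, reward=2.0, seed=None):
        # Start each phoneme's weight at the number of times it was heard.
        self.weights = {p: float(c) for p, c in heard_counts.items()}
        self.reward = reward  # assumed multiplicative boost per approval
        self.rng = random.Random(seed)

    def babble(self):
        """Pick one phoneme at random, in proportion to its current weight."""
        phonemes = list(self.weights)
        weights = [self.weights[p] for p in phonemes]
        return self.rng.choices(phonemes, weights=weights, k=1)[0]

    def reinforce(self, phoneme):
        """Raise a phoneme's weight after approval such as 'well done'."""
        self.weights[phoneme] *= self.reward

    def probability(self, phoneme):
        """Return the current chance of babbling the given phoneme."""
        return self.weights[phoneme] / sum(self.weights.values())


robot = Babbler({"r": 1, "eh": 1, "b": 1, "k": 1}, seed=0)
before = robot.probability("r")
robot.reinforce("r")  # the trainer approved a sound used in "red"
after = robot.probability("r")
print(before, after)
```

Each approval makes the rewarded sound more likely in later babble, which mirrors the higher statistical weight the article mentions.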

Lyon says her project experienced several challenges. For example, the robot's performance varied widely based on the trainer. In addition, some people had difficulty distinguishing the words in the robot's babble and could not give meaningful feedback. Other participants talked too much or gave DeeChee inaccurate feedback, reducing the robot's speech accuracy.

Speaking of Benefits

The University of Hertfordshire researchers say their approach's main benefit is that it could create accurate, easy-to-train robots and computerized speech-interaction systems.

According to Lyon, the goal of her work is to understand more about linguistic communication and the connection between word usage and language learning, as well as human-robot interaction and language-acquisition skills. In addition, the project could advance the creation of robots that act as human companions.

Lyon says further research will try to teach the robots word meanings.

Google Brings Deep Learning to the Cloud


Google and Stanford University researchers have built a huge cloud-based deep-learning neural network that has already accomplished complex object recognition, an important goal for researchers. They developed their network across multiple servers in numerous datacenters that are part of Google's existing cloud infrastructure, showing that large deep-learning systems could be built with commodity hardware.

Big neural networks could accomplish numerous goals, but scaling up such systems has been difficult because, for example, orchestrating the interactions among the many nodes has proven challenging. The Google and Stanford researchers developed several approaches to deal with such problems.

They built a neural network across 1,000 machines with 16,000 processor cores. Most similar research has focused on a single computer with just a few hundred cores. "We wanted to focus on larger models than people had done in the past," says Google Fellow Jeff Dean.

Once the scientists developed their system, they used it for complex object recognition to show how well it works.

The Google/Stanford project could enable powerful neural networks that could undertake deep-learning tasks such as speech recognition and machine translation.

The Promise of Deep Learning

Deep learning promises better ways to work with many types of data for purposes such as identifying images, recognizing speech, and identifying scenes and types of action in a movie.

Deep-learning research began in the early 1990s but was hampered by limited computing power and immature algorithms, says Yann LeCun, New York University professor of computer science and neural science.

Traditionally, neural-network use has required supervised training, in which an operator presents the system with a collection of labeled data to make learning easier. However, most online data, such as images, are not labeled, says Google's Dean.

A key to using neural networks with such data is to build larger systems running better algorithms, which was a goal of the Google/Stanford project. However, scaling neural networks for deep learning has been a significant challenge, says Stanford associate professor Andrew Ng, director of the school's Artificial Intelligence Lab.

One challenge is the volume of traffic that occurs in large deep-learning networks during the update process, Dean explains. Other obstacles, he adds, include creating neural-network applications for large systems, developing an architecture for pushing them out to hundreds of thousands of servers, and harvesting the results.

"Within academia, most researchers were training models with 1 million to 10 million parameters," Ng says. The Google/Stanford scientists, he notes, trained a network with 1 billion parameters.

"As far as anyone has been able to evaluate these models, the bigger a network we can build, the better the performance," he adds. "The fact that we were able to train a massive network was key to making it work well."

The Google/Stanford System

The Google/Stanford system runs on commodity multicore PC servers linked by Google's high-speed Ethernet connections. It uses a job scheduler designed to improve efficiency in cluster systems.

The software infrastructure automatically splits the neural network model across multiple machines to efficiently distribute workloads. The system also uses algorithms to maximize communications within each server to reduce traffic across the network. The researchers deployed nodes likely to communicate on a given problem within a single server, to minimize the problems caused by large amounts of neural-network traffic.
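To see why placing communicating nodes on the same server matters, consider this small Python sketch. It counts how many connections cross machine boundaries under two placements. The graph, node names, and placements are hypothetical, and the article does not describe how Google's infrastructure actually partitions a model.

```python
def cross_machine_edges(edges, placement):
    """Count connections whose two endpoints sit on different machines."""
    return sum(1 for a, b in edges if placement[a] != placement[b])


# Hypothetical connections between four neural-network nodes.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "c")]

# Placement 1 scatters tightly connected nodes across two machines.
scattered = {"a": 0, "b": 1, "c": 0, "d": 1}

# Placement 2 keeps the tightly connected nodes a, b, and c together.
grouped = {"a": 0, "b": 0, "c": 0, "d": 1}

print(cross_machine_edges(edges, scattered))  # 3 connections use the network
print(cross_machine_edges(edges, grouped))    # 1 connection uses the network
```

Fewer cross-machine connections mean less traffic between servers during each update, which is the effect the researchers were after.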

A key goal was to upgrade the neural network's feature extractor, which lets the system evaluate data without labeling or other human assistance, says New York University's LeCun. The Google and Stanford researchers accomplished this by improving the application-provisioning process to enable stronger feature extraction.

The network has three neuron layers, with each refining the work of the previous one. For example, when studying images, the initial layer distinguishes basic features such as edges and corners. The additional layers recognize increasingly complicated structures until the system can pick out objects such as faces or cats, says Dean.

This progressive approach enables a level of learning that neural networks haven't been capable of before.

To test their system, the Google and Stanford researchers used their neural network to automatically create models—which could subsequently run on relatively few servers—of faces, body parts, and cats using unlabeled images, says Stanford's Ng. The system used these models to identify other images.

The scientists eventually created neural networks that classified unlabeled images into 20,000 different categories of objects.

The system recognized a small number of objects with 81 percent accuracy and 20,000 different types with 15.8 percent accuracy. Dean says this was a high level of precision for such a system.
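One way to put the 15.8 percent figure in context is to compare it with blind guessing. The Python sketch below assumes a guesser that picks uniformly among equally likely categories, which is a simplification the article does not state.

```python
def lift_over_chance(accuracy, num_categories):
    """Return how many times better an accuracy is than uniform guessing."""
    chance = 1.0 / num_categories  # assumes all categories are equally likely
    return accuracy / chance


# Figures reported in the article: 15.8 percent accuracy, 20,000 categories.
lift = lift_over_chance(0.158, 20_000)
print(round(lift))
```

Under that assumption, chance accuracy is 0.005 percent, so the reported result is roughly 3,160 times better than guessing.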

Future Work

An increasing number of organizations could adopt the Google/Stanford approach because they could now afford to run neural networks on large numbers of servers, which are more affordable than in the past.

According to Dean, his research team has been working closely with Google's speech, natural language processing, and machine translation teams to use the new approach in projects.

He explains, "We have a continuum of short-term projects that will impact Google [speech-recognition and other] products soon. And in the long run, we are working on more open-ended research problems."

New Technique Lets Robots Work with Humans in Factories


MIT researchers have developed AI-based techniques that promise to let robots learn to work side-by-side with humans in factories, assembly lines, and other settings.

The scientists are testing various training models to teach robots to work with people and adapt to different work styles, says MIT assistant professor Julie Shah, leader of the Interactive Robotics Group in the university's Computer Science and Artificial Intelligence Laboratory.

Robots have worked in factories for almost 50 years, but they are usually large machines that perform dangerous or repetitive work or tasks involving heavy lifting. They generally are located by themselves, frequently in caged-off areas, because their strength, speed of movement, and limited awareness of humans pose a safety hazard.

To remedy this, the MIT scientists are using an approach that lets robots train with workers. Past research in this area has focused largely on the complex programming of robots, rather than having them learn by interacting with people.

Robots in the Factory

The MIT scientists are using prior research that focused on robots receiving positive and negative feedback—in the form of verbal and button-pushing cues that the machines were programmed to recognize—from humans.

Shah says she and her team have also utilized recent US military research indicating that having each participant learn all processes involved in a project is an effective training approach. She says the MIT scientists applied the same principle in their work, with each robot and human partner practicing different parts of a coordinated task as a team.

This gave both the human and the robot a better idea of how to support each other, she explains.


In their experiments, the MIT researchers taught a multipurpose industrial robot to drill and patch holes in concert with a human (see Figure 2). Their approach used machine vision to parse the motions of humans performing chores into individual behaviors.

Figure 2   MIT assistant professor Julie Shah (left) and her research team teach a robot to drill holes, as part of an AI-based approach to letting robots work in concert with humans in factories and other settings.

Machine-learning algorithms then map this data to a decision tree, with each branch representing a part of the task that a worker might undertake.

This process lets the robot recognize the steps required in each task and the order in which the coworker prefers to perform them. It also calculates the human's next likely action.
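The idea of a tree whose branches are task steps can be sketched in Python as a table of observed step prefixes. This is a simplified stand-in rather than the MIT team's algorithm: the `TaskTree` class and the step names are hypothetical.

```python
from collections import Counter, defaultdict


class TaskTree:
    """Records the order in which a coworker performs steps of a task."""

    def __init__(self):
        # Maps each observed prefix of steps to counts of the step that followed.
        self.next_counts = defaultdict(Counter)

    def observe(self, steps):
        """Add one demonstrated sequence of steps to the tree."""
        for i, step in enumerate(steps):
            self.next_counts[tuple(steps[:i])][step] += 1

    def predict_next(self, steps_so_far):
        """Return the most frequently observed next step, or None if unseen."""
        counts = self.next_counts.get(tuple(steps_so_far))
        if not counts:
            return None
        return counts.most_common(1)[0][0]


tree = TaskTree()
# Hypothetical demonstrations: this worker usually drills before patching.
tree.observe(["place", "drill", "patch"])
tree.observe(["place", "drill", "patch"])
tree.observe(["place", "patch", "drill"])

print(tree.predict_next(["place"]))
```

Because each prefix keeps its own counts, the same structure could hold a separate tree per worker to capture individual preferences.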

"On the factory floor, every person does things a little differently," Shah explains. "They work differently, so you cannot train a robot for just one person."

As part of the research, the MIT scientists used a motion-capture system to let the robot accurately and quickly track human coworkers wearing body suits covered in LED lights.

Using this information, machine-vision algorithms recognize when the robot might hit a human. The system then limits the machine's movement to keep this from happening.
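A simple way to picture this kind of movement limiting is a speed scale that drops as the nearest tracked marker gets closer. The Python sketch below is an illustration only: the function name, the distance thresholds, and the linear ramp are assumptions, not details from the MIT system.

```python
import math


def speed_scale(robot_pos, marker_positions, stop_dist=0.3, slow_dist=1.0):
    """Return a factor from 0.0 to 1.0 for scaling the robot's speed.

    Distances are in meters. Both thresholds are assumed values.
    """
    # With no tracked markers, treat the workspace as clear (an assumption).
    nearest = min(
        (math.dist(robot_pos, m) for m in marker_positions),
        default=math.inf,
    )
    if nearest <= stop_dist:
        return 0.0  # too close: stop moving
    if nearest >= slow_dist:
        return 1.0  # far enough away: full speed
    # In between, ramp speed linearly with distance.
    return (nearest - stop_dist) / (slow_dist - stop_dist)


robot = (0.0, 0.0, 0.0)
markers = [(0.65, 0.0, 0.0), (2.0, 1.0, 0.0)]  # hypothetical LED positions
print(speed_scale(robot, markers))
```

A production system would likely also account for velocity and predicted motion rather than position alone.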

Shah notes that having people wearing body suits with lights wouldn't be practical in most real-world work settings.

Virtual Training Environment

The researchers have developed a virtual training environment that is like a first-person 3D computer game.

The interface lets a user navigate a 3D environment on a computer screen and simulate performing different tasks.

Within the environment, the robot can prepare itself by practicing with a virtual assembly-line worker.

Future Work

The MIT researchers next hope to develop a more immersive 3D environment with stereoscopic goggles and gloves that will let people train robots to work with them on particularly intricate tasks such as lifting a heavy object. However, better safety techniques are necessary before robots can be used in these types of advanced interactions.

Rensselaer Polytechnic Institute professor emeritus Steve Derby explains, "A robot is a dumb, blind beast." He says robots can be much more powerful than their specifications for various tasks indicate, which could "lull one into a false sense of security."

For example, he says, a robot specified to have a carrying capacity of five pounds might actually exert 80 pounds of force, which could be dangerous to a worker not expecting so much power.

Finding ways to cope with such issues is necessary before robots can work effectively with people in factories, according to Derby.

Shah predicts that ongoing research by various labs and robot vendors will address safety concerns about mechanized coworkers in the near future, perhaps via better real-time human-tracking capabilities.

"We are fortunate to work with industry partners," she notes, "and our plan is to see this technology on the factory floor in the next three to five years."

Shah says that better robot-human interaction could allow machines to perform mundane chores such as picking up objects or cleaning a worksite. This would let humans work more efficiently by focusing on specialized tasks that would be difficult for robots to learn.
