Issue No.05 - September/October (2005 vol.7)
Published by the IEEE Computer Society
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/MCSE.2005.106
What do elephants and humans have in common? According to Marquette University's Patrick Clemins, individuals in both species have distinct voices, and we can train computers to recognize 85 percent of what they say. Running preidentified elephant sounds into a PC-based hidden Markov model (HMM) algorithm trains that program to then recognize and classify further elephant recordings. In short, combine a herd of six elephants, a digital tape deck, and a basic workstation running an HMM, and you get the start of a great conversation.
Clemins' bioacoustics team at Marquette hooked up with Disney's Animal Kingdom to test this acoustic analysis technology (see the "Dr. Dolittle, I Presume?" sidebar for why they chose elephants). Being able to identify which particular elephant is talking (the speaker model) and which sound or part of elephant speech an elephant is using (the vocalization type) are both important steps toward elephant speech translation. The team's HMM software can uniquely identify individual elephants in a herd by their distinct rumbles with more than 88 percent accuracy. "I don't think the individual speaker-identification task would be possible for humans to perform," says Kirsten Leong, a research associate at the Animal Kingdom.
To identify the vocalization type, the HMM software has a 94 percent success rate in immediately classifying new recordings into categories previously defined by the human researchers. In comparison, when researchers learned to identify vocalization types, "it usually took a couple weeks of intensive listening to tapes before we reached 85 percent," Leong estimates.
The software's accuracy not only bests human efforts for elephant speech, but it's also comparable with the same task—using the same approach and software—done on human speech.
When communicating, elephants often group quietly for several minutes, then emit clusters of rumbles, trumpets, croaks, revs, snorts, and other sounds. In all, the elephants make from one or two to as many as 20 distinct vocalizations in a single minute. Then, they return to silence.
"Elephants almost always 'talk' at the same time," Leong notes. Although understanding "elephantese" will require that we understand social context, language, and grammar, the pure acoustic task of identifying vocalizations and individual speakers is a necessary first step.
Leong says that vocal, visual, tactile, and hormonal communication are all part of elephant "speech." Additionally, their vocal range is 7 to 200 Hz, compared to the human ear's range of 20 Hz to 20 kHz, so humans can't hear much of an elephant's speech.
"Over 85 percent of the elephant calls we recorded were rumbles, which … have the added complication of containing infrasonic components [which humans can't hear]," Leong says. "Blue whales are an example of another species with similar vocal challenges for researchers."
Name that Sound
The team's automated approach takes individual extracted vocalizations and applies a trained HMM to identify the vocalization in nearly real time. Given a sequence of distilled data reductions—in this case, coefficients derived from logarithmically rescaled fast Fourier transforms of the elephant vocalizations—the researchers built and trained a Markov chain to analyze new sound sequences.
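To make the "distilled data reduction" concrete, here is a minimal sketch of one common way to turn a short audio frame into cepstral coefficients: take the FFT, take the logarithm of its magnitude, and transform back. It is an illustration only, not the team's actual HTK feature pipeline, which may add frequency warping or other steps. The function name, the synthetic 20 Hz test tone, the 1 kHz sample rate, and the frame size are all assumptions chosen for the example.

```python
import numpy as np

def cepstral_coefficients(frame, n_coeffs=12):
    """Return the first n_coeffs real-cepstrum coefficients of one audio frame."""
    windowed = frame * np.hamming(len(frame))   # taper the edges to limit spectral leakage
    spectrum = np.abs(np.fft.rfft(windowed))    # magnitude spectrum via the FFT
    log_spectrum = np.log(spectrum + 1e-10)     # log rescaling; small offset avoids log(0)
    cepstrum = np.fft.irfft(log_spectrum)       # inverse FFT of the log spectrum
    return cepstrum[1:n_coeffs + 1]             # skip coefficient 0 (overall energy)

# Hypothetical input: a synthetic low-frequency "rumble" sampled at 1 kHz.
sample_rate = 1000
t = np.arange(2048) / sample_rate
signal = np.sin(2 * np.pi * 20 * t) + 0.5 * np.sin(2 * np.pi * 40 * t)

frames = signal.reshape(-1, 256)                # eight non-overlapping 256-sample frames
features = np.array([cepstral_coefficients(f) for f in frames])
print(features.shape)  # (8, 12)
```

Each row of `features` is one frame's coefficient vector; a sequence of such rows is the kind of input an HMM would be trained on.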
This chain is a series of steps in which the probability of each next step depends only on the chain's current state, not on the earlier data or on the path the chain took to get there. Do enough "walks" to train the chain, and it settles into an equilibrium that you can use to identify new, unknown samples. Unlike rules-based systems, in which humans have to discover and then codify what makes each vocalization unique, an HMM simply requires the researchers to label different vocalizations according to type and then lets the software decide how to distinguish the types. Once Clemins' team trained the chain, they ran new vocalizations through a Viterbi pattern-matching algorithm to rapidly classify them into previously identified speaker identifications and vocalization categories, producing a log of "which elephant said what."
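The classification step can be sketched as follows: score a new sequence against each trained model with the Viterbi algorithm and pick the label whose model scores highest. This is a toy illustration, not the team's HTK setup. It assumes discrete observation symbols (real systems typically model continuous cepstral vectors), and the two models, their probabilities, and the helper names are invented for the example.

```python
import numpy as np

def viterbi_log_score(obs, log_start, log_trans, log_emit):
    """Log probability of the single best state path for a discrete observation sequence."""
    score = log_start + log_emit[:, obs[0]]
    for symbol in obs[1:]:
        # For each state, keep the best predecessor, then emit the current symbol.
        score = np.max(score[:, None] + log_trans, axis=0) + log_emit[:, symbol]
    return float(np.max(score))

def classify(obs, models):
    """Return the label whose HMM gives the highest Viterbi score."""
    return max(models, key=lambda label: viterbi_log_score(obs, *models[label]))

def make_model(start, trans, emit):
    """Convert probability tables to log space (hypothetical helper)."""
    return tuple(np.log(np.array(m)) for m in (start, trans, emit))

# Two made-up two-state models over symbols 0, 1, 2.
models = {
    "rumble": make_model([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]],
                         [[0.8, 0.1, 0.1], [0.6, 0.3, 0.1]]),
    "trumpet": make_model([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]],
                          [[0.1, 0.1, 0.8], [0.1, 0.3, 0.6]]),
}

print(classify([0, 0, 1, 0], models))  # rumble
print(classify([2, 2, 1, 2], models))  # trumpet
```

Working in log space keeps long sequences from underflowing, which matters once vocalizations span hundreds of frames.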
The HMM identification software runs on an ordinary (2-GHz) workstation using Cambridge University's freely downloadable HTK 3.1.1 software and Clemins' own Java additions and scripting. Among the additions was a silence model, which defined "not speaking" as a valid part of speech so that the HMM didn't miscategorize initial pauses or lulls in the vocalization data.
It took the software just five minutes to train the speaker ID experiment 150 times with 150 known vocalizations. To provide the training data, the team gave each elephant its own microphone, which not only greatly reduced noise but also ensured that the training data for identifying individual elephants was accurate. Even so, numerous false positives for the shorter vocalizations and missed rumbles coincident with environmental noise occurred in the automated tests. The need for initial human processing might suggest further room for error. Fortunately, Leong says that in her experience, human error in classifying vocalization types, unlike that in identifying speakers, is nearly zero. Still, the researchers needed several human-tagged samples of the herd.
"The more training data, the better generalized the models are and the more accurate the classification," Clemins says. "To a bioacoustics person, large amounts of data means hundreds of individual calls. To a speech person, this means hours of continuous speech consisting of hundreds of thousands of words. So, there is a definite gap here. The main disadvantage for us is [the] lack of high-quality, noise-free training data."
A big hurdle for real-time vocalization translation remains the initial extraction of each vocalization. The software expects discrete elephant vocalizations and can't handle a raw stream of sound. Thus, separating out the vocalizations from the raw sound tapes also requires a human operator—in this case, Leong. Disney's Wildlife Tracking Center has real-time spectrogram software, which scrolled the audible and infrasound spectrograms and let Leong identify, tag, and dump individual vocalizations into digital files. This step, which takes about twice as long as simply listening, is therefore the slowest part of the recording process. Despite these hurdles, Clemins is looking to the technology's future. "I don't think it would be a stretch to say that the algorithm could be implemented in real time on a Palm Pilot or Pocket PC, although we haven't tried this." Clemins says they've used the same algorithms as the speech-recognition programs available for such platforms.
Even real-time work would require a training period in which users label each discrete vocalization, but as the library of processed vocalizations grows, the software would encounter fewer unknown or uncertain matches. "Eventually, the software could be set to just run and pick the best choice and analyze a whole tape in near real time," Clemins says. As users fix the mistakes, the software learns and improves.
Supporting the Science
Leong says that although all the rumble vocalizations they used in the experiment sound essentially the same to the human ear, their speaker ID work supports the claim of distinct voices. Many researchers believe elephants do have distinct voices, but there is thus far little strong scientific evidence to directly support this stance.
Animal Kingdom's Joseph Solti believes the animals have unique voices and provides physical evidence using principal components analysis (PCA) of vocalization formant frequencies. For example, the dominant frequency of an elephant's voice is influenced by the shape of its vocal tract. Clemins notes that Solti's use of a more traditional bioacoustics method has lower accuracy but other inherent advantages.
"The main drawback to our method is that it's hard to visualize in a spectrogram what cepstral coefficients are and how the vocalizations are different," Clemins says. "With Joseph's paper, he was able to point out that the location of the formant frequencies varied the most between the speakers using PCA. This kind of result gives bioacoustic researchers something to visualize—cepstral coefficients do not."
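For readers unfamiliar with PCA, the sketch below shows why it lends itself to visualization: it finds the direction along which a set of measurements varies most and reports how much each original measurement contributes to it. The formant values here are synthetic and entirely made up for illustration (two invented speakers who differ only in their second formant). They are not data from Solti's study.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical formant measurements in Hz; columns are F1 and F2.
speaker_a = rng.normal([30.0, 100.0], [2.0, 3.0], size=(50, 2))
speaker_b = rng.normal([30.0, 130.0], [2.0, 3.0], size=(50, 2))
data = np.vstack([speaker_a, speaker_b])

# PCA via singular value decomposition of the mean-centered data.
centered = data - data.mean(axis=0)
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)

# Rows of vt are principal directions; squared singular values are
# proportional to the variance captured along each direction.
explained = singular_values**2 / np.sum(singular_values**2)

print(np.abs(vt[0]))   # loadings of the first component on F1 and F2
print(explained)       # fraction of variance captured by each component
```

In this contrived case the first component loads almost entirely on F2, which is the kind of directly interpretable result that cepstral coefficients do not offer.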
Given the real-time extraction and computer identification of elephant vocalizations, a "dictionary" of elephant sounds and the ability to play back samples and communicate with these animals might be a possibility—at least Clemins would like to think so. However, he points out, significant hurdles remain. "I'd be surprised if it doesn't take years for us to unlock the various meanings of each vocalization. Our software only matches labels [elephant names and vocalization types] to vocalization data," he says. "We need humans to try and understand what these labels might be."
However, the computerized system does provide an unexpected benefit. "In the course of our research, we discovered two new types of vocalizations that had never been described before," Leong says. "If we were relying on an automatic classification scheme, we would have missed those entirely."