Issue No.06 - November/December (2006 vol.21)
Published by the IEEE Computer Society
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/MIS.2006.115
The first news story, "AI Plus Art: DaVinci Would Love It," looks at interactive art that incorporates artificial intelligence techniques. The second news story, "In Silico Vox: Speech Recognition on a Chip," reports on an attempt to create the first chip that can handle speaker-independent, large-vocabulary speech recognition.
AI Plus Art: DaVinci Would Love It
A new wave of researchers with concurrent interests in artificial intelligence and art or music are creating breakthrough, interactive works for audiences and creative tools for artists. Of course, visionary inventors have combined a passion for art and science for centuries. Leonardo DaVinci would probably be captivated by today's AI research—perhaps even more so by the possibilities for art that AI opens up.
A painting you can relate to
Imagine if the Mona Lisa could tell you were having a bad day and gave you a bigger smile. For John Collomosse, a lecturer at the University of Bath's Department of Computer Science, interactive art that adjusts to suit the viewer's emotional state isn't fiction but reality.
He and two research colleagues, Maria Shugrina and Margrit Betke, both at Boston University's Computer Science Department, have developed an "empathetic painting" project. Using an everyday webcam and software they developed, their system analyzes a viewer's expression and adjusts art (displayed on a computer monitor) accordingly in real time, for example by changing its hue or intensity. (See what it looks like at www.cs.bath.ac.uk/~vision/empaint.)
Essentially, Collomosse says, the project uses computer vision algorithms to look at various aspects of your facial expression (eyebrows, for example) and map this information to a 2D space representing emotions. This information then feeds a rendering algorithm, which changes the artistic image.
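To make that pipeline concrete, here is a minimal sketch in Python. It is not the team's software: the two input measurements, the valence/arousal reading of the 2D space, and the scaling constants are all assumptions chosen for illustration.

```python
import colorsys

def clamp(value, low, high):
    """Keep value inside the closed interval [low, high]."""
    return max(low, min(high, value))

def emotion_point(brow_raise, mouth_curve):
    """Map two hypothetical facial measurements to a point in a 2D emotion space.

    Assumption: both inputs are roughly normalized to [-1, 1], and the two
    axes are read as (valence, arousal). The article only says "a 2D space".
    """
    valence = clamp(mouth_curve, -1.0, 1.0)
    arousal = clamp(brow_raise, -1.0, 1.0)
    return valence, arousal

def restyle_pixel(rgb, valence, arousal):
    """Shift hue with valence and scale saturation with arousal.

    rgb is a tuple of floats in [0, 1]. The factors 0.1 and 0.5 are arbitrary.
    """
    hue, sat, val = colorsys.rgb_to_hsv(*rgb)
    hue = (hue + 0.1 * valence) % 1.0
    sat = clamp(sat * (1.0 + 0.5 * arousal), 0.0, 1.0)
    return colorsys.hsv_to_rgb(hue, sat, val)
```

A neutral expression, emotion_point(0.0, 0.0), leaves the pixel unchanged, while a higher arousal value makes the same pixel more saturated.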
The empathetic-painting team completed their project's first phase last summer in Bath. The two Boston University researchers are now working on the second phase—making the vision components more robust so that they'll work with a wider range of people, Collomosse says.
He hopes the work will lead to novel ways for viewers to interact with art, as well as advance other research involving artistic rendering algorithms. While those algorithms live today only in academia, Collomosse says the huge library of digital photographs that almost everyone is amassing these days will create demand for artistic products that let users do new and different, perhaps painterly, tasks with those images. He hopes that the algorithms he's developing can be fine-tuned and optimized for various artistic tasks.
Plus, he'd like to turn the software loose on the Web and see what artists can do with it. "We'd like to see if this can inspire new creative processes," he says.
A more interactive accompanist
Musicians have long worked with Music Minus One recordings, which, for example, might give an oboe player a chamber orchestra accompaniment track for a Vivaldi concerto. But there's always a disconnect as the live player waits for the recorded accompaniment to begin. Moreover, the player ends up following the recorded music's cues. This is the opposite of what happens with a live orchestra, which takes cues from the soloist and learns from him or her during repeated practices.
This static form of accompaniment doesn't help the musician improve performance skills. So, Christopher Raphael, an associate professor of informatics at Indiana University, seeks to develop an accompaniment system that listens to the player's music and learns from him or her for future sessions. "I'd like this to be a part of a musician's toolkit, the way a metronome is," he says.
To make the technology viable, Raphael faces two main problems. First, his system must hear the music being played and match it with the score's information about, for example, pitch, note duration, and transitions. To do this listening and matching, his system uses a statistical method involving hidden Markov models.
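The listening-and-matching step can be pictured with a toy hidden Markov model. The sketch below is not Raphael's system. It assumes the audio has already been reduced to one detected pitch per frame, treats each score note as a hidden state that can only repeat or advance, and uses made-up probabilities (p_stay, p_correct). It then runs the standard Viterbi algorithm to find the most likely score position for every frame.

```python
import math

def align_to_score(score_pitches, heard_pitches, p_stay=0.6, p_correct=0.9):
    """Return, for each heard frame, the index of the score note most likely sounding.

    Hidden states are score notes in order (a left-to-right HMM that starts
    on the first note). All probabilities are illustrative assumptions.
    """
    n = len(score_pitches)
    neg_inf = float("-inf")

    def emit(state, pitch):
        # Log-probability of hearing this pitch while on this score note.
        match = score_pitches[state] == pitch
        return math.log(p_correct if match else 1.0 - p_correct)

    best = [neg_inf] * n
    best[0] = emit(0, heard_pitches[0])
    backpointers = []
    for pitch in heard_pitches[1:]:
        new_best, pointers = [neg_inf] * n, [0] * n
        for s in range(n):
            stay = best[s] + math.log(p_stay)
            move = best[s - 1] + math.log(1.0 - p_stay) if s > 0 else neg_inf
            if move > stay:
                new_best[s], pointers[s] = move + emit(s, pitch), s - 1
            else:
                new_best[s], pointers[s] = stay + emit(s, pitch), s
        backpointers.append(pointers)
        best = new_best

    # Trace back from the most likely final state.
    state = max(range(n), key=lambda s: best[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

For a three-note score [60, 62, 64] (MIDI pitch numbers) and heard frames [60, 60, 62, 62, 62, 64], the alignment is [0, 0, 1, 1, 1, 2]: the player held the first note for two frames and the second for three.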
Second, the system must get the timing right, which involves complex problems of rhythm: No musician wants the accompaniment to sound mechanical and stilted. His work models the problem using hundreds of random variables represented via a graph, a Bayesian belief network. "The only way you can coordinate parts is predicting into the future what you'll hear," Raphael says. (For more details, see http://xavier.informatics.indiana.edu/~craphael.)
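The idea of predicting into the future can be shown with something far simpler than a belief network. The sketch below swaps in plain tempo extrapolation: it smooths the soloist's recent seconds-per-beat and projects when a later score position should sound. The function name, the smoothing weight, and the input format are assumptions made for illustration.

```python
def predict_onset(onset_times, score_beats, target_beat, smoothing=0.5):
    """Predict when target_beat will sound, given the soloist's notes so far.

    onset_times: times in seconds at which notes were heard (at least two).
    score_beats: the score positions, in beats, of those same notes.
    This is simple exponential smoothing, not a Bayesian belief network.
    """
    seconds_per_beat = None
    notes = list(zip(onset_times, score_beats))
    for (t0, b0), (t1, b1) in zip(notes, notes[1:]):
        local = (t1 - t0) / (b1 - b0)  # tempo of the latest interval
        if seconds_per_beat is None:
            seconds_per_beat = local
        else:
            seconds_per_beat = smoothing * local + (1.0 - smoothing) * seconds_per_beat
    return onset_times[-1] + (target_beat - score_beats[-1]) * seconds_per_beat
```

If the soloist played beats 0, 1, and 2 at 0.0, 0.5, and 1.0 seconds, predict_onset([0.0, 0.5, 1.0], [0, 1, 2], 3) gives 1.5 seconds, so the accompaniment can schedule its next note before hearing the soloist's.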
What are Raphael's biggest challenges with the project? First, he continually tries to enhance the performance's musicality. "If you're playing musically (live) and I follow you, I can piggyback [pick up on a soloist's style] a certain amount," he says. Not so with a recording—yet.
Raphael is also trying to improve the synthesis of the recorded music, so that, for example, the first attack on a note isn't distorted.
Mimi Zweig, a professor at Indiana University's Jacobs School of Music, has used Raphael's system with some of her strings students for the past year. "It has given my students the opportunity to rehearse and play with an orchestra that can actually follow them," Zweig says. "This is an invaluable experience for the development of young musicians who otherwise may have to wait many years to play with an orchestra."
While Raphael hopes the technology will help get more people involved with music, he also hopes it creates fresh ground for composers.
"My system allows you to play music you couldn't do another way," he says. For example, a composer can create a combination of orchestral and electronic music, with the electronic music making "superhuman demands" on what a person could do with an instrument, Raphael says.
"The door is open to a new kind of music no one has composed yet," he says. "With electronic instruments, anything goes, but it tends to sound sterile. This is a means to bring together the two worlds."
Robotic bagpipers and hidden beats
For Roger Dannenberg, an associate research professor at Carnegie Mellon University's School of Computer Science and School of Art, art meets AI daily—and it sounds right out of Scotland. Dannenberg continues to improve the software that fuels McBlare, a robotic bagpiper that he and his students have been developing for about two years. (Take a look at McBlare at Dannenberg's site, www.cs.cmu.edu/~rbd.) Dannenberg, who was trained as a trumpeter, performed with McBlare in Scotland in the summer of 2006.
AI is helping Dannenberg tackle one of his big technology challenges—improving the ornamentation in McBlare's playing. "One of the things we've been doing with the control software is looking at MIDI [musical instrument digital interface] files of bagpipe music that other people have prepared, and we're pulling out ornamentation. One of the main characteristics of bagpipe music is the ornamentation, which is very fast," Dannenberg says. "Then we're taking new music for McBlare to play and inserting ornamentation into the transitions." AI helps match the ornamentation and transitions, he says.
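The extract-then-insert workflow Dannenberg describes can be sketched as a lookup table keyed by note transitions. This is a deliberate simplification: the (pitch, is_ornament) representation is hypothetical, real MIDI files don't label ornaments, and the matching in McBlare's software is presumably more flexible than an exact-match dictionary.

```python
def learn_ornaments(tune):
    """Collect the ornaments that appear between pairs of main notes.

    tune: list of (pitch, is_ornament) tuples, a hypothetical pre-labeled form.
    Returns {(previous_main_pitch, next_main_pitch): [ornament pitches]}.
    """
    table = {}
    previous_main, pending = None, []
    for pitch, is_ornament in tune:
        if is_ornament:
            pending.append(pitch)
            continue
        if previous_main is not None and pending:
            # Keep the first ornament seen for each transition.
            table.setdefault((previous_main, pitch), list(pending))
        previous_main, pending = pitch, []
    return table

def add_ornaments(melody, table):
    """Insert learned ornaments into the transitions of a plain melody."""
    decorated = []
    for i, pitch in enumerate(melody):
        if i > 0:
            for grace in table.get((melody[i - 1], pitch), []):
                decorated.append((grace, True))
        decorated.append((pitch, False))
    return decorated
```

Given a source tune [(69, False), (79, True), (71, False)], the table records one ornament for the 69-to-71 transition, and add_ornaments([69, 71, 69], table) decorates only that transition in the new melody.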
The biggest problem with McBlare now isn't with the AI but with regulating the air supply, a problem for which Dannenberg hopes to create a mechanical solution.
Dannenberg has previously worked on computer accompaniment systems and musical-style classification systems. He has also composed and performed pieces such as "In Transit," which demonstrate how music generation algorithms can help computers accompany musicians in intelligent ways, evolving on the fly during an actual performance based on the musician's work. In this way, the PC is interacting with the musician to better please the audience.
Ultimately, he'd like to try to teach computers to classify music and even improvise, as human players do. Recently, his research has included work on what he calls a standard but important problem in music recognition: getting a PC to find the beat in music. "It seems like a simple problem, but it's difficult for machines," Dannenberg says.
Rock music, for example, often masks drum beats behind stronger vocals. "The simple beat information is really buried in the signal," he says. So how do humans figure out where the beat is? "My belief is humans are listening not only to low-level stuff like drums but higher-level stuff like lyrics and repetitions," Dannenberg says.
From an AI perspective, you have knowledge about the structure and signal and other patterns, but to do good beat tracking, you have to combine all the information. To build a PC that can truly follow other musicians' tempo interactively, Dannenberg and other AI researchers will have to break down the high- and low-level music information and create new models to connect it.
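A toy example shows why combining cues matters. The sketch below is not Dannenberg's method. It assumes each cue has already been turned into a per-frame strength list (for instance, one from drum onsets and one from chord changes) and picks the beat period whose summed autocorrelation across all cues is highest.

```python
def autocorrelation(signal, lag):
    """Sum of products of the signal with itself shifted by lag frames."""
    return sum(signal[i] * signal[i + lag] for i in range(len(signal) - lag))

def estimate_beat_period(cues, min_lag, max_lag):
    """Return the lag (in frames) with the highest combined autocorrelation.

    cues: list of equal-length strength lists, one per information source.
    The choice of cues and the unweighted sum are illustrative assumptions.
    """
    def combined(lag):
        return sum(autocorrelation(cue, lag) for cue in cues)
    return max(range(min_lag, max_lag + 1), key=combined)
```

Suppose the low-level cue pulses every 2 frames and a higher-level cue pulses every 4. On its own, the low-level cue suggests a period of 2, but estimate_beat_period applied to both cues together settles on 4, the period both sources agree on.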
Looking at music from multiple perspectives (such as beat tracking and structure) will likely do a better job at getting a machine to recognize and create music. The same type of challenge, integrating knowledge from many different perspectives, is one that AI researchers across many disciplines today are tackling.
Will we see more combinations of AI with interactive art and music in the future? "There's no doubt about it," Raphael says. "It's certainly catching on in academia. There's a deep intellectual interest."
For researchers like these three, not only the advancement of AI but also the desire for better art fuels the continued work.
"Composers are always looking for new sounds," Dannenberg says. "Regardless of how much AI is involved, if you get the computer to generate music for you, it's likely to generate something new—that gives you new ideas."
In Silico Vox: Speech Recognition on a Chip
Carnegie Mellon University and University of California, Berkeley, researchers are working on developing the first chip that can handle speaker-independent, large-vocabulary speech recognition. Their In Silico Vox project, partly funded by the US Department of Defense, could help intelligence services with high-speed "audio mining" of wiretaps and could support other defense-related technology. If the chip design proves commercially viable, it could have far-reaching implications. Speech recognition could become a standard feature of small devices such as remote controls and cell phones, liberating users from complex hierarchical menu structures.
Leading the project is Rob Rutenbar, professor of electrical and computer engineering at CMU. "We have a working 1,000-word naval command-and-control recognizer running right now in an FPGA [field-programmable gate array]," Rutenbar says. However, it's only a proof-of-concept version. He adds, "We're currently working on a 5,000-word recognizer, mapped to one large FPGA, that should be running five to six times faster than real time."
Performance is crucial
According to Rutenbar, future speech recognition applications will require a hundred to a thousand times the current technology's performance. Although PC-based systems have improved in recent years, Rutenbar argues that the general-purpose processors on which advanced speech software runs aren't designed to handle the demands of speech recognition efficiently. "Moving the computationally demanding tasks of speech recognition to silicon leads to faster, and ultimately cheaper, solutions," he says.
Custom silicon, Rutenbar adds, "can tailor arithmetic precision to match the demands of the application, deploy as much or as little parallelism as the task warrants, and optimize memory to meet very specific bandwidth needs. A general-purpose processor, on the other hand, cannot."
Another issue is power constraints. General-purpose chips consume too much power to provide speaker-independent, large-vocabulary speech recognition on platforms such as cell phones.
Like graphics processing, but more so
Inspiration for this project came from computer graphics hardware, which works much faster than an equivalent software solution. Graphics hardware uses arithmetic units that speed up common, fairly straightforward rendering operations (such as perspective, light calculation, and window clipping). But speech recognition poses a different set of problems. For example, a speaker-independent, large-vocabulary speech recognizer must not only distinguish different sounds (and similar but different sounds such as the "t" in words like "two" and "butter") but also deal with variations in speaking rates, articulation, and pronunciation. Such requirements make stringent computational demands on any hardware implementation.
Speech recognition systems consist of an acoustic front end, feature scoring, and back-end search. The front end converts audio signals into phonemes ("sound bits"), feature scoring identifies the phonemes, and the back end compares combinations of phonemes to sounds used in words. It locates a word in its memory that's associated with a string of phonemes, applies a language model, and produces a match for what a speaker is saying.
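The back-end step, matching phoneme strings to words and then letting a language model pick among candidates, can be sketched in a few lines. This is a toy, not the In Silico Vox design. It assumes the phonemes are already known with certainty, uses a made-up pronunciation dictionary and made-up bigram probabilities, and skips the probabilistic acoustic scoring that real recognizers perform.

```python
def decode(phonemes, lexicon, bigram, floor=1e-4):
    """Find the most probable word sequence that spells out the phonemes.

    lexicon: {tuple of phonemes: [words pronounced that way]} (hypothetical).
    bigram: {(previous_word, word): probability}; unseen pairs get `floor`.
    """
    # best[i] maps the last word of a hypothesis covering phonemes[:i]
    # to (probability, list of words so far).
    best = [dict() for _ in range(len(phonemes) + 1)]
    best[0]["<s>"] = (1.0, [])
    for end in range(1, len(phonemes) + 1):
        for start in range(end):
            for word in lexicon.get(tuple(phonemes[start:end]), []):
                for previous, (prob, words) in best[start].items():
                    candidate = prob * bigram.get((previous, word), floor)
                    if candidate > best[end].get(word, (0.0, None))[0]:
                        best[end][word] = (candidate, words + [word])
    if not best[-1]:
        return []  # no word sequence matches the phonemes
    return max(best[-1].values(), key=lambda entry: entry[0])[1]
```

With a lexicon in which "t uw" can be either "two" or "to", the language model breaks the tie: if "two cats" is a far likelier pair than "to cats", decoding the phonemes t, uw, k, ae, t, s yields ["two", "cats"].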
Rutenbar and his colleagues have determined that the time spent converting audio signals to phonemes is negligible. To speed up feature scoring and search, the In Silico Vox chip uses what Rutenbar calls "computing units" ("'processors' is a rather loaded word") to work on the various stages of recognition simultaneously. Rutenbar explains, "There's a big messy dynamic data structure getting updated concurrently by matching at the lowest level (acoustics, sounds), at the medium level (say, for all the words in a 1,000-word vocabulary) and at the top level (relationships among sequences of words that are more likely to appear). Things at the higher levels of recognition are feeding back information, dynamically, that is reshaping the way we calculate probabilities in the earlier stages, for later speech input."
Rutenbar gives an example of the process: "Once we start recognizing certain likely sequences of words, we start fetching information about what the 'next' likely words are, from our language model, and getting those ready for matching, should the spoken utterances go in that direction. All these processes are hitting the same big DRAM-based memory where all the speech models live. The parallel computing units in the design coordinate access to a very large, complex set of acoustic and language models living out in that shared memory. In other words, the computing units don't just talk to each other; they have to coordinate to talk to one or a few shared DRAM memory units."
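In software terms, the look-ahead Rutenbar describes resembles asking a language model which words most often follow the one just recognized, then loading their models early. The sketch below illustrates only that ranking step. The counts table and the example vocabulary are invented, and nothing here reflects how the chip actually organizes its memory.

```python
def next_likely_words(last_word, follower_counts, top_k=3):
    """Rank the words most likely to follow last_word.

    follower_counts: {word: {next_word: count}}, a hypothetical bigram table.
    Ties are broken alphabetically so the result is deterministic.
    """
    followers = follower_counts.get(last_word, {})
    ranked = sorted(followers, key=lambda word: (-followers[word], word))
    return ranked[:top_k]
```

For a table where "fire" is followed by "torpedo" 5 times, "missile" 3 times, "at" 3 times, and "the" once, next_likely_words("fire", counts, top_k=2) returns ["torpedo", "at"]. A recognizer could start fetching the models for those two words before the next sound arrives.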
The ultimate goal, says Rutenbar, is a chip that crunches a 50,000-word vocabulary at 1,000 times the speed of real time, but he adds that much work remains to be done.
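To get a feel for what "times the speed of real time" means for audio mining, the arithmetic is simply the recording length divided by the speedup. The helper below is only a worked example; the function name is ours.

```python
def processing_seconds(audio_seconds, speedup_over_real_time):
    """Time needed to process a recording at a given multiple of real time."""
    return audio_seconds / speedup_over_real_time
```

At the 1,000-times goal, an hour of audio (3,600 seconds) would take about 3.6 seconds to process. At the five-to-six-times speed Rutenbar cites for the 5,000-word FPGA recognizer, the same hour would take 10 to 12 minutes.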
The case for software
Meanwhile, some analysts question the hardware approach to speech recognition.
Steve Leibson, technology evangelist at chip maker Tensilica, points out that silicon-based systems lead to a loss of programmability. "Application-specific processors and multiple-processor SoCs [systems on a chip] radically redefine what software can and cannot do on a chip," he says. "Retaining programmability is nice when the algorithm changes, which always seems to happen."
Carol Espy-Wilson, associate professor at the University of Maryland's Department of Electrical and Computer Engineering, argues that chip-based natural speech recognition is premature. "Hardware will allow the recognizer to run faster, but speed is not the issue," she says. "Accuracy in recognizing spontaneous speech in everyday situations is the issue."
Rutenbar agrees that a speech recognizer has many idiosyncrasies and subtleties that are critical for good accuracy and good performance. "Before you can build what's called a 'platform version' of something like a recognizer—an architecture that can host a lot of different but related recognition strategies—you have to build a handful of successful dedicated recognizers, and understand what the commonalities are. The idea is to put the essential, performance-critical commonalities in hardware, and try to do something programmable for the remainder. All without sacrificing a lot of performance. We are still in the middle of that research right now."