Pages: pp. 6-9
Smart networks of cell-phone cameras will be able to do essentially the same things as regular camera networks, such as object recognition and surveillance. However, they'll bring these functions to ubiquitous cell phones instead of specialized platforms. Researchers are already developing applications for these networks. But how useful will these applications be?
David Lowe, a University of British Columbia computer science professor, says that "the most promising application is for cell phone users to access information by taking a picture." According to Lowe, a user could take a picture of a building and use it to determine his or her location or access tourism information about that location. Or that person could use a picture of a movie or concert poster to access reviews or previews. In addition, the user could take pictures of products (such as a CD cover or wine label) and obtain information on those products or even order them. Lowe says that for some types of search, this approach might be "far easier and more accurate than trying to search for information using keywords."
One such application is the Pocket Supercomputer, being developed by Accenture Technology Labs (www.accenture.com/Global/Services/Accenture_Technology_Labs/default.htm). Fredrik Linaker and his colleagues there recently demonstrated a prototype cell phone camera network that uses a cell phone video clip as an alternative to text search strings on ordinary cell phones. Linaker envisions the technology eventually being used for such applications as online price comparisons and book reviews.
The Pocket Supercomputer uses the scale-invariant feature transform (SIFT) algorithm, developed by Lowe at UBC, to search the image database on the server that the cell phone connects to.
According to Lowe, "the essential technology is the ability to match images to large databases and identify the same object or location, even under different lighting conditions and viewpoints." He says that over the past five years, computer vision's ability to match images has improved to the point where it can identify correct matches in databases containing thousands or millions of images. Lowe says that he designed SIFT "to allow matching that is invariant to a range of image transformations, such as change in image resolution, rotation, viewpoint, and lighting."
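To make the matching step concrete, here is a minimal Python sketch of nearest-neighbor descriptor matching with a distance-ratio test, the acceptance rule described in Lowe's published SIFT work. The article doesn't say how the Pocket Supercomputer or any other system mentioned here performs matching, so the function names, the toy four-dimensional descriptors, and the 0.8 threshold are illustrative assumptions only (real SIFT descriptors have 128 dimensions).

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def match_descriptors(query, database, ratio=0.8):
    """Return (query_index, database_index) pairs that pass the ratio test.

    A match is kept only when the nearest database descriptor is clearly
    closer than the second nearest, which filters out ambiguous matches.
    The 0.8 ratio is an assumed default, not a value from the article.
    """
    matches = []
    for qi, q in enumerate(query):
        # Rank every database descriptor by its distance to this query.
        ranked = sorted((euclidean(q, d), di) for di, d in enumerate(database))
        if len(ranked) < 2:
            continue  # the ratio test needs at least two candidates
        (best, best_index), (second, _) = ranked[0], ranked[1]
        if best < ratio * second:
            matches.append((qi, best_index))
    return matches


# Toy descriptors, invented for illustration.
database = [[0, 0, 0, 0], [10, 10, 10, 10], [10, 0, 10, 0]]
query = [
    [1, 0, 0, 1],  # clearly closest to database[0]
    [5, 5, 5, 5],  # equally far from every entry, so it is rejected
]
matches = match_descriptors(query, database)
print(matches)
```

A brute-force scan like this wouldn't scale to the databases of thousands or millions of images that Lowe mentions; large systems typically rely on approximate nearest-neighbor indexes instead.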
Microsoft also has two smart-network projects in development that use cell phones. According to Marc Smith, a senior research sociologist at Microsoft Research, Aura is essentially a cell phone with a bar code reader that uses photographs of barcodes to identify objects. Smith adds, "There's another project at Microsoft Research called Lincoln. You just take a picture of the book, and we'll find the cover in Amazon for you. It uses image recognition in one dimension." Lincoln's algorithm analyzes and creates signatures for pictures using a small amount of data. Instead of comparing large numbers of features individually, the algorithm creates data triplets from groups of three features in the picture. The search algorithm then compares these data sets with those in a database. Information on Lincoln, headed by Larry Zitnick, is available at http://lincoln.msresearch.us/Lincoln/Logon.aspx.
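The triplet idea can be sketched in a few lines of Python. The article doesn't specify what Lincoln stores for each triplet or how it compares them, so everything below is an assumption made for illustration: each group of three feature points is reduced to two quantized side-length ratios, and two pictures are compared by the overlap of their signature sets.

```python
import math
from itertools import combinations


def triplet_signatures(points, bins=10):
    """Build a set of quantized signatures, one per group of three points.

    Hypothetical signature: the two shorter sides of the triangle divided
    by the longest side. These ratios don't change when the picture is
    shifted, rotated, or uniformly scaled.
    """
    signatures = set()
    for a, b, c in combinations(points, 3):
        sides = sorted([math.dist(a, b), math.dist(b, c), math.dist(a, c)])
        longest = sides[2]
        if longest == 0:
            continue  # all three points coincide; no usable shape
        signatures.add(
            (round(sides[0] / longest * bins), round(sides[1] / longest * bins))
        )
    return signatures


def similarity(sig_a, sig_b):
    """Jaccard overlap between two signature sets (0.0 to 1.0)."""
    if not sig_a and not sig_b:
        return 0.0
    return len(sig_a & sig_b) / len(sig_a | sig_b)


# Invented feature-point coordinates for a stored cover, a photo of the
# same cover (twice as large and shifted), and an unrelated picture.
cover = [(0, 0), (4, 0), (4, 3), (0, 3)]
photo = [(10, 10), (18, 10), (18, 16), (10, 16)]
other = [(0, 0), (1, 0), (0, 1), (5, 5)]

same_score = similarity(triplet_signatures(cover), triplet_signatures(photo))
other_score = similarity(triplet_signatures(cover), triplet_signatures(other))
print(same_score, other_score)
```

Because each signature is just a pair of small integers, a picture can be summarized with very little data, which is the property the article attributes to Lincoln's approach.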
However, Lowe says that Evolution Robotics is the only company that has developed a working system in this area. The system, based on a proprietary version of SIFT, lets users purchase products and obtain product information by inputting a picture of images from magazines, catalogs, and packages (see http://analytica1st.com/analytica1st/2007/07/image-recognition-search-engine-fetches.html).
According to Fredrik Linaker, systems that require users to install special software on the phone are a hurdle for widespread adoption. The Pocket Supercomputer therefore uses a feature that's already built into the phone—namely, video calling. The video calling allows video to be streamed from and to the phone in real time.
For many smart networks of cell phones, camera sensing relies heavily on AI techniques. Linaker says that the two most relevant techniques are "object identification and classification algorithms, through which information can be attached and later retrieved." In the short term, algorithms such as sparse feature-point detection will likely be the fastest and most reliable, says Linaker. According to Linaker, sparse feature-point extraction stores just a few hundred or a few thousand points or regions of an image instead of the millions of pixels required to store the actual image in raw format. He adds, "The best-known algorithm for feature point extraction is SIFT."
Linaker says that in the longer term, achieving general visual intelligence will require visual-processing algorithms based on biological principles.
According to Linaker, the AI algorithms' biggest problem is their computational requirements. "They cannot be run directly on the cell phones and provide instant results for millions of objects," he says. "Deployment is rather more likely to be through a server link."
Moritz Köhler of the Institute for Pervasive Computing agrees that object recognition appears to be the most interesting AI technology for smart networks of cell phone cameras. But he and his colleagues Philipp Bolliger and Kay Römer have taken a different approach to developing Facet, software that lets cell phone cameras act as smart surveillance networks. "Our approach is very much focused on statistical analysis and less on AI," says Köhler.
Köhler says that Facet uses short-range communication (through Bluetooth) to create networks of mobile phones that can "localize and track objects in a high-resolution manner" and analyze events. In Facet, each mobile phone captures a stream of information from its camera, and analyzes the images for evidence of objects entering or leaving the camera's field of view.
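As a rough illustration of how a single phone might flag objects entering or leaving its camera's view, the Python sketch below compares each frame with a fixed background image and reports a change of state. The article doesn't describe Facet's actual method, so the background-differencing approach, the function names, and both thresholds are assumptions.

```python
def changed_fraction(frame, background, pixel_threshold=30):
    """Fraction of pixels that differ noticeably from the background."""
    total = 0
    changed = 0
    for row, background_row in zip(frame, background):
        for pixel, background_pixel in zip(row, background_row):
            total += 1
            if abs(pixel - background_pixel) > pixel_threshold:
                changed += 1
    return changed / total


def detect_events(frames, background, area_threshold=0.1):
    """Return (frame_index, 'enter' or 'leave') for each change of state."""
    events = []
    occupied = False
    for index, frame in enumerate(frames):
        occupied_now = changed_fraction(frame, background) > area_threshold
        if occupied_now and not occupied:
            events.append((index, "enter"))
        elif occupied and not occupied_now:
            events.append((index, "leave"))
        occupied = occupied_now
    return events


# Invented 4x4 grayscale frames: an empty scene, then a bright object
# covering two pixels for two frames, then the empty scene again.
background = [[10] * 4 for _ in range(4)]
with_object = [row[:] for row in background]
with_object[1][1] = 200
with_object[1][2] = 200

frames = [background, with_object, with_object, background]
events = detect_events(frames, background)
print(events)
```

Note that nothing in this sketch identifies what the object is, which is consistent with a system that analyzes events without recognizing objects or faces.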
However, according to Köhler, Facet doesn't perform object recognition, in particular face recognition. So, users can't employ it to track individuals or generate data on specific people. Thus, the system "fully protects people's privacy," he says.
Facet is being developed as an open-source system. There will be no restrictions on who can modify the code, and everyone will be free to experiment with cell phone networks running the code. According to Köhler, potential applications include robot tracking and navigation, and statistical data generation. The source code will be available at www.openfacet.org.
So far, the system has been used only in a testbed at the research lab at ETH Zurich, says Köhler.
Whether these applications are smart enough to become killer apps remains to be seen. For example, Jason Hong, on the faculty of Carnegie Mellon University's Human-Computer Interaction Institute, notes that optical surveillance systems are often compromised by false-positive detections. "If the false positive rate is too high, people start ignoring positive results." Hong says another problem with sensor networks is that inevitably some people will try to scam them with false data.
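A short calculation shows why false positives erode trust so quickly. The Python sketch below applies Bayes' rule to ask what share of alarms is genuine. The numbers are invented for illustration and don't come from Hong or from any deployed system.

```python
def positive_predictive_value(prevalence, true_positive_rate, false_positive_rate):
    """Probability that an alarm corresponds to a real event (Bayes' rule)."""
    true_alarms = prevalence * true_positive_rate
    false_alarms = (1 - prevalence) * false_positive_rate
    return true_alarms / (true_alarms + false_alarms)


# Assumed figures: 1 frame in 1,000 contains a real event, the detector
# catches 99 percent of real events, and it wrongly flags 5 percent of
# uneventful frames.
ppv = positive_predictive_value(
    prevalence=0.001,
    true_positive_rate=0.99,
    false_positive_rate=0.05,
)
print(f"{ppv:.1%} of alarms are genuine")
```

Under these assumed figures, fewer than 1 alarm in 50 is real, even though the detector rarely misses an event. That is the kind of imbalance that leads people to start ignoring positive results.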
Hong also sees potential limitations in the use of smart networks for Internet-based object identification. "One of the big issues for Internet searches would be just how good the computer vision is," he says. "If it doesn't work very well, there will be errors. That may lead to negative feelings about the applications." While Hong agrees that taking a picture of a bar code is relatively straightforward, he says trying to take a picture of something that isn't tagged is much more difficult. Other problems that he foresees include slow input speeds, small screens, and the user experience's overall complexity.
Linaker agrees that these and other problems still must be solved, but he sees them more as hurdles to be overcome than showstoppers. Says Linaker, "Current obstacles involve the scaling up of object databases for visual object recognition, establishment of higher bandwidth and lower-lag server connections through client-server software, the design and choice of which information is to be retrieved upon recognizing an object, and generally in understanding what a user wants when he points his camera toward a particular object. … This is analogous to the Google situation where a user writes a word and pushes the 'I feel lucky' button. But now there's a very small cell phone screen to present one or maybe two results on."
As for emerging smart-network applications, there might be no limit. The University of Cambridge's Eiman Kanjo envisions smart networks of cell phones eventually serving in an array of applications, not all of them camera-based. Her list includes
(Data mules are mobile nodes with wireless communication capability and enough storage to handle data from the sensors in the field.) Kanjo says location tracking and multiplayer games are the most developed areas for cell phone applications; most of the others are still in the research phase.
Microsoft Research's Marc Smith believes he has found an underlying pattern to the development of such smart networks. "I think of these devices as a mechanism for implementing social omniscience," he says. "Social omniscience is the ability to answer your social questions at nearly no cost." He believes that the most successful smart networks will be those that best answer two questions for the user: "Who has what I want; who wants what I have? That is the event loop of existence," he says. "Every organism is executing those two lines of code."
"It's one of the holy grails of science," says Honghai Liu, a senior lecturer at the University of Portsmouth's Institute of Industrial Research. He's talking about the notion that researchers will someday create an AI-endowed robotic hand as dexterous as a human's hands. Yet Liu and other researchers say the achievement is within our grasp. And some believe that when robots do possess usable hands, they'll be able to fully assist humans as coworkers, caregivers, and perhaps even household servants.
Plenty of hurdles remain before that happens, of course. Giving robots true manual dexterity requires that all the techniques of AI come together, Liu explains, "including robotics, knowledge representation, psychology, developmental learning, and so forth." For that reason Liu, like other researchers in the field, is focusing on key portions of the problem.
Liu's focus involves developing software that can learn and apply common human hand movements. Working with Xiangyang Zhu, a professor at Shanghai Jiao Tong University's Robotics Institute, Liu devised a sensor-laden glove that enables its human wearer's motions to be captured by a series of cameras. AI software based on fuzzy qualitative reasoning analyzes the motions. The software can then create instructions derived from the movements it has analyzed. Faced with a task, the software can draw on its knowledge base, "then generate desired numerical trajectories for a robot hand's motion control subsystem," explains Liu.
Learning the smoothly coordinated movements of human hands is a major step. But recognizing an object and understanding how to handle it could prove equally important. That's what Aaron Edsinger and Charles Kemp, researchers at MIT's Humanoid Robotics Group, are working on. One of their experiments involves teaching Domo, a gangly robot they created, to grasp objects that humans hand to it.
The handover actually amounts to a programming shortcut. "When you hand something to a robot, you've done the hard part of understanding what needs to be worked with," Edsinger explains. "You've naturally handed over the object in a way that assists the robot in understanding how to grasp it."
That approach stems from Edsinger's belief that robots and humans will someday work side by side and that—as with seeing-eye dogs—robotic coworkers can be effectively taught by their individual operators. "If a person can do something as simple as point, maybe with a laser pointer, [that action] conveys a lot of information about what's relevant for a task," he says. "It helps limit the scope of the problem for the robot." Pointing and similar interactions are also intuitive for humans. In fact, Edsinger sees them as another kind of man-machine interface, akin to a computer mouse or joystick.
Even so, Domo acts with considerable autonomy, relying on Bayesian algorithms scripted in Python. First the robot must identify an object with its camera eye, then compute how best to grasp it. The MIT researchers simplify the robot's work by limiting the visual information it must process. "It's not looking at the full geometry of the object. It's determining: how long is it? Where is the tip of it, and where is the person located?" says Edsinger.
Up till now, progress in robotic hand-eye coordination has been slow. One reason is that it falls within two distinct areas of AI research: processing visual information and coordinating a robot's motor functions. Researchers working on processing visual information haven't been sufficiently in contact with researchers addressing the other skills needed to make a robotic hand work properly.
Andrew Ng, an assistant professor at Stanford's Computer Science Department, faced that problem head-on when he spearheaded the creation of STAIR (the Stanford Artificial Intelligence Robot). "One of the motivations behind the STAIR project was to try to reintegrate these disparate threads of AI," Ng says. STAIR looks like a cramped computer desk mounted on wheels. Its simple industrial-looking robotic arm belies the complex and delicate tasks it performs. Responding to voice instructions, the robot can retrieve common office objects such as staplers or coffee mugs from another room and hand them to the person making the request.
To complete the task, STAIR must not only differentiate a stapler from a coffee mug or other object but also know where to grasp it. To make that possible, Ng and his colleagues input images of common items, everything from books to wine glasses to screwdrivers. They included several examples of each item so that the robot could use Bayesian reasoning to determine the probability that an item to be fetched matched what was in its database. Just as Edsinger had done, Ng used only a few reference points to form a visual construct of each item. However, in this case, the stored constructs also identified a location on each item where it might best be grasped. So when STAIR recognizes a wine glass, it knows to grasp the glass by the stem.
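The combination of Bayesian recognition and stored grasp locations can be sketched as follows. The article says only that STAIR uses Bayesian reasoning, so this example assumes a simple naive Bayes model with independent yes-or-no features. The knowledge base, its feature names, and all probabilities are hypothetical.

```python
# Hypothetical knowledge base. Each entry holds a prior probability, the
# chance of observing each visual feature on that item, and the place
# where the item is best grasped. All values are invented.
KNOWLEDGE_BASE = {
    "wine glass": {
        "prior": 0.2,
        "likelihood": {"has_stem": 0.90, "has_handle": 0.05},
        "grasp_at": "stem",
    },
    "coffee mug": {
        "prior": 0.5,
        "likelihood": {"has_stem": 0.05, "has_handle": 0.90},
        "grasp_at": "handle",
    },
    "stapler": {
        "prior": 0.3,
        "likelihood": {"has_stem": 0.02, "has_handle": 0.10},
        "grasp_at": "body",
    },
}


def posterior(observed):
    """Posterior probability of each item given observed yes/no features."""
    scores = {}
    for name, entry in KNOWLEDGE_BASE.items():
        score = entry["prior"]
        for feature, present in observed.items():
            likelihood = entry["likelihood"][feature]
            score *= likelihood if present else 1 - likelihood
        scores[name] = score
    total = sum(scores.values())
    return {name: score / total for name, score in scores.items()}


def choose_grasp(observed):
    """Return the most probable item, its grasp location, and its probability."""
    probabilities = posterior(observed)
    best = max(probabilities, key=probabilities.get)
    return best, KNOWLEDGE_BASE[best]["grasp_at"], probabilities[best]


item, grasp_at, probability = choose_grasp({"has_stem": True, "has_handle": False})
print(item, grasp_at, round(probability, 3))
```

Storing the grasp location alongside each recognized item means that recognition and grasp planning are answered by the same lookup, which matches the article's wine glass example.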
From those humble beginnings, Ng believes that in less than 10 years robots will be able to help humans at home with everyday chores. That would "free up vast amounts of human time that we could use to pursue higher endeavors," he says. And interest in his work from several companies would seem to support that contention. Honda, for example, is funding a PhD student working on STAIR's computer vision.
But what about the holy grail of a robotic hand on par with our own? Could a robotic hand ever be made that was dexterous enough to perform a piano concerto, for instance? "It really depends on what quality of performance you are looking for," says Liu. "The exciting technology of robotic hands plus sufficient funding" could make it happen in a few short years, he says.
Figure: The Domo humanoid robot. Domo can grasp objects that are handed to it. (Photo courtesy of Aaron Edsinger)