Issue No. 06 - November/December (2006 vol. 4)
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/MSP.2006.154
Michael D. Smith , Harvard University
Simson Garfinkel , Harvard University
Can you find a terrorist in your database? Do the register receipts from discount drug stores hold the secret to stopping avian flu outbreaks before they become epidemics? Can anonymized data be "re-identified," compromising privacy and possibly jeopardizing personal safety?
Government and industry are increasingly turning to data mining with the hope that advanced statistical techniques will connect the dots and uncover important patterns in massive databases. Proponents hope that this so-called data surveillance technology will be able to anticipate and prevent terrorist attacks, detect disease outbreaks, and allow for detailed social science research—all without the corresponding risks to personal privacy because machines, not people, perform the surveillance.
Emergence of data surveillance
The US public got its first look at data surveillance in 2002 when it learned about the Pentagon's Total Information Awareness (TIA) project. The project, one of several sponsored by the Information Awareness Office (IAO) at DARPA, quickly became a lightning rod for attacks by privacy activists, critics of the Bush administration, and even conspiracy theorists. Some of these attacks were motivated by what the IAO proposed; others were based on the fact that the IAO was headed by Admiral John Poindexter, who had a pivotal role in the 1980s Iran-Contra Affair; and still others were based on the IAO's logo: the all-seeing Eye of Providence floating above an unfinished pyramid (similar to the seal on the back of the US$1 bill), carefully watching over the Earth.
Critics charge that data surveillance is fraught with problems and hidden dangers. One school of thought says that surveillance is surveillance: whether it is done by a person or a computer, some kind of violation of personal privacy or liberty has occurred. The database, once created, must be protected with extraordinary measures and is subject to abuse. Moreover, if data surveillance technology says a person is a potential terrorist, that person could then be subject to additional surveillance, questioning, or detention, even if they haven't done anything wrong. After extensive media coverage and congressional hearings, Congress terminated funding for the IAO in 2003.
Data surveillance jumped again to the front pages of US newspapers in May 2006, when USA Today published a story alleging that the nation's telephone companies had shared the telephone records of millions of Americans with the National Security Agency (NSA). According to USA Today, the NSA used this information to create a database of detailed information for every telephone call made within the nation's borders. The spy agency then mined this database to uncover hidden terrorist networks.
A database of every phone call within the US extending years back through time could certainly prove useful for fighting domestic terrorists. If a French male of Moroccan descent were arrested in Minnesota after receiving 50 hours of flight training and a US$14,000 wire transfer from a known terrorist, such a database could provide a report of every person who called or received a telephone call from that individual. This kind of social network analysis would prove invaluable in finding the would-be pilot's comrades-in-arms, but it might also identify his flight teacher, his pizza delivery service, and even the teenager who mows his lawn.
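The one-hop report described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record format (caller, callee pairs), the function names, and the sample data are all hypothetical, and nothing here reflects how any real agency database is organized.

```python
from collections import defaultdict


def build_contact_index(call_records):
    """Map each number to the set of numbers it called or was called by."""
    contacts = defaultdict(set)
    for caller, callee in call_records:
        contacts[caller].add(callee)
        contacts[callee].add(caller)
    return contacts


def direct_contacts(contacts, number):
    """Return everyone who called, or received a call from, `number`."""
    return sorted(contacts.get(number, set()))


# Hypothetical call records as (caller, callee) pairs.
calls = [
    ("suspect", "flight-school"),
    ("pizza-shop", "suspect"),
    ("suspect", "associate"),
    ("associate", "unknown-1"),
    ("lawn-service", "suspect"),
]

index = build_contact_index(calls)
report = direct_contacts(index, "suspect")
print(report)
```

Note that the report makes no distinction between the associate, the flight school, the pizza shop, and the lawn service. The query returns every direct contact, which is exactly the privacy concern raised above.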
Yet in all of the media coverage of the TIA, the NSA database, and similar projects, many questions seem to be left unasked and unanswered: Does the technology really work—can you find the terrorist in the database? Can you find the terrorist without destroying everybody else's privacy in the process? Is it possible to perform privacy-protecting data mining—at least in theory? And is it possible to turn that theory into practice, or do too many real-world concerns get in the way?
To explore these questions and improve the public's general understanding of them, Harvard's Center for Research on Computation and Society held a daylong workshop in June 2006 on data surveillance and privacy protection. Robert Popp, who served as deputy of the IAO under Poindexter and was a driving force behind the TIA program, gave the keynote address. Popp presented his and Poindexter's vision for using data surveillance for countering terrorism; an article based on that presentation appears on p. 18 of this special issue. (For more information about the workshop, please visit http://crcs.deas.harvard.edu/workshop/2006/.)
Whether data surveillance can find real terrorists and stop actual attacks before they happen is still unproven in the academic literature. Although there's no denying that data surveillance has resulted in arrests, it's not clear if those individuals were actually terrorists or merely people writing novels about terrorists. On the other hand, attendees learned that data surveillance is good for a lot more activities than hunting terrorists.
For example, it might be able to detect and help public health officials contain an outbreak of avian flu. Kenneth Mandl, a researcher at the Harvard Medical School Center for Biomedical Informatics, showed how records of emergency room admissions could anticipate the deaths associated with pneumonia and influenza reported to the US Centers for Disease Control. This isn't tremendously surprising, of course: many deaths occur among people who didn't seek treatment until it was too late and then went to the emergency room. But what's exciting, Mandl reported, is that pediatric admissions peak roughly a month before adult emergency room admissions. With this knowledge, public health officials could build a system that predicted adult outbreaks by monitoring admissions of children. If outbreaks can be predicted, it might be possible to nip them in the bud.
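One simple way to look for this kind of lead time is to correlate the two admission series at different lags and pick the lag with the strongest match. The sketch below assumes weekly counts and uses synthetic numbers invented for illustration; the function names are hypothetical, and this is not a description of the method Mandl's group actually used.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def best_lead(leading, lagging, max_lag):
    """Return (lag, correlation) for the lag at which `leading` best predicts `lagging`."""
    best_lag, best_r = 0, float("-inf")
    for lag in range(max_lag + 1):
        xs = leading[: len(leading) - lag]  # earlier values of the leading series
        ys = lagging[lag:]                  # later values of the lagging series
        r = pearson(xs, ys)
        if r > best_r:
            best_lag, best_r = lag, r
    return best_lag, best_r


# Synthetic weekly admission counts (made up): the adult curve is the
# pediatric curve delayed by four weeks, i.e. roughly a month.
pediatric = [5, 8, 15, 30, 50, 30, 15, 8, 5, 5, 5, 5, 5, 5]
adult = [5] * 4 + pediatric[:-4]

lag, r = best_lead(pediatric, adult, max_lag=8)
print(lag, round(r, 3))
```

On real data the correlation would be far noisier, and seasonal effects would have to be separated from a genuine lead, so a lag estimated this way should be treated as a starting hypothesis rather than a prediction rule.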
The Realtime Outbreak and Disease Surveillance (RODS) project at the University of Pittsburgh might one day be able to provide health officials with even better advance warning. The RODS project monitors the sale of over-the-counter cold remedies and other healthcare products in more than 20,000 stores throughout the US. The theory here is that people will attempt to treat themselves with over-the-counter cold medications before they get so sick that they report to the hospital emergency room. Their work so far indicates that sales of these medications peak two weeks before hospital admissions do.
RODS researchers stress that this so-called biosurveillance (the continuous collation and analysis of medically related statistical data) doesn't violate privacy because no personally identifiable information is ever assembled or reported. The system collects only aggregate sales from a sampling of the nation's largest drugstores and mass-merchandise chains. Of course, in the case of an actual avian flu outbreak or bioterrorism attack, it would be helpful to know the names of those infected. But collecting this information isn't necessary to achieve the project's primary goals and would create unacceptable risks to those involved because the data would be so easily subject to abuse.
In his presentation, Mandl discussed five techniques that organizations collecting data could adopt to protect privacy in large-scale data mining efforts. Policies must be set in place to limit access to sensitive data, and then organizations must police themselves to make sure that those policies are enforced. When possible, data subjects should be given the ability to exert personal control over their own information. Data should be de-identified whenever possible. Finally, says Mandl, all data must be stored with encryption, so that it's protected in the event of a breach.
All of Mandl's techniques assume that sensitive information will be collected and ultimately used in a highly controlled environment. Indeed, this is a model familiar to most people today. Law enforcement, national intelligence organizations, businesses, and even journalists collect a lot of sensitive information during the day-to-day course of their work and then carefully control who can access the information and how they can use it. Big fences and background investigations are an unfortunate but necessary part of the security model for these organizations.
But another approach on the horizon is to use advanced algorithms and cryptographic theory to avoid the problem in the first place. Algorithms and systems now under development make it possible to collect information in a kind of predigested form that allows some queries (but not others) and that makes it impossible to recover the original, undigested data. The US National Science Foundation-funded project on Privacy, Obligations, and Rights in Technologies of Information Assessment is developing a host of tools and technologies to enable this kind of privacy-preserving data mining. Other work is being done at the Center for Information and Computer Security at the University of California, Los Angeles.
In this special issue of IEEE Security & Privacy, we present two articles based on presentations at our workshop. In addition to the article from Popp and Poindexter, we have an article by Jeff Jonas, founder of IBM's Entity Analytic Solutions division, who explains the process of entity resolution—a technique by which names in different databases are determined to represent the same person.
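As a rough illustration of what entity resolution involves, the sketch below groups records from different sources when a normalized name and a birth date agree. It is a deliberately naive example: the record layout, the matching rule, and the sample data are assumptions made for illustration, and it is not the technique described in Jonas's article, which must cope with misspellings, aliases, and conflicting attributes.

```python
import re
from collections import defaultdict


def normalize_name(name):
    """Lowercase the name, drop punctuation, and sort its tokens."""
    tokens = re.findall(r"[a-z]+", name.lower())
    return " ".join(sorted(tokens))


def resolve_entities(records):
    """Group record IDs that share a normalized name and birth date."""
    groups = defaultdict(list)
    for source, record_id, name, birth_date in records:
        key = (normalize_name(name), birth_date)
        groups[key].append((source, record_id))
    # Only groups with more than one record represent a cross-record match.
    return [ids for ids in groups.values() if len(ids) > 1]


# Hypothetical records as (source, record ID, name, birth date).
records = [
    ("airline", "A1", "Doe, Jane Q.", "1970-01-02"),
    ("hotel", "H7", "jane q doe", "1970-01-02"),
    ("hotel", "H8", "Jane Doe", "1981-05-06"),
    ("airline", "A2", "John Roe", "1970-01-02"),
]

matches = resolve_entities(records)
print(matches)
```

Exact matching on a normalized key misses variants such as a dropped middle initial, and it can wrongly merge two different people who share a name and birth date. Both failure modes matter when the output feeds a surveillance decision.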
One of the fundamental technical disagreements between these two articles pertains to the kinds of queries performed. Popp and Poindexter advocate pattern-based queries; for example, a standing query could scan for anyone who buys large quantities of fertilizer, fuel oil, and 55-gallon drums and then rents a truck. Such queries might find new, spontaneous terrorist groups, but they're also more likely to pick up people who have no evil intent, such as farmers. Jonas, meanwhile, argues that we should focus our limited resources using relationship information, for example by looking for individuals who have nonobvious relationships with other terrorists.
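The difference between the two query styles can be made concrete with a toy example. The sketch below is hypothetical throughout: the item names, the data layout, and both functions are invented for illustration, and the pattern query ignores quantities and the order of purchases for simplicity. Neither function represents the systems that either article describes.

```python
from collections import defaultdict

# Hypothetical pattern for the standing query described above.
PATTERN = {"fertilizer", "fuel oil", "55-gallon drum", "truck rental"}


def pattern_matches(transactions, pattern=PATTERN):
    """Return people whose transactions cover every item in the pattern."""
    seen = defaultdict(set)
    for person, item in transactions:
        if item in pattern:
            seen[person].add(item)
    return sorted(p for p, items in seen.items() if items == set(pattern))


def relationship_matches(links, known):
    """Return people directly linked to someone in the `known` set."""
    hits = set()
    for a, b in links:
        if a in known and b not in known:
            hits.add(b)
        elif b in known and a not in known:
            hits.add(a)
    return sorted(hits)


# Made-up transactions as (person, item) and links as (person, person).
transactions = [
    ("farmer", "fertilizer"), ("farmer", "fuel oil"),
    ("farmer", "55-gallon drum"), ("farmer", "truck rental"),
    ("person-x", "fertilizer"), ("person-x", "fuel oil"),
    ("person-x", "55-gallon drum"), ("person-x", "truck rental"),
    ("commuter", "fuel oil"),
]
links = [("person-x", "known-1"), ("farmer", "co-op"), ("commuter", "employer")]

by_pattern = pattern_matches(transactions)
by_relationship = relationship_matches(links, known={"known-1"})
print(by_pattern)
print(by_relationship)
```

In this toy data the pattern query flags both the farmer and person-x, while the relationship query flags only person-x. The relationship query, however, can find nothing unless at least one person of interest is already known.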
Although many of the intelligence techniques and operational details of the global war on terrorism must necessarily remain secret for them to be effective, we believe that it's both possible and necessary to have an informed academic debate on both the tools and appropriateness of data surveillance. If these techniques are effective, they could be tremendously beneficial to our society in the fight against terrorism, disease, and even economic inefficiency. But if they don't work, we need to know that, too—so that we can spend our limited resources on other approaches.
Simson L. Garfinkel is a postdoctoral fellow at Harvard's Center for Research on Computation and Society. His research interests include computer security, computer forensics, and privacy. Contact him at email@example.com.
Michael D. Smith is the Gordon McKay Professor of Computer Science and Electrical Engineering and the associate dean for computer science and engineering at Harvard's Division of Engineering and Applied Sciences. His research interests include dynamic optimization, machine-specific and profile-driven compilation, high-performance computer architecture, and practical applications of computer security.