May/June 2010 (Vol. 14, No. 3) pp. 3-5
1089-7801/10/$31.00 © 2010 IEEE
Published by the IEEE Computer Society
Sinking or Swimming in the Sea of Data
I recently attended IC's annual editorial board meeting, held this year at Hewlett-Packard Labs in Palo Alto. (I thank our two HP board members, Pankaj Mehra and Dejan Milojicic, as well as HP Labs as an organization, for their hospitality.) Although it's customary to have a speaker from outside the editorial board present a topic of interest, this year we had a tag-team arrangement: first, Rich Sorkin, Zeus Analytics' CEO, talked about his company's approach to using weather forecasting to predict effects on electricity generation, supplies, commodity markets, and so on; then, Amip Shah from HP Labs presented HP's sustainable data center project (www.hpl.hp.com/research/sustainability.html) and led a tour of the data center.
Both presentations spoke to the wonderful things that can be done with oodles of data and plenty of processing power and network bandwidth to handle it. Yet, I was struck at the same time by other news items about the purposes to which these data torrents would be put. Just because you can do something doesn't mean you should.
So, in honor of my trip to the Bay Area, I'll steal a play from Max's Opera Café (www.maxsworld.com/maxs), which alternates the phrases "This is a good place for a diet" and "This is a bad place for a diet" on its menu and website.
Why Having Lots of Data Is a Good Thing
I'll start by quoting from Rich's description of his presentation:
Weather forecasting touches the day-to-day lives of nearly every consumer and commercial enterprise on the planet. The importance of weather forecasting ranges from the mundane decisions of what to wear or how much time to allow to get to where you're going, to complex decisions including how much reserve electricity to produce in the event the wind dies (and the turbines stop spinning) or it's a hotter day than expected (and air conditioning use spikes).
This forecasting has evolved from art to science over the past 50 years, with improvements driven by better data observation, increased sophistication of physical atmosphere models, more powerful computers, and faster networks (see www.nasa.gov/worldbook/weather_worldbook.html for more). Ongoing improvements are crucial to renewable power supplies, energy efficiency, fuel efficiency in transportation, crop yields, disaster preparedness, and a host of other fundamental socially important issues.
Continuous processing of environmental data permits scientists to react more quickly and accurately to anomalies. Even so, we have a ways to go. The number of people killed or otherwise adversely affected by failing to evacuate New Orleans ahead of Hurricane Katrina demonstrates the desirability of having accurate, believable predictions: many of the people who didn't leave thought the weathermen were crying wolf because there had been dire warnings in the past that proved unfounded. Indeed, the recent enormous earthquake in Chile triggered tsunami warnings in Hawaii, which most people took very seriously, only to find that the tsunami was a nonevent. How many of them will evacuate next time?
HP Labs' work on reducing energy consumption in data centers is similar in that ongoing sensor readings lead to predictions and reactions in real time. We were led into a machine room with hundreds of computers and disks. Traditionally, such a room would be cooled throughout by some number of air conditioning units all working in tandem, and the overall room temperature would be kept low to ensure that no single system overheated. But in the room we saw, individual thermometers sampled the temperatures across the room, and the cooling was adjusted continuously to keep a given area only as cool as it needed to be to prevent failures. Because computers tend to run hotter when they're active than when idle, the temperature-control system can react when several computers in the data center get busy and heat up. Additionally, the system scheduler can take into account the system's environmental state when scheduling new jobs in the first place.
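To make the idea concrete, here is a minimal sketch of per-zone cooling and temperature-aware job placement. It is not HP Labs' implementation: the `Zone` type, the zone names, the target temperature, and the gain are all hypothetical values picked for illustration, and a simple proportional rule stands in for whatever control logic the real system uses.

```python
from dataclasses import dataclass


@dataclass
class Zone:
    name: str
    temp_c: float          # latest sensor reading for this zone
    cooling: float = 0.5   # cooling effort, 0.0 (off) to 1.0 (maximum)


# Hypothetical numbers chosen only for illustration.
TARGET_C = 27.0   # warmest temperature we are willing to hold a zone at
GAIN = 0.1        # change in cooling effort per degree of error


def adjust_cooling(zone: Zone) -> float:
    """Proportional step: cool harder above the target, ease off below it."""
    error = zone.temp_c - TARGET_C
    zone.cooling = min(1.0, max(0.0, zone.cooling + GAIN * error))
    return zone.cooling


def pick_zone_for_job(zones: list[Zone]) -> Zone:
    """Temperature-aware placement: send new work to the coolest zone."""
    return min(zones, key=lambda z: z.temp_c)


zones = [
    Zone("rack-row-A", 31.0),  # busy and hot
    Zone("rack-row-B", 24.0),  # idle and cool
    Zone("rack-row-C", 27.0),  # exactly on target
]
for z in zones:
    adjust_cooling(z)

chosen = pick_zone_for_job(zones)
print({z.name: round(z.cooling, 2) for z in zones}, "next job ->", chosen.name)
```

In this toy run, the hot zone gets more cooling, the cool zone gets less, and the next job lands where the room is coolest, so only the areas that need cooling receive it.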
Considering that Greenpeace recently raised objections to cloud computing complexes that consume lots of coal-generated energy (see, for instance, www.roughtype.com/archives/2010/03/greenpeace_in_t.php as well as Greg Goth's article in the Mar./Apr. 2010 IC), energy-aware data centers are arriving just in the nick of time. The temperature control itself doesn't involve the data torrents we see in such applications as weather forecasting; it's just a trickle of sensor data by comparison. But the whole notion of cloud computing is the "sea of data," and we must be responsible in how we move to it.
Why Having Lots of Data Is a Bad Thing
This column's subject matter was prompted in part by an article in The New York Times by Catherine Rampell (22 March 2010; www.nytimes.com/2010/03/22/business/22tax.html, with a companion blog post at http://economix.blogs.nytimes.com/2010/03/22/taxes-taxes-everywhere) about how states want to tax nonresidents on income earned in the state, even on a single day.
She says, for example,
If you live in Boston but spend one out of 250 workdays this year in New York, you owe New York income taxes on 1/250th of your salary …. The states, for their part, say better techniques for tracking tax deadbeats, not pressure to fill their budget holes, have prompted them to become more vigorous at enforcing the provision.
In other words, a tax that used to be enforced by scouring the sports pages for news about athletes coming to the state and earning many thousands of dollars in a single day could now be enforced by looking at such things as passenger manifests and corporate reimbursements:
In some cases, auditors check to see if, say, an employee who was reimbursed for airfare to California also had California income taxes withheld from his paycheck. If not, the company can be fined.
In Rampell's article, an accountant compares this to Orwell's famous book, 1984. Although I wouldn't take the comparison that far, I do find it scary that states are holding individuals responsible for paying taxes to a large number of jurisdictions. For the past several years, I lived and worked in different states and had to file two state tax returns every year, which costs time and money. As I filed this year's return, I was thankful that I would return to dealing with a single state for the foreseeable future. And now this! But there might be hope, as Rampell's blog post states:
This mess of (occasionally conflicting) state laws can be confusing, not to mention expensive, for companies and their employees to comply with.
That is why some big corporations have thrown their weight behind the Mobile Workforce State Income Tax Fairness and Simplification Act [www.cost.org/WorkArea/DownloadAsset.aspx?id=73302]. The bill, which has been introduced twice in recent Congresses, would make state nonresident income tax laws uniform, and impose a 30-day minimum threshold before employers would be required to withhold taxes for that state.
The bill's backers seem to be missing something, though: the blog post also points out that the number of days before an employer must withhold taxes can differ from the number of days before an employee owes them. Even if the employer doesn't withhold these other taxes, there are states in which an employee is liable from the first day; any extended threshold must address employees' liability as well.
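The day-count arithmetic in Rampell's example, and the gap between owing and withholding, can be sketched in a few lines. This is only an illustration: the function names and the salary are hypothetical, the result is the income a state could claim rather than the tax due (which depends on that state's rates), and treating the 30-day threshold as "more than 30 days" is an assumption, since the article doesn't say how the boundary day counts.

```python
def apportioned_income(salary: float, days_in_state: int, workdays: int = 250) -> float:
    """Share of salary attributed to a state by counting workdays spent there."""
    if not 0 <= days_in_state <= workdays:
        raise ValueError("days_in_state must be between 0 and workdays")
    return salary * days_in_state / workdays


def employee_owes(days_in_state: int, liability_threshold_days: int = 0) -> bool:
    """Default of 0 models a state where liability starts on the first day."""
    return days_in_state > liability_threshold_days


def employer_must_withhold(days_in_state: int, threshold_days: int = 30) -> bool:
    """Withholding duty under a day-count threshold (30 days in the bill as described).

    Assumption: the duty starts once the threshold is exceeded.
    """
    return days_in_state > threshold_days


salary = 100_000.0  # hypothetical annual salary
one_day = apportioned_income(salary, 1)

print(f"Income attributed to one day out of 250: ${one_day:,.2f}")
print("Employee owes after 1 day:", employee_owes(1))
print("Employer must withhold after 1 day:", employer_must_withhold(1))
```

For a single day in the state, the employee owes tax on 1/250th of the salary while the employer has no duty to withhold, which is exactly the gap a withholding-only threshold leaves open.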
Note to governments far and wide: just because technology enables something doesn't mean it's a good idea!
Data at Internet Scale
I'll give one more quick example of why lots of data can be a bad thing. A few years ago, AOL released "anonymized" Web search logs for researchers' use. It took only a short time to demonstrate that a set of queries by an "anonymous" user could be mapped to a real person (www.nytimes.com/2006/08/09/technology/09aol.html). Not much later, researchers did something similar for Netflix data (www.securityfocus.com/news/11497); although Netflix planned to go ahead with a new contest using real user data, it recently canceled the contest due to privacy concerns.
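A toy sketch shows why replacing names with ID numbers isn't enough. It is not the method used in either case above, and every record, name, and field in it is invented: the point is only that queries grouped under one pseudonymous ID can carry enough clues (here, a town and a surname) to narrow a public directory down to one person.

```python
# Every record below is invented for illustration.
search_log = [
    ("user-001", "landscapers in springfield"),
    ("user-001", "homes sold in maple heights subdivision"),
    ("user-001", "doe family reunion"),
    ("user-002", "cheap flights"),
]

public_directory = [
    {"name": "A. Doe", "town": "springfield", "surname": "doe"},
    {"name": "B. Doe", "town": "rivertown", "surname": "doe"},
    {"name": "C. Roe", "town": "springfield", "surname": "roe"},
]


def queries_for(user_id, log):
    """All queries recorded under one pseudonymous ID."""
    return [query for uid, query in log if uid == user_id]


def candidates(user_id, log, directory):
    """Directory entries whose town and surname both appear in the user's queries."""
    words = set(" ".join(queries_for(user_id, log)).split())
    return [
        person["name"]
        for person in directory
        if person["town"] in words and person["surname"] in words
    ]


matches = candidates("user-001", search_log, public_directory)
print(matches)
```

No single query identifies anyone, but taken together the queries under `user-001` leave one candidate, while `user-002`, with a single generic query, matches no one.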
Obviously, countless other examples of enormous data streams and repositories exist; IC has covered them from time to time, as in the Nov./Dec. 2008 issue on data stream management, and we're planning a new theme issue on Internet-scale data management in early 2012. Many challenges are ahead, including performance, privacy, and something that might be in the shortest supply of all: wisdom.
Selected CS articles and columns are also available for free at http://ComputingNow.computer.org.
The opinions expressed in this column are my personal opinions. I speak neither for my employer nor for IEEE Internet Computing in this regard, and any errors or omissions are my own. I appreciate helpful comments from Barry Leiba and Misha Rabinovich, and our meeting speakers, Rich Sorkin and Amip Shah.