George Washington University
Pages: 6-8
Abstract—As technology has changed, the fundamental way of extracting meaning from data has also changed.
We're living in a time that discourages wholesale change. I know many people who are trying to squeeze an extra year or perhaps two from a laptop of uncertain age or a server that should rightly be retired. The optimistic promises of Moore's law, which traditionally offered substantial improvements for a very modest expense, don't seem as enticing as they once did. Rather than bear the disruption of a change, many seem to be delaying the process of upgrading machines, installing new software, and converting large data files.
Problems with data conversion seem to be of increasing concern. As one anecdotal example, I recently saw my friend Darrow take an old PDA from his pocket to use in making a note about something we'd been discussing. The device is easily a decade old and can no longer be replaced on the new equipment market. Its operating system has been invalidated by a lawsuit and surpassed by other programs. At one point, Darrow noticed me looking at the device and commented, more apologetically than necessary, that the little computer did everything he needed it to do, so he felt no need to replace it.
Our field has regularly employed mockery and scorn to force people to embrace new hardware or software. I recall a recent meeting, not that long ago, in which a division leader shamelessly derided a group of staff members who clung to an older set of productivity tools. In making his case, he suggested that Steve Jobs would be disappointed if they didn't change their ways and openly invited others who were present to shun the holdouts. To this day, I don't understand why he felt compelled to ridicule those who were using some slightly outmoded programs, but I saw that he was willing to push those employees very hard to upgrade their systems.
There are, of course, many reasons to resist changes. New machines don't always work as advertised, and new programs often are incompatible with older versions. Yet, we're rapidly reaching a point where our plans for the future are dictated less by hardware and software and more by data. We've expended a great deal of effort to make data portable and to allow it to be processed by many different software programs. Yet, we haven't taken the same care to make sure that the lessons from that data can transfer easily from system to system.
Data now personalizes almost every piece of software that we use. My word processor knows the words I'm likely to misspell, my browser knows the kind of sites I like to visit, and my car knows how I tend to approach traffic problems. In perhaps the crudest example, I have about 30 Gbytes of research data, all carefully cross-referenced and linked by the themes I've found in it. I'm concerned that my insights will be lost when I'm ultimately forced to move to a new platform. When I purchased a new computer in celebration of completing my first book a little more than eight years ago, I lost the collection of ideas behind that work. Hence, I'm approaching my next upgrade with great caution.
The problems of upgrading data concern not merely structure and format. If those were the only obstacles, we should easily be able to create some kind of computer program that manages the inevitable transition from the old world to the new. But the problem isn't merely technological. Technology has given us new ways of looking at data; hence, as technology has changed, the fundamental way of extracting meaning from data has also changed.
Punched card tabulators, for example, not only taught us how to deal with large datasets, but also raised fundamental questions about how these datasets might fail to reveal important information. Using punched card tabulators, pollster George Gallup showed that small sets of data, if properly chosen, could do a better job of capturing information than a large set. Others have shown with high-speed data connections how the actions of our distant neighbors shape our immediate lives. Still others have used clever algorithms to find information thought to be forever hidden in deep layers of noise, the forgotten voice that was overwhelmed by the scratches on a recording.
Each of these advances has required us to give up an old way of looking at data. Gallup faced substantial opposition from critics who believed that he simply wasn't talking to enough people when he gathered data about public opinion. He had to make the case that it was more important to manage the information that could be derived from a small set of data than to compile a large dataset that would be unmanageable. The resistance that Gallup experienced has been repeated as we've embraced new tools for studying data, such as rendering, learning, and data-mining algorithms.
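Gallup's argument can be made concrete with a small simulation. The numbers below are invented for illustration (they are not Gallup's actual figures): a population splits 55/45 on some question, a very large sample is drawn from a skewed slice of that population, and a much smaller sample is drawn properly at random.

```python
import random

random.seed(42)

# Hypothetical population: 55% hold opinion A (1), 45% opinion B (0).
population = [1] * 550_000 + [0] * 450_000
true_share = sum(population) / len(population)

# Large but biased sample: 200,000 responses drawn only from a skewed
# slice of the population (think of polling only telephone owners).
biased = random.sample(population[:600_000], 200_000)

# Small but properly randomized sample, Gallup-style: 1,500 responses.
small = random.sample(population, 1_500)

biased_share = sum(biased) / len(biased)
small_share = sum(small) / len(small)

print(f"true share:          {true_share:.3f}")
print(f"biased 200k sample:  {biased_share:.3f}")
print(f"random 1.5k sample:  {small_share:.3f}")
```

Run repeatedly with different seeds, the small random sample lands within a few points of the true share, while the large biased one misses badly, which is exactly the point Gallup had to argue against his critics.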
Slightly more than a decade ago, a punch to the chest taught me that new methods of extracting meaning from data could produce physical resistance. The punch was two-fingered but landed right on my sternum. The blow wasn't enough to do substantial damage, but it stung enough to tell me that the person who delivered it was upset and felt that he was being ignored.
The confrontation occurred at an early data-mining conference. I had raised a little money and had recruited one of the pioneers in the field to talk about his methodology to a group of selected statisticians, policymakers, and computer scientists. The algorithms he developed implemented ideas that are common to the field: they processed large multivariable datasets and searched for relationships within the data. However, while the algorithms resembled traditional statistical methods, they completely ignored both the field's philosophy and its probability models.
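The flavor of algorithm described above can be sketched in a few lines. Everything here is a made-up miniature (the variable names x0 through x5, the sample size, and the planted relationship are all assumptions for illustration): it scores every pair of variables and keeps the strongest relationships, with no hypothesis, no probability model, and no test of significance.

```python
import itertools
import math
import random

random.seed(0)

# Hypothetical dataset: 200 records, 6 variables. One pair (x0, x5)
# is deliberately linked; the rest are independent noise.
n = 200
data = {f"x{i}": [random.gauss(0, 1) for _ in range(n)] for i in range(5)}
data["x5"] = [v + random.gauss(0, 0.3) for v in data["x0"]]

def pearson(xs, ys):
    """Sample correlation coefficient, computed from scratch."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Score every pair of variables and rank by strength of relationship.
# No model is formulated and nothing is tested -- the machine simply
# calculates every equation and reports what it finds.
pairs = sorted(
    ((a, b, pearson(data[a], data[b]))
     for a, b in itertools.combinations(data, 2)),
    key=lambda t: -abs(t[2]),
)
strongest = pairs[0]
print(strongest)
```

The sketch does find the planted relationship, but it would just as happily report a spurious one; that indifference to probability models is precisely what provoked the objection described next.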
It was that disregard of traditional statistical models that instigated the punch. The person involved was a senior statistician named Harvey. I had only known him for a year or two, but I thought of him as a sweet, grandfatherly sort. I had invited him to the meeting because I thought he might be interested in new ideas.
Initially, Harvey was quite jovial and appeared to enjoy meeting those who were attending the conference. However, as the talks progressed, he became increasingly restless. He twisted in his chair and made comments to a neighbor that could be clearly heard across the room. Finally, he approached me at the afternoon break, clearly quite angry.
"This is all wrong," he said. "This is the most dangerous thing you could have done."
Although I was grateful to learn something about the cause of his anxiety, I wasn't particularly interested in listening to him. I was running the meeting and had other things to do. As I recall, we were in the process of changing the venue for dinner, having just discovered that the place where we were to dine had been closed because the owner had been arrested for running guns to an ethnic conflict on the other side of the globe. Although I'll admit that it was clearly intended to make him go away, I said something that I thought was respectful.
"You've got to understand," he said, "this data mining isn't science. It's mindless calculation. In real science, you have to formulate a hypothesis, build a model, gather data, and test the results. You don't just calculate every equation that comes into your mind or the mind of your computer. This undermines everything we've done for 40 years."
If Harvey had stopped at this point or I hadn't been so dismissive, we might have had a fair fight. Karl Popper and Bruno Latour would have been escorted into the center ring and had a contest worthy of the ages. However, this wasn't a conflict about ideas. It dealt with the validation of a career. Although Harvey had an office and was involved in research, he was no longer the leader that he once had been.
Harvey's voice rose again. I moved away. The punch landed—two fingers on the sternum.
It was nothing more than a burst of emotion. His anger subsided instantly. Apologies were made. Forgiveness was offered and accepted. However, Harvey opted not to remain for the rest of the seminar.
Even without attending the rest of the talks, Harvey had gauged the impact of data mining better than most of the people who were at the seminar. No one seemed to have witnessed Harvey's punch or, if they did, they were too polite to mention it. At the dinner, they talked about the nature of data-mining methods, the kinds of structures they were capable of detecting, and the class of problems that might benefit from these methods.
I recall that the statisticians talked about how data mining would need to be pulled within the statistical framework. They would need to develop traditional statistical models, articulate the correct notion of probability, and make proper statistical inferences. At the end of the night, I paid the tab and watched the group disperse into the night. The statisticians departed in one direction and the computer scientists in another.
Most commentators have noted that data mining and statistical methods both deal with the analysis of data, but the two have come to occupy independent spheres. "In this work," wrote one of the conference attendees, "there is little probability modeling of random variation in the data." He made some effort to connect data mining to the work of an early scholar whose writings were so broad that they encompassed many scientific fields. He argued that the kinds of problems data mining was addressing had once been the domain of statisticians but that the "center of gravity of research" had shifted to computer science.
Data mining is only one of the new ways of learning lessons from data. We now deploy a host of learning algorithms to count clicks, track motion, compare transactions, and otherwise delve into the symbolic representations of our lives. When one algorithm has run its course or revealed its limitations, we dispose of it and deploy a new one in its place. Once the replacement is completed, we rarely reuse the lessons of the past. All we can do is to feed this new system a set of data and let it learn afresh. We can't transfer lessons learned. We have no common architecture, as we do with hardware, to mitigate the cycles of change.
In describing how automobiles dealt with cycles of change, industry pioneer Alfred P. Sloan noted that the "necessity of change forced us into regularizing change." The auto industry learned to manage change by updating its designs on an annual basis, incorporating its work into a new model. At the moment, however, we have no such concept for data, especially for data that is intended to capture the behavior of individuals and organizations.
Many who deal with large-scale data believe that we're in the fourth stage of the Information Age. The first stage belonged to hardware, the second to software, and the third to communications. This fourth stage in their hierarchy belongs to data.
Recently, I heard a presentation arguing that we can ignore not only the probability models of statistics but also the very concept of causality. Large datasets will give us so much information about phenomena, explained the speaker, that we'll be able to make judgments from the relations that our algorithms find. We'll just accept them as true.
It was a lesson in rational positivism that clearly resonated with those present. Individual after individual rose to express agreement with the presenter's comments, to note how quickly we're finding new ways of processing data, and to predict a great future. It was the end of the day, a time to put cares aside. They had spent the morning and afternoon discussing the problems of how to gather data, how to remove anomalous elements, and how to learn from it. At some point, they'll need to address the problems that come from the habit of change. We're still inclined to abandon the old as we take up the new, but we may eventually find that such a habit needs to be put aside.