Somehow time and place got lost. During the last month, I promised myself that I would write my next column for the China Computer Federation as I shuttled from one IEEE meeting to another. That plan got lost in the shuttle from one airport to another. I hope that I am forgiven for the lapse, as movement seems to be one of the themes of this season. Perhaps, in the process, I will learn what needs to be moved and what does not.
Over the course of my travels, I have spent much of my time talking about the phenomenon known as big data. Both the business and research communities seem to have realized not only that we are collecting unprecedented amounts of data but that we are attempting to draw conclusions from it, and that very few people know how to draw such conclusions. Clearly there is money to be made and reputations to advance. People are looking for data scientists, a term that has only recently been coined, without knowing exactly what it means.
Both the statistical and data-mining communities have claimed that they are the true data scientists and that the world need look no further for such individuals. Their claims have some validity, but they quickly fall apart when we look at them closely.
The statistical community can trace its origins back to the offices that created the first national censuses. However, it developed most of its useful methodologies in the middle of the twentieth century. These methodologies were created to handle data sets that were sampled from larger populations or could be modeled with the same kind of probabilistic methods. Statisticians showed that they could draw interesting and accurate conclusions from relatively small collections of data. In many ways, they displaced the enumeration methods that might be considered the precursors of big data, which were little more than large uncontrolled censuses.
The data-mining community has a slightly stronger claim to being the true data scientists, but it has its origins in a slightly different problem. The tools of data mining came from problems in multivariate statistics. In the mid-1980s, scientists began using computer programs to explore relationships among quantities that were impossible to display on a two-dimensional sheet of paper. These started as graphical methods but quickly evolved into algorithms that would explore data as a human might.
Big data differs from the problems addressed by these two groups for a pair of reasons. First, big data tends to deal with data sets that are so large that the probabilistic models of statistics make little sense. How can you draw an inference from a statistical model when the variability of that model's parameters has shrunk to a number too small to be meaningfully represented on a computer?
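To make that point concrete with a small illustration of my own (the figures are arbitrary, not drawn from any particular study): the standard error of a sample mean is $\mathrm{SE}(\bar{x}) = \sigma / \sqrt{n}$, so with $\sigma = 1$ and $n = 10^{12}$ observations the standard error is $10^{-6}$. At that scale, the confidence interval around an estimate collapses to practically nothing, and every difference, however trivial, registers as statistically significant; the uncertainty that the model was built to quantify has all but vanished.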
Second, the problems of big data also involve data sets that are complicated, disjointed, and expensive to manipulate. They are difficult to combine into a simple rectangular form and feed to a data-mining algorithm. They exist on different systems in different places and are expensive to bring together. The work of gathering and maintaining these data sets is often greater than that of analyzing them.
Now, the cost of moving data brings us to the problems of travel and movement. At the end of last month, when I should have been writing my column for you, I was sitting in an airport with a friend. We were both waiting for planes to take us home. We had been in the same city to attend a meeting that had ended earlier than we had planned. Sitting in the airport, we were discussing our various projects for the coming year.
My colleague was working on a problem in cloud computing. He was designing a system that would make multiple machines look and act as a single computer. It was a clever idea, one that went well beyond the common ideas of parallel and distributed processing. He gave me a quick demonstration by partitioning his laptop into multiple virtual machines and then running his new software. It was a lovely demonstration, though you had to be able to interpret the stream of numbers that scrolled down the screen.
After the demonstration, our talk naturally turned to the applications of his idea, including big data. At one point, I commented that the bottleneck in his technology seemed to be the data channel. He could easily find himself in a circumstance where his machines would swamp the communications channels that connected them together. At that point, he gave me a look that clearly suggested that I had missed an obvious point. “We won’t have to move big data sets,” he said. “We’ll just move the process.”
His comment captured one of the obvious facts of big data. As we process larger and larger sets of data, we will find that these sets are the fixed geography of the virtual world. Data configuration is hard enough when you have to deal with relatively small sets. When you are dealing with large sets that require specific architectures or configurations, you have a much harder and more expensive problem. It will likely be much easier to keep data in one place and move relatively small blocks of code than to move entire data sets.
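As a rough illustration of what my colleague meant, here is a minimal sketch in Python. All of the names are hypothetical; in a real system the registry of operations would run on the storage node, and the client would send the name of an operation (or a small block of code) over the network and receive only a small result in return.

```python
# A minimal sketch of "move the process, not the data."
# Hypothetical names throughout; this is not any particular system's API.

BIG_DATASET = range(10_000_000)  # stand-in for a data set too large to ship

# Small, movable blocks of code that can be executed where the data lives.
OPERATIONS = {
    "count": lambda data: len(data),
    "total": lambda data: sum(data),
}

def run_at_data(op_name):
    """Execute a named operation on the storage node; return only the summary."""
    return OPERATIONS[op_name](BIG_DATASET)

if __name__ == "__main__":
    # What crosses the "network" is an operation name and a single number,
    # not the ten million underlying records.
    print(run_at_data("count"))  # 10000000
    print(run_at_data("total"))  # 49999995000000
```

The point of the sketch is simply that the code being moved is a few dozen bytes, while the data it runs against never leaves its home.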
Of course, the idea of moving processes, commonly called process migration, has a long history, a developed theory, and a substantial number of examples of working systems. The early work in the field can be traced back to the late 1970s, but these ideas were not widely used until less than a decade ago. There was, explained one review of the subject, no “compelling commercial argument for operating system vendors to support process migration.” It only became common when system operators realized that processes were easier to move than other parts of their systems.
Initially, many writers on process migration suggested that it could be used to balance loads across a set of physical computers. While that is certainly one way to address the problem, it may not be the best. It is often easier and safer to move virtual machines, which keeps the migrating processes safe within a familiar computing environment. However, when the immovable part of your work is a large data set, not a machine, you may find that the only reasonable solution is to move your tasks to a new environment, for there may be no other simple way to access the data that you need.
Over the past few months, I have learned a parallel lesson. It is often easier to move the Computer Society President to a new environment than to move that environment to the President. Time and again, I have traveled to a different part of the world and realized that I was able to glean information there that I could not by sitting in my office and reading reports on the Internet. When you visit a new area, you see the ideas that are immovable. You see the unique ways of organizing work, designing systems, engaging with customers, and fitting research into a laboratory. Only then do you identify the ideas that bind us all together and the ideas that are unique to each part of the world.
However, such travel has a price. From time to time, you lose the thread of communication that connects you back to your home. When you do, you miss activities, drop assignments, and forget about your responsibilities, just like a computing task that has migrated from its home environment. In the last round of travel, I lost track of my column for you. I hope that, in the process, I have come to better understand the ideas that unite computing—in China, in the US, or anywhere else—into a single field.
About David Alan Grier
David Alan Grier is a writer and scholar on computing technologies and was President of the IEEE Computer Society in 2013. He writes for Computer magazine. You can find videos of his writings at video.dagrier.net. He has served as editor in chief of IEEE Annals of the History of Computing, as chair of the Magazine Operations Committee and as an editorial board member of Computer. Grier formerly wrote the monthly column “The Known World.” He is an associate professor of science and technology policy at George Washington University in Washington, DC, with a particular interest in policy regarding digital technology and professional societies. He can be reached at grier@computer.org.