Issue No. 03 - May-June (2013 vol. 17)
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/MIC.2013.45
Peter Mika , Yahoo Labs
It was only a matter of time before a phrase as popular as "big data" turned into a conference. This year, we'll see the first big data conferences appearing, including one officially organized under the auspices of IEEE (www.ischool.drexel.edu/bigdata/bigdata2013/index.htm), this magazine's publisher. For us researchers, this means that we're once again facing the inevitable questions: Do we attend? Shall we submit a paper? And if we do, does it have to be something that isn't just original, but is also different from what we normally write about?
A Matter of Size?
I can't answer these questions without admitting that I'm confused about the motivation behind big data conferences: it's as though librarians started organizing a conference series on "big libraries." Some librarians do in fact work in libraries with many books; some don't. Similarly, we all work with data, but the size of our datasets is mostly a given. I work with data that's considered "big" by most standards, as do many of us in the Internet space, where data is abundant. At the same time, I hope that at such a conference, no one would have a paper rejected for having too small a dataset. The real imperative should be to request that authors use realistic datasets — that is, ones that might be small but that capture the essence of the problem at hand.
Size doesn't always make a qualitative difference either. Certainly, documented cases exist, such as automated translation, in which researchers have exploited abundant, cheap training data in a machine-learning setup to achieve a perceivable difference in outcomes. At the same time, an equal number of situations exist where we wish we had less (or at least cleaner) data, because adding the wrong data to a machine-learning problem only makes things worse once the machine starts to learn the noise in the data. The aim of machine learning is in fact to compute a model of the data that's much, much smaller in size than the original input.
Of course, you could say that those projects that aren't helped or hampered by large data are exactly the ones that don't belong at a big data conference. In practice, this can be difficult to determine. I often encounter papers that work with large datasets and implicitly claim that the results occurred because the authors have collected and successfully exploited a larger dataset. The justification is often missing, however. The papers either lack scale-down experiments or fail to demonstrate that scaling the data further up is the most economic way of improving result quality. I'm hopeful that reviewers at big data conferences will be specifically instructed to look out for these issues and test whether big data is in fact a key factor in the results.
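One way to picture the scale-down experiment mentioned above is the following minimal sketch. It trains the same learner on nested random subsets of the training data and reports quality at each size; if the curve is already flat at a fraction of the data, size isn't the key factor. Everything here is an illustrative assumption rather than anything from a particular paper: the function names (scale_down_curve, fit_centroids, accuracy, make_data), the synthetic one-dimensional data, and the toy nearest-centroid classifier that stands in for a real learner.

```python
import random


def scale_down_curve(train, test, fit, score, fractions=(0.01, 0.1, 1.0), seed=0):
    """Train on nested random subsets of `train` and score each model on `test`."""
    rng = random.Random(seed)
    shuffled = list(train)
    rng.shuffle(shuffled)
    curve = []
    for frac in sorted(fractions):
        n = max(1, int(len(shuffled) * frac))
        # Nested subsets: each smaller sample is a prefix of the larger ones.
        model = fit(shuffled[:n])
        curve.append((n, score(model, test)))
    return curve


def fit_centroids(examples):
    """Toy learner (assumption): one mean value per class label."""
    sums, counts = {}, {}
    for x, label in examples:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def accuracy(centroids, examples):
    """Fraction of examples whose nearest centroid carries the right label."""
    hits = 0
    for x, label in examples:
        predicted = min(centroids, key=lambda c: abs(x - centroids[c]))
        hits += predicted == label
    return hits / len(examples)


def make_data(n, seed):
    """Synthetic data (assumption): two classes, Gaussians centered at 0 and 2."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.choice((0, 1))
        data.append((rng.gauss(2.0 * label, 1.0), label))
    return data


train = make_data(10000, seed=1)
test = make_data(2000, seed=2)
curve = scale_down_curve(train, test, fit_centroids, accuracy)
for n, acc in curve:
    print(f"{n:>6} training examples -> accuracy {acc:.3f}")
```

A reviewer could ask for exactly this kind of table: the same evaluation at, say, 1, 10, and 100 percent of the data, so that the contribution of scale is visible rather than implied.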
A Matter of Efficiency?
Even if conferences disregard data size, we must certainly consider methods related to efficient data processing. Here, I'll make a detour to point out that big data has become inextricably interlinked with the notion of cloud computing, and the word "cloud" will no doubt appear in the title of many submissions. However, working at a company that runs some of the largest Hadoop clusters in the universe, I can attest to the fact that it's easy, and even tempting, to write highly inefficient code in MapReduce and kin. Reviewers will thus need to be on the lookout for cases in which researchers' only claim to big data is that they use a platform for large data processing.
That said, efficiency is clearly important when dealing with big data if you're on a budget time- or resource-wise. This isn't new, however. In my primary research field, the Semantic Web, dataset sizes have increased steadily over the years from millions to billions of triples, which has already triggered novel research on, for example, efficient indexing and query optimization, stream processing, and approximate methods for reasoning. Other fields I follow, such as information retrieval and databases, reached this stage much earlier and turned to such highly practical engineering issues as using GPUs instead of CPUs or combining in-memory and on-disk storage. In fact, one of the premier database conferences got its name from very large databases (VLDB) many, many years ago.
A Matter of Community?
Conceivably, a big data conference is thus a general computer science conference focused on efficiency. But should I submit my paper on efficient indexing of RDF data using MapReduce? Or should I take my other work on translating SPARQL queries to Pig? As they're written, I'm willing to admit that neither of these papers might be interesting enough to anyone outside the Semantic Web, a community that we can define as the set of people who are interested in the Semantic Web. If you're interested, you're likely already part of the community in the first place. Minimally, I would need to strip my papers of community-specific parlance and make them useful for a broader audience.
But what audience? Because no big data conference has taken place yet, we don't know. What's certain is that the challenge for both organizers and participants is that a big data conference should attract researchers from all fields that work with data. Examples of this already exist, although they're more targeted.
I came to know the Sunbelt conference series a few years ago (see www.insna.org/archives.html). Although strongly rooted in the social sciences, Sunbelt attracts researchers of all kinds who share the common trait of working with data that can be meaningfully represented as a graph. Researchers discuss notions that can be relevant independent of the actual data, such as centrality measures, strong versus weak ties, or structural holes. At Sunbelt, particular applications — such as estimating the Web's size or predicting the success of a student based on his or her classroom network — are typically described as they would be anywhere else, but it isn't necessary that you understand them to get something out of the conference. In fact, Sunbelt is more fun and instructive than many other domain-specific conferences I've been to.
Still, even graph-shaped big data is a more specific concept than big data in general. Although it's great for learning about methods, Sunbelt already faces the problem that comparing particular studies' actual outcomes is nearly impossible due to the variety of tasks and data. The conference largely sidesteps this problem by having no paper review process at all, only abstracts. (This also has to do with its grounding in social sciences, where studies often require multiple years to perform, but intermediate results are still interesting to discuss.) But how will we compare results at a big data conference? Will the conference be divided into tracks about graph data, relational data, tree data, and so on, or will it be divided by other means so that we can have clusters of work that are comparable along some other dimension?
A Matter of Logistics?
The final question is whether these conferences, being specifically about big datasets, will require that authors submit their datasets along with their papers. The more real a dataset is, the more likely that it contains sensitive information. But even if all the big datasets used in papers could be opened up by their owners, we lack the infrastructure for supporting publications around them. Much discussion is ongoing about new scientific publishing formats combining text and data (www.force11.org is a good starting point), but little practical progress has occurred. Minimally, we would need to start assigning global identifiers to datasets and develop a reference scheme that can link publications to the data that's been used. At the moment, it's close to impossible to find out which publications use the same standard dataset, for example, to find all published results on a particular Text-Retrieval Conference (TREC; http://trec.nist.gov) data collection.
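To make the idea of a reference scheme concrete, here is a minimal sketch of what linking publications to dataset identifiers would enable. It assumes that every publication declares the global identifiers of the datasets it used; the Publication record, the index_by_dataset function, and all identifiers shown are hypothetical placeholders, not an existing standard or service.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Publication:
    pub_id: str                 # hypothetical publication identifier
    dataset_ids: tuple = ()     # hypothetical global dataset identifiers


def index_by_dataset(publications):
    """Build an inverted index: dataset identifier -> set of publication ids."""
    index = defaultdict(set)
    for pub in publications:
        for dataset_id in pub.dataset_ids:
            index[dataset_id].add(pub.pub_id)
    return index


# Placeholder records; none of these identifiers refer to real works.
publications = [
    Publication("pub:0001", ("dataset:collection-A",)),
    Publication("pub:0002", ("dataset:collection-A", "dataset:crawl-B")),
    Publication("pub:0003", ("dataset:crawl-B",)),
]

index = index_by_dataset(publications)
print(sorted(index["dataset:collection-A"]))
```

With such declarations in place, the question "which papers report results on this collection?" becomes a simple lookup instead of a literature search.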
Furthermore, before requiring that researchers make these large datasets available, we would need new cloud-based archival functionality. Clearly, EasyChair (www.easychair.org) isn't designed to handle terabyte uploads. Large public datasets such as the CommonCrawl public Web crawl (www.commoncrawl.org) are currently shared through private cloud services that charge researchers for storing or processing the data, or both. As an example, the cost of doing a single run of data parsing and extraction from this 40-Tbyte dataset could be several hundred dollars. I'm a firm believer that such a research infrastructure could and should be financed by the publishing and conference-organization industries that monetize research and almost always turn a profit. I wish (big) data conferences would address these issues, but they seem likely to bypass the opportunity to introduce novel publishing formats or create and maintain the necessary infrastructure for publishing in the big data era.
What we know for now is that the calls for these conferences invite papers related to big data theory, big data infrastructure, big data search, big data mining, big data privacy, and so on. Given that this includes most if not all of computer science, I suspect that we'll have to wait for the final answers to all the questions I've posed: my prediction is that the people who come and the papers that researchers submit will eventually define what big data conferences stand for.
And at this point, dear reader, we have come full circle. Will you show up? What will you bring? Respond to me on Twitter using @pmika. Please use 140 characters or less. We're having a hard time keeping up with the Twitter fire hose.
Peter Mika is a senior research scientist at Yahoo Labs in Barcelona, where he works on the application of semantic technology to Web search. Mika has a PhD in computer science from Vrije Universiteit Amsterdam. Contact him at firstname.lastname@example.org.