Issue No. 01 - January/February (2012 vol. 16)
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/MIC.2012.10
Sam Madden , Massachusetts Institute of Technology
Maarten van Steen , VU University Amsterdam
In the past few years, many areas of computer science have been abuzz with talk about "big data." Although different people mean different things when they use this term, the consensus is that data volumes are exploding, and that new tools are needed to handle these volumes. EMC and IDC have released a series of white papers estimating that in 2009, there were 800 exabytes of digital data stored worldwide (1 exabyte = 1,000 petabytes = 1,000,000 terabytes; see www.emc.com/digital_universe). By 2020, they predict that this number will be 44 times larger, or 35 zettabytes. (According to blogger Paul McNamara, if all of humanity were to post continuously to Twitter, it would take one century before the total volume of tweets occupied 1 zettabyte! See www.networkworld.com/community/node/60897.) This flood of data is due in large part to the increased desire to collect, query, and mine logs, clickstreams, and other Internet-based data. This is increasingly true as companies look to monetize their data through targeted advertising, clickstream analysis, and customer behavior profiling.
Such data isn't just localized on a single machine but spread across many machines, which are often geographically distributed. This distribution arises as more and more datasets are available via the Internet, and as organizations themselves become larger and more geographically diverse. In addition, we're realizing the need to give users more and exclusive control over their own data, as witnessed by the privacy discussions that surround online social networks. Requiring distributed control on top of geographical distribution further complicates data management.
This demand for data processing has given rise to new technologies and techniques for Internet-scale data management. These include numerous "NoSQL" tools, designed to provide access to structured data without using a conventional relational database (often to address scalability or fault-tolerance concerns), as well as new relational systems aimed at operating at larger scales or higher rates than conventional databases. Additionally, researchers have developed several systems for accessing unstructured textual data, or heterogeneous and "semistructured" data spread across the Internet.
In this Issue
This special issue of IC presents five articles covering a range of technologies related to Internet-scale data management. Taken together, they give an exciting overview of recent developments in this space.
In their article "PNUTS in Flight: Web-Scale Data Serving at Yahoo," Adam Silberstein and his colleagues provide an overview of their PNUTS system, a new database system that's widely used inside Yahoo. It provides several features designed to target wide-area scalability, including the use of relaxed consistency to improve wide-area query performance, and a simplified, key-value-based query interface targeted toward the needs of many of Yahoo's Web applications. The article describes PNUTS' numerous advanced features, including notifications, materialized views, and bulk operations.
In "Querying Heterogeneous Datasets on the Linked Data Web: Challenges, Approaches, and Trends," André Freitas, Edward Curry, João Gabriel Oliveira, and Seán O'Riain describe some of the challenges that arise when trying to build a system to write structured queries (in a language such as SPARQL or SQL) over heterogeneous datasets spread across the Web. Key issues include semantic heterogeneity, in which related datasets are encoded using diverse schemas that aren't always easy to combine. This difficulty arises for several reasons. First, two datasets might reference a particular person or object (entity) using different labels (keys)—this is the so-called entity-resolution problem. Second, two datasets might encode the same type of data (such as a salary) under different names (for example, "earnings" or "income"). The article surveys various techniques that have been developed to address these challenges and summarizes their strengths and limitations.
In their article, "Multiterm Keyword Search in NoSQL Systems," Christian von der Weth and Anwitaman Datta describe a general methodology for adding keyword search to NoSQL systems (that is, structured data storage systems without a SQL interface). Specifically, for NoSQL systems based on a key-value store interface, they describe how to create an index that lets users look up keys or values that contain specific words, or sets of words. They demonstrate that their approach provides excellent performance and scalability.
"Scalable Execution of Continuous Aggregation Queries over Web Data," by Rajeev Gupta and Krithi Ramamritham, describes a set of problems related to continuously monitoring a collection of distributed data sources for some condition. As an example, imagine a user who wishes to be notified when some condition over a collection of stock prices or volumes has been satisfied. Such applications might not involve huge volumes of stored data but still require a scalable infrastructure that can handle very high data rates. Many methods exist for architecting a system that provides this type of interface, and the authors survey these architectures and provide a set of guidelines summarizing the best one to choose depending on different architectural parameters (for instance, the update rate of underlying data, or number of queries).
Finally, in "Enabling Web Services to Consume and Produce Large Datasets," Spiros Koulouzis, Reginald Cushing, Konstantinos Karasavvas, Adam Belloum, and Marian Bubak describe the challenges associated with transferring large datasets using conventional SOAP-based Web services. Specifically, SOAP doesn't deal well with large messages, and using XML increases message sizes and inhibits performance. As an alternative, the authors propose a solution that uses SOAP messages to coordinate data transfer, while actually transferring large files and data streams using an out-of-band communication channel that better handles large volumes. They show that their design provides substantially better performance than conventional SOAP.
The Road Ahead
Viewed as a whole, these articles provide a snapshot of the exciting and diverse research happening in the area of Internet-scale data management. However, they're just a sampling of the broad challenges that Internet-scale data presents. The future of data management poses several fascinating questions.
How can recent advances in probabilistic models and statistical machine learning best be integrated into large-scale data management systems? Increasingly, users want to do more than simply query or extract simple statistics about their data, looking instead to fit functions, detect outliers, and make predictions on the data they store. Figuring out how these integrate with data management systems is nontrivial, especially as more sophisticated models were originally designed to operate on datasets that fit into a single machine's RAM.
Is there a better language than SQL for data access on the Internet? What is it? Ideally, the language chosen would be able to express rich computation over data—such as the statistical models described in the previous paragraph— in a more natural way than SQL-based user-defined functions, while preserving SQL's ability to parallelize naturally over a very large number of processors.
What is the role of analytic SQL-based databases vis-à-vis MapReduce? Several lively debates have occurred on this topic in blogs over the past few years, and both classes of systems are quite widely used. The widespread adoption of SQL-like interfaces such as Hive and Pig on top of MapReduce suggests that many users prefer a declarative, high-level language. Increasingly, these tools' authors are adopting techniques from the relational database community for query optimization, materialized views, and so on. Furthermore, database vendors are rapidly adding support to both ingest data from and output data to HDFS (the Hadoop MapReduce file system). Both trends suggest that these systems might converge, or at least learn to peacefully co-exist.
How can we provide end users with control of their data? As noted, concern is increasing that large quantities of information about individuals is in the hands of a few very large organizations. What's needed are simple mechanisms that give "data back to the people," yet at the same time allow that data to be used for their owners' benefit. For example, building good recommendation systems requires gathering and processing a broad range of data from different people to optimally serve individuals.
As multisensor devices such as smart phones become commonplace, how can we best process the data coming from them? Substantial research and development efforts are under way that will let us continuously monitor the whereabouts of people, thus generating self-tracking data (for example, detailed activity measurements), data on location and movement (estimating road traffic from phone position data), and so on. Likewise, enough sensory input will soon exist to let us accurately derive actual social ties. The processing of continuous data streams presents a new field of research in which large-scale data management plays a crucial role.
Indeed, the Internet already introduces a very diverse collection of applications and associated data, including microblogs such as Twitter, sensor and location data from mobile phones, and human-supplied data from crowdsourcing systems such as Amazon's Mechanical Turk. Streaming, storing, querying, and mining this data introduces yet more new challenges that will continue to drive research and innovation in data management.
Selected CS articles and columns are also available for free at http://ComputingNow.computer.org.
Sam Madden is an associate professor at the Massachusetts Institute of Technology. His research interests include databases, mobile computing, sensor networks, and distributed systems. Madden has a PhD in computer science from the University of California, Berkeley. Contact him at firstname.lastname@example.org.
Maarten van Steen is a full professor at VU University Amsterdam. His research interests include large-scale distributed systems and complex networks. Van Steen has a PhD in computer science from Leiden University. He's an associate editor in chief for IEEE Internet Computing and a member of IEEE and the ACM. Contact him at email@example.com.