Industry reports indicate that digitally generated data is doubling roughly every year. This isn’t an evolution but rather a disruptive change that presents a new set of opportunities and problems to the storage industry and researchers. We’re already experiencing the next paradigm in data processing as enterprises and individuals adapt how they store, manage, and consume this tsunami of data.
With the advent of large-scale, globally distributed application architectures, storage infrastructure must enable transactional continuity and nondisruptive operation. Application architectures are driving the separation of performance and capacity in storage infrastructure. Clear separation between data management and storage management is also the order of the day. In essence, the storage infrastructure for big data is becoming not only application-driven but application-optimized.
Storage infrastructure enables data collection, storage, protection, and delivery for consumption. These days, it’s crucial to balance efforts to minimize capital and operating costs for storage infrastructure against the storage mantra: “never lose data, and always serve it within the limits of user tolerance.” The emergence of storage-class memory and in-memory storage on the compute node is extending the storage tiers upward from disks to the highest levels of performance. Cloud storage-as-a-service is extending the tiers in the other direction from disks, toward the opposite end of the pragmatic trade-offs. The big challenge with these widened storage tiers is how to ensure efficient data access for applications.
Distribution of data access is a nontrivial issue. A significant component of storage spending is storing and moving duplicate copies of data. As data transitions to big data, the problem of managing copies grows, thereby driving storage-efficiency innovations. The most important innovation arising from the big data infrastructure will come from application awareness and application management needs.
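One familiar storage-efficiency technique for the duplicate-copy problem is content-based deduplication. The following is a minimal sketch, not a description of any particular product: the class name `DedupStore`, the fixed-size chunking, and the tiny chunk size are all assumptions chosen for illustration.

```python
import hashlib


class DedupStore:
    """Toy content-addressed store: identical chunks are kept only once."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size   # fixed-size chunking (an assumption)
        self.chunks = {}               # digest -> chunk bytes
        self.files = {}                # file name -> ordered list of digests

    def put(self, name, data):
        digests = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)  # store only if unseen
            digests.append(digest)
        self.files[name] = digests

    def get(self, name):
        # Reassemble the file from its chunk digests.
        return b"".join(self.chunks[d] for d in self.files[name])

    def logical_bytes(self):
        # Bytes the applications believe they have stored.
        return sum(len(self.chunks[d]) for ds in self.files.values() for d in ds)

    def physical_bytes(self):
        # Bytes actually held after duplicate chunks are collapsed.
        return sum(len(c) for c in self.chunks.values())


store = DedupStore(chunk_size=4)
store.put("report_v1", b"AAAABBBBCCCC")
store.put("report_v2", b"AAAABBBBDDDD")   # shares two chunks with report_v1
print(store.logical_bytes(), store.physical_bytes())
```

The gap between logical and physical bytes is the saving; in a real system the same digests could also be used to avoid moving chunks that the destination already holds.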
Data processing models are rapidly evolving beyond traditional relational database management systems (RDBMSs) and the more recent map-reduce paradigms. The near real-time needs of big data analytics are driving innovation in the data modeling and application architectures. Infrastructure architectures must evolve and adapt with this changing landscape of data models and application algorithms. This change presents myriad technical problems and research opportunities in evolving the underlying storage infrastructure.
Big Data Platform
The emergence of software-defined storage architectures has provided significant flexibility in storage management and operation. However, the evolution of storage infrastructure to effectively support the big data paradigm is far from over. Big data platforms must find the optimum balance among application continuity, data protection, access control, and processing performance. This balancing act will involve moving large blocks of data across the storage tiers of performance and capacity. Bear in mind that this data movement is bidirectional.
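To make the bidirectional movement concrete, here is a small sketch of a two-tier store that promotes frequently read blocks to a performance tier and demotes the coldest block when that tier is full. The class `TieredStore`, the read-count threshold, and the eviction rule are hypothetical choices for illustration; real tiering policies are considerably more elaborate.

```python
from collections import Counter


class TieredStore:
    """Two tiers: a small 'fast' tier and a large 'slow' (capacity) tier."""

    def __init__(self, fast_capacity, promote_after=3):
        self.fast_capacity = fast_capacity   # assumed to be at least 1
        self.promote_after = promote_after   # reads needed before promotion
        self.fast = {}
        self.slow = {}
        self.hits = Counter()

    def write(self, block_id, data):
        # New or updated data lands on the capacity tier.
        self.fast.pop(block_id, None)
        self.slow[block_id] = data

    def read(self, block_id):
        self.hits[block_id] += 1
        if block_id in self.fast:
            return self.fast[block_id]
        data = self.slow[block_id]
        if self.hits[block_id] >= self.promote_after:
            self._promote(block_id)
        return data

    def _promote(self, block_id):
        if len(self.fast) >= self.fast_capacity:
            # Demote the least-read block to make room (movement downward).
            coldest = min(self.fast, key=lambda b: self.hits[b])
            self.slow[coldest] = self.fast.pop(coldest)
            self.hits[coldest] = 0
        # Movement upward: the hot block joins the performance tier.
        self.fast[block_id] = self.slow.pop(block_id)


store = TieredStore(fast_capacity=1, promote_after=2)
store.write("a", b"alpha")
store.write("b", b"beta")
store.read("a"); store.read("a")   # "a" becomes hot and is promoted
store.read("b"); store.read("b")   # "b" is promoted, "a" is demoted
print(sorted(store.fast), sorted(store.slow))
```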
Near real-time processing requirements are forcing application developers to innovate in how they receive and process data elements. It’s essential to minimize the overhead of processing each data element without losing the transaction if the processing node goes down. These requirements are far tighter than those of traditional decision-support systems, which processed data in batches.
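One common way to avoid losing a transaction when a processing node fails is to record each incoming element durably before working on it, then replay anything unacknowledged after a restart. The sketch below assumes a single node with a local append-only log; the class `DurableQueue` and the record layout are hypothetical, and a production system would also replicate the log to survive loss of the node's own storage.

```python
import json
import os
import tempfile


class DurableQueue:
    """Append each event to a log file before it is processed."""

    def __init__(self, path):
        self.path = path

    def _append(self, record):
        with open(self.path, "a", encoding="utf-8") as log:
            log.write(json.dumps(record) + "\n")
            log.flush()
            os.fsync(log.fileno())   # force the record to stable storage

    def receive(self, event_id, payload):
        self._append({"type": "event", "id": event_id, "payload": payload})

    def acknowledge(self, event_id):
        self._append({"type": "ack", "id": event_id})

    def pending(self):
        # Events that were received but never acknowledged.
        events, acked = {}, set()
        if os.path.exists(self.path):
            with open(self.path, encoding="utf-8") as log:
                for line in log:
                    record = json.loads(line)
                    if record["type"] == "event":
                        events[record["id"]] = record["payload"]
                    else:
                        acked.add(record["id"])
        return {i: p for i, p in events.items() if i not in acked}


log_path = os.path.join(tempfile.mkdtemp(), "events.log")
node = DurableQueue(log_path)
node.receive("t1", {"amount": 10})
node.receive("t2", {"amount": 25})
node.acknowledge("t1")             # t1 was fully processed
# Simulated crash: the node is discarded before t2 is processed.
restarted = DurableQueue(log_path)
recovered = restarted.pending()    # only t2 needs to be replayed
print(recovered)
```

The `fsync` on every record is exactly the kind of per-element overhead the paragraph above warns about; batching several records per sync is the usual compromise between latency and durability.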
The POSIX (Portable Operating System Interface) file system abstraction has long been the basis of organizing data elements on the storage infrastructure. Distributed file system operations are now mature for large-scale, reliable operations. However, newer applications and associated data types are motivating application developers to explore different routes to tame the storage infrastructure, or to be oblivious to it, perhaps expecting magic. By itself, this isn’t a new desire; relational database architects have traditionally managed the underlying storage systems directly without the overhead of standardized file systems. Another example of such application-driven optimization is enabling Hadoop to run natively on Network File System (NFS)-based shared storage systems. With big data applications, we’re beginning to see the application middleware or the application itself directly drive such optimizations. Key-value stores, column-oriented databases, multimedia content files, document databases, and other new kinds of data modeling at larger scale, combined with newer programming paradigms, present new investigation opportunities.
In combination with application of the well-known CAP theorem, high-availability architectures at the infrastructure level enable transactional continuity: no user transaction is lost even while components of the system fail. At big data scale, the infrastructure is expected to face ongoing changes at the hardware level due to repair and maintenance operations. Delivering consistent, near real-time performance, even when parts of the infrastructure are failing, is an important property of such storage systems. Data security and encryption methods are now well developed at the storage systems level. At big data scale, the challenge shifts to effective trade-offs between access speed and access control.
Although big data signifies a paradigm shift in the way enterprises develop business insights, it isn’t practical to expect ongoing businesses to make an instantaneous switch. Therefore, hybrid architectures are the order of the day. These trends point to the creation of a data-fabric service that transports blocks of data efficiently across not only tiers of storage but storage clusters.
A good starting point for the solutions to come in the service model is Eli Collins’ “Big Data in the Public Cloud,” which explores how infrastructure consumption models are evolving in the context of applications. From there, Ganesh Chandra Deka presents various data modeling tools and paradigms for big data in his “Survey of Cloud Database Systems.” Together, these two articles set the context for how applications are evolving.
In “Storage Challenge: Where Will All That Big Data Go?,” Neal Leavitt gives an overview of the state-of-the-art in storage technologies and products, while indicating that the mix of storage options will look very different five years from now. Of course, these changes will not be without their challenges, and Yih-Farn Robin Chen’s “The Growing Pains of Cloud Storage” discusses the underlying issues in engineering large-scale storage systems designed to address capacity growth.
Xiaoxue Zhang and Feng Xu’s “Survey of Research on Big Data Storage” examines big data’s defining characteristics and associated challenges. To address such challenges, Fedi Gebara, H. Peter Hofstee, and Kevin Nowka argue in “Second-Generation Big Data Systems” for the need to incorporate appropriate structures to support multiple analytic methods on varied data types, as well as the ability to respond in near real-time.
“Toward Scalable Systems for Big Data Analytics: A Technology Tutorial,” by Han Hu and his colleagues, touches on all aspects of the big data paradigm and infrastructure. The highlight of this tutorial is the literature survey and its rich set of references.
Industry Perspective Video
Ion Stoica is a professor in the Electrical Engineering & Computer Sciences Department at the University of California, Berkeley. Since 2013, he has served as CEO of Databricks, a startup he cofounded to commercialize technologies for big data processing. In this short video, he talks about how users are demanding fast answers and discusses techniques for utilizing the infrastructure better.
In our second Industry Perspectives video, Brian Marshall, VP of Corporate Development at Hortonworks, discusses changes in the landscape of storage infrastructure. Marshall has spent the past 15 years conducting fundamental research on the technology sector from both a buy-side and sell-side perspective.
Trends indicate that the application layer will drive the next innovations in storage infrastructure. The storage infrastructure is delivered as a service via the self-service paradigm with sophisticated automation. Because business agility needs are motivating application development and operations workers (the DevOps model) to lead innovation in the datacenter, the storage infrastructure will evolve to work seamlessly with computing and networking.
People looking for opportunities to solve challenging problems in big data infrastructure must “look up” to the application layer. For example, the Storage Networking Industry Association (SNIA) organizes a dedicated committee focused on analytics and big data. SNIA provides a wealth of information on how the storage infrastructure industry is evolving to meet this challenge.
S. Nagarajan, “Is the Application Layer Fueling Innovation in Storage Infrastructure?,” Computing Now, vol. 8, no. 5, May 2015, IEEE Computer Society [online]; http://www.computer.org/publications/tech-news/computing-now/is-the-application-layer-fueling-innovation-in-storage-infrastructure.
Sundara Nagarajan is a technical director with NetApp, based in Bangalore, India. He’s also Computing Now’s regional liaison to IEEE Computer Society activities in India. Visit his LinkedIn profile at www.linkedin.com/in/nagarajan or contact him at s.nagarajan at computer dot org.