Distributed Cache Goes Mainstream
by George Lawton
Distributed-cache technology took a major step forward in June with the release of Microsoft AppFabric. The technology promises to help scale data-intensive applications, and it's part of a trend that other database vendors, including Oracle and IBM, are pursuing. "The big guys are discovering this technology, which is a short-term business opportunity," said Massimo Pezzini, a Gartner Research analyst. "But more importantly, it's an enabling technology for a variety of cloud applications."
The significance is not so much the novelty of a new feature but rather the advent of distributed caching in all major application server products. It means cloud application developers can expect to have the tools available on any platform to distribute application memory up or down as required.
Over the past couple of years, distributed-caching systems have become more important as alternatives or supplements to traditional databases. They combine the benefits of distributed systems and DRAM storage, making it easier to scale an application by more efficiently using the local memory of each new server added to a system. These benefits help reduce bottlenecks associated with writing or reading data in many applications. "You reduce access latency because the data is in memory," explained Matt Davey, a director at Lab49, an IT consultancy, "and you reduce contention because access works across machines."
Interest in the field has also increased over the past couple of years with the popularity of open source implementations, such as Memcached, which are widely used on websites such as Facebook to improve scaling and performance.
Long History
Distributed-caching research extends back to the mid-1980s and David Gelernter's work at Yale University on tuple-space computing models, said Pezzini. The technology started to make more sense about a decade ago with the Internet's rise. It was commercialized by pioneers such as ScaleOut Software, GigaSpaces, GemStone (acquired by VMware), and Tangosol (acquired by Oracle). In the meantime, several open source variants were developed to help improve website performance. In addition to Memcached, these included Ehcache, JBoss Cache, and Terracotta Server Array.
The large database vendors then began incorporating the technology into their portfolios. Oracle bought Tangosol, IBM developed WebSphere Extreme Computing (WXC), and Microsoft launched Project Velocity, which became AppFabric. Other vendors have also started developing or buying distributed-caching technology; for example, VMware bought GemStone in May.
David Brinker, ScaleOut Software's chief operating officer, said that three phenomena are driving distributed-caching technology: advances in networking and DRAM hardware, an uptick in data analysis applications, and the growth in cloud computing. One of its most promising applications is providing a mechanism for scaling applications up or down on demand in a way that efficiently uses the local storage on each server. Proponents believe this will make it easier to create applications that can automatically take advantage of cloud services.
Keeping Life Simple
One challenge is that distributed caching uses a programming model that differs from that of traditional databases. Vendors have hidden some of this complexity behind widely used programming APIs. "The developer doesn't need to know anything about the grid," Brinker said. "It's just as if you were writing to the memory on a single machine."
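To make Brinker's point concrete, here is a minimal Python sketch of what such an API could look like from the developer's side. It is not any vendor's actual interface: the class name PartitionedCache, the use of CRC32 to pick a node, and the plain dictionaries standing in for servers are all assumptions made for illustration.

```python
import zlib
from collections.abc import MutableMapping


class PartitionedCache(MutableMapping):
    """Hypothetical facade: looks like one dict, stores data across several nodes.

    Each inner dict stands in for the memory of one server in the grid.
    """

    def __init__(self, node_count):
        self._nodes = [dict() for _ in range(node_count)]

    def _node_for(self, key):
        # A stable hash, so the same key always lands on the same node.
        index = zlib.crc32(str(key).encode("utf-8")) % len(self._nodes)
        return self._nodes[index]

    def __getitem__(self, key):
        return self._node_for(key)[key]

    def __setitem__(self, key, value):
        self._node_for(key)[key] = value

    def __delitem__(self, key):
        del self._node_for(key)[key]

    def __iter__(self):
        for node in self._nodes:
            yield from node

    def __len__(self):
        return sum(len(node) for node in self._nodes)


# The calling code never mentions nodes or partitions.
cache = PartitionedCache(node_count=4)
cache["session:42"] = {"user": "alice"}
print(cache["session:42"]["user"])
```

The calling code reads and writes as it would with an ordinary in-memory map; the routing to a node happens behind the interface. A real product would add networking, replication, and failure handling, none of which is modeled here.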
Developers do need a good understanding of how to partition the data, which in turn requires a detailed understanding of how that data is accessed, noted Nati Shalom, chief technology officer at GigaSpaces. For example, if an application's data is partitioned by company but accessed mostly by employee, queries will be less efficient because a single lookup may have to touch many partitions.
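The cost of a mismatched partition key can be sketched in a few lines of Python. Everything here is hypothetical: the four-node grid, the record fields, and the helper names are invented for illustration, and the hash-based placement is only one of several ways a product might assign data to nodes.

```python
import zlib

NODE_COUNT = 4  # assumed grid size, for illustration only


def node_index(partition_key):
    # Stable hash so a given key always maps to the same node.
    return zlib.crc32(partition_key.encode("utf-8")) % NODE_COUNT


def build_partitions(records, key_field):
    """Place each record on a node chosen by the value of key_field."""
    partitions = [[] for _ in range(NODE_COUNT)]
    for record in records:
        partitions[node_index(record[key_field])].append(record)
    return partitions


def find_employee(partitions, employee_id, key_field):
    """Return (record, nodes_contacted) for a lookup by employee ID."""
    if key_field == "employee_id":
        # Lookup key matches the partition key: route to exactly one node.
        candidates = [node_index(employee_id)]
    else:
        # Partition key says nothing about the employee: ask every node.
        candidates = list(range(NODE_COUNT))
    matches = [
        record
        for index in candidates
        for record in partitions[index]
        if record["employee_id"] == employee_id
    ]
    return (matches[0] if matches else None), len(candidates)


records = [
    {"employee_id": "e1", "company": "acme"},
    {"employee_id": "e2", "company": "acme"},
    {"employee_id": "e3", "company": "globex"},
]

by_employee = build_partitions(records, "employee_id")
by_company = build_partitions(records, "company")

record_a, contacted_a = find_employee(by_employee, "e2", "employee_id")
record_b, contacted_b = find_employee(by_company, "e2", "company")
print(contacted_a, contacted_b)
```

Both lookups return the same record, but the company-partitioned layout has to contact every node to find it, while the employee-partitioned layout contacts one.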
There are also some nuances in how to get the data into the cache most efficiently, noted Davey. For example, it might make more sense to queue up stock market data and enter it into the cache in one larger file than to insert each new stock update individually.
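The queue-then-insert idea Davey describes might look like the following Python sketch. The CountingCache and BatchingWriter classes are hypothetical stand-ins, not part of any real caching product; the round-trip counter simply makes the saving visible.

```python
class CountingCache:
    """Hypothetical cache client that counts simulated network round trips."""

    def __init__(self):
        self.data = {}
        self.round_trips = 0

    def put(self, key, value):
        self.round_trips += 1  # one trip per individual insert
        self.data[key] = value

    def put_many(self, items):
        self.round_trips += 1  # one trip for the whole batch
        self.data.update(items)


class BatchingWriter:
    """Queues updates and writes them to the cache in batches."""

    def __init__(self, cache, batch_size):
        self.cache = cache
        self.batch_size = batch_size
        self.pending = {}

    def add(self, key, value):
        # A newer price for the same symbol replaces the queued one.
        self.pending[key] = value
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.pending:
            self.cache.put_many(self.pending)
            self.pending = {}


ticks = [("AAPL", 1), ("MSFT", 2), ("IBM", 3), ("ORCL", 4), ("AAPL", 5), ("MSFT", 6)]

cache = CountingCache()
writer = BatchingWriter(cache, batch_size=3)
for symbol, price in ticks:
    writer.add(symbol, price)
writer.flush()  # write anything still queued

print(cache.round_trips)
```

Six updates reach the cache in two batched writes rather than six individual ones. The trade-off is freshness: a queued update is not visible to readers until its batch is flushed, so the batch size has to suit how stale the application can tolerate the data being.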
Brinker identified three major generations of distributed-caching technology. The first is efficient at caching read-only data but lacks features for replicating data. The second generation has been optimized to improve the database performance with features such as high availability and better management tools. We're just now starting to see the third generation of solutions, which provide a distributed data fabric that moves applications to the data, rather than the other way around.
In 2010, the market for distributed-caching technology is about US$80-100 million worldwide, noted Pezzini, but vendors are more interested in it as an enabler for other products. "The innovation will continue not because it's a huge market," he said, "but because of the strategic element it brings to the cloud."
George Lawton is a freelance journalist based in Guerneville, CA. He can be reached via his website at http://glawton.com.