Improving Energy Efficiency and Exploiting Parallelism with Processing in Memory and Near-Data Processing
Guest Editors’ Introduction • Kevin Rudd and Richard Murphy • March 2017
Translations by Osvaldo Perez and Tiejun Huang
The Big Data era promises some significant advancements but also introduces many new challenges. The memory and processing power required for storing and analyzing that data will quickly surpass what our current infrastructure can provide. Massive data centers are expensive, and trends indicate that both demand and cost will continue to grow faster than Moore’s Law.
Memory is projected to consume an increasing fraction of total system power for a balanced exascale platform. Maya Gokhale of Lawrence Livermore National Laboratory has shown that a 64-bit integer operation costs roughly 1 picojoule (pJ), but that reading that data from DRAM (not even accounting for data-transfer costs) costs 16,000 pJ per bit. Additionally, Bill Dally of Stanford has estimated that a 32-bit floating-point operation requires 3.1 pJ, whereas the corresponding DRAM read requires 640 pJ.
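As a rough illustration, the ratios implied by these figures can be worked out directly. The short Python sketch below takes the quoted numbers at face value; the variable names are ours, and scaling the per-bit DRAM figure linearly across a 64-bit operand is an assumption made only for this example, not something the cited researchers state.

```python
# Energy figures as quoted in the text, in picojoules (pJ).
INT64_OP_PJ = 1.0                # 64-bit integer operation (Gokhale)
DRAM_READ_PJ_PER_BIT = 16_000.0  # DRAM read, per bit (Gokhale)
FP32_OP_PJ = 3.1                 # 32-bit floating-point operation (Dally)
DRAM_READ_FP32_PJ = 640.0        # DRAM read for that operation (Dally)


def energy_ratio(access_pj: float, compute_pj: float) -> float:
    """Return how many times more energy the memory access costs than the operation."""
    return access_pj / compute_pj


# Per-bit read cost compared with one integer operation.
gokhale_bit_ratio = energy_ratio(DRAM_READ_PJ_PER_BIT, INT64_OP_PJ)

# ASSUMPTION: the per-bit cost scales linearly over a full 64-bit operand.
gokhale_word_ratio = energy_ratio(DRAM_READ_PJ_PER_BIT * 64, INT64_OP_PJ)

# DRAM read compared with one 32-bit floating-point operation.
dally_ratio = energy_ratio(DRAM_READ_FP32_PJ, FP32_OP_PJ)

print(f"Gokhale, one bit read vs. one integer op:  {gokhale_bit_ratio:,.0f}x")
print(f"Gokhale, 64 bits read vs. one integer op:  {gokhale_word_ratio:,.0f}x")
print(f"Dally, DRAM read vs. one floating-point op: {dally_ratio:,.0f}x")
```

Whichever estimate is used, fetching an operand costs at least two orders of magnitude more energy than operating on it.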
Clearly, moving data to the processor (and back) is an unsustainable approach to computation; instead, why not move the processing to the data? This March 2017 Computing Now theme examines current research in processing in memory (PIM) and near-data processing (NDP), approaches that could become timely solutions to many problems. In addition to featuring seven recent articles on the topic, we've included sidebars with related articles and resources to provide a sense of how the field has developed over the past 50 years.
The idea of moving computation to the data is not new. Around 300 BCE, scholars came to study at the Library of Alexandria, where they had access to all books at once, rather than sending for one at a time. In the same way, PIM moves processing into the memory, so that it doesn’t have to fetch one datum at a time.
PIM has been tried many times throughout the Information Age:
- In the late 1960s, Harold Stone proposed logic in the cache, with both the processor and memory on the same module.
- In the 1980s, the Inmos Transputer had both the processor and memory on the same die.
- In the 1990s, Peter Kogge's EXECUBE had both the processor and memory on the same die.
- In the 2000s, Christoforos Kozyrakis's VIRAM vector processor was embedded in DRAM.
- In the 2010s, the AMD TOP-PIM had both the processor and memory on a 3D die stack.
These and many other implementations met with limited real-world success, struggling with high costs and limited memory capacities.
A Modern Approach
If combining processing and memory didn't succeed in the past, why should we continue to investigate the idea now? The simple answer is that today’s technology can overcome the problems of the past. We now have:
- much denser memory technologies and demonstrated 3D memory stacks;
- very efficient 2.5D assemblies and 3D stacks of heterogeneous die;
- new architectures and fabrication technologies to help balance capabilities and requirements; and
- better data movement capabilities and protocols.
Additionally, we've now hit the memory and power walls, and both Moore's Law and Dennard scaling, which bailed us out for many generations of computing systems, have broken down. We believe the time is right and the demand is strong for this approach.
In “Fine-Grained Task Migration for Graph Algorithms using Processing in Memory,” Paula Aguilera and her colleagues use the irregular access patterns of graph-based algorithms to motivate the need for 3D memory cube-based NDP. The article combines many of the core historic elements of classic PIM (including migrating work to the data) to accelerate graph algorithms in a modern heterogeneous design.
Motivated by the end of Dennard scaling, which has resulted in relatively flat single-threaded performance, “Practical Near-Data Processing for In-Memory Analytics Frameworks” proposes PIM-based systems for data analytics. Mingyu Gao, Grant Ayers, and Christos Kozyrakis explore coherency and synchronization issues. The article uses a broader set of applications than the previous article for motivation, examining MapReduce and deep neural networks in addition to graph problems.
“Near-DRAM Acceleration with Single-ISA Heterogeneous Processing in Standard Memory Modules” examines the potential for accelerating a unified architecture by moving execution units nearer to the standard DRAM, while maintaining a single ISA. Hadi Asghari-Moghaddam and his colleagues argue that their system consumes almost 65 percent less energy than the baseline system at nearly twice the performance. This article is reminiscent of Stone’s early work.
Ping Chi and her colleagues examine the potential for utilizing novel memory device technologies to implement emerging computational models in “PRIME: A Novel Processing-in-Memory Architecture for Neural Network Computation in ReRAM-Based Main Memory.” This article provides an interesting discussion of the potential trade-offs in implementing new approaches to computation with fabrication technologies optimized for memory.
“HAMLeT Architecture for Parallel Data Reorganization in Memory,” by Berkin Akin, Franz Franchetti, and James C. Hoe, explores the potential of using PIM to facilitate data reorganization while simultaneously maintaining memory system services for the host processor. This is an often-overlooked topic in PIM research, as memory systems are fundamentally data-movement engines.
“Transparent Offloading and Mapping (TOM): Enabling Programmer-Transparent Near-Data Processing in GPU Systems” describes a system that combines hardware and software mechanisms to transparently offload processing to PIM-based GPU accelerators in a 3D logic layer. Kevin Hsieh and his colleagues argue that TOM significantly improves performance compared to a baseline GPU system that can’t offload computation to 3D-stacked memories.
“HRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing” provides the first examination of using DRAM to implement reconfigurable logic, which yields a different operating point than traditional FPGA-based approaches. Mingyu Gao and Christos Kozyrakis describe how their method combines coarse- and fine-grained logic blocks, separates routing networks for data and control signals, and uses specialized units to effectively support branch operations and irregular data layouts in analytics workloads.
The end of Moore’s Law is sparking renewed interest in PIM and NDP approaches, as people turn to architecture to provide the scaling benefits that were previously achieved through transistor technology. We hope this Computing Now theme will inspire further research into these important energy-efficient solutions.
Kevin Rudd is a computer systems researcher at the Laboratory for Physical Sciences and an editorial board member of IEEE Micro and Computing Now. Contact him at firstname.lastname@example.org.
Richard Murphy is the director of Advanced Computing Solutions Pathfinding at Micron. His group focuses on R&D in computer architecture, memory systems, supercomputers, data analytics platforms at all scales, and disruptive mobile/embedded technologies, particularly PIM. He has led several large multidisciplinary teams to deploy new technologies. He also cofounded the Graph 500 benchmark, which has served as a catalyst to identify data-movement challenges in large-scale analytics problems. Murphy previously worked at Sandia National Laboratories, Sun Microsystems, and Qualcomm. He is an adjunct faculty member at the Georgia Institute of Technology and Boise State University, and has authored more than two dozen papers and patents. He holds a PhD in computer science and engineering from the University of Notre Dame and is a senior IEEE member. Visit his personal webpage at http://richardmurphy.net, or contact him at email@example.com.