Pages: pp. 30-31
Abstract—With exascale multicores, the question of how to efficiently support a shared memory model is of paramount importance. As programmers demand the convenience of coherent shared memory, ever-growing core counts place higher demands on memory subsystems, and increasing on-chip distances mean that interconnect delays exert a significant effect on memory access latencies.
Over the past decade, heat dissipation limits have halted the drive toward ever-higher core frequencies, but transistor density continues to grow, 1 and CPUs with eight or more cores are now common in the commodity- and server-class general-purpose processor markets. 2 To improve performance further and use available transistors more efficiently, architects are resorting to medium- and large-scale multicore processors, both in academia—Raw, 3 TRIPS, 4 and the Execution Migration Machine, 5 for example—and in industry—Tilera, 6,7 Intel TeraFLOPS, 8 and Intel Phi. 9 Pundits predict 1,000 or more cores within only a few years. 10
How will these massively multicore chips be programmed? A shared memory abstraction stands out as a sine qua non for general-purpose programming. Although architectures with restricted memory models, most notably GPUs, are extremely successful in specific applications such as rendering graphics, most programmers prefer a shared memory model, 11 and commercial general-purpose multicores support this abstraction in hardware. The main question, then, is how to efficiently provide coherent shared memory on the scale of hundreds—or thousands—of cores.
These cores will typically contain per-core L1 and L2 caches because cache power requirements grow quadratically with size; therefore, the only practical option for implementing a large cache is to physically distribute it on the chip so that every core is near some portion of the cache. 7,9
For programmability, these cores should present a unified addressing space, and, for efficiency, this space must be managed automatically at the hardware level. Cache coherence must be maintained at the L1 caches (and at L2 in case of private L2 caches) to support shared memory. Snooping cache coherence protocols, popular at small (such as four) core counts, are not viable in many-core architectures due to serialization overheads. Another difficulty arises from the fact that conventional bus-and-crossbar interconnects no longer scale due to bandwidth or area limitations. Many-core chip multiprocessors (CMPs), instead, tend toward a tiled architecture—where arrays of tiles are connected over a point-to-point on-chip interconnect.
On scales in which bus-based mechanisms fail, the traditional solution for hardware-shared memory is directory-based cache coherence, where a logically central directory coordinates sharing among the per-core caches, and each core cache must negotiate shared (read-only) or exclusive (read/write) access to each cache line via a coherence protocol. However, the use of directories poses its own challenges. Coherence traffic can be significant, which increases interconnect delay, congestion, and power usage. Additionally, the performance of applications can suffer due to long latency between directories and requestors, especially for shared read/write data; directory sizes must equal a significant portion of the combined size of the per-core caches, or else directory evictions will limit performance. 12
In this special issue, the first two articles address these directory-related challenges and provide novel techniques for improving power and performance of directory-based coherence.
In “The Impact of Dynamic Directories on Multicore Interconnects,” Matthew Schuchhardt and his colleagues recognize that a large fraction of on-chip traffic originates not from actual data transfers, but from intercore communication to maintain data coherence. They describe how placing directories near the data sharers eliminates a large fraction of on-chip interconnect traversals.
In “An Application-Tailored Approach to Hardware Cache Coherence,” Arrvindh Shriraman, Hongzhou Zhao, and Sandhya Dwarkadas propose tailoring coherence support to applications in many-core processors. They present a technique that reduces directory storage requirements by recognizing sharing patterns. Additionally, they present a protocol-level technique to exploit the spatial access granularity of an application and thereby avoid communication of unnecessary data.
In the third article, “Single-Cycle Multihop Asynchronous Repeated Traversal: A SMART Future for Reconfigurable On-Chip Networks,” Tushar Krishna and his colleagues directly address the latency issue of the on-chip interconnect. The SMART (single-cycle multihop asynchronous repeated traversal) on-chip network presents a single-cycle path all the way from the source to the destination, irrespective of the physical number of hops between them. SMART potentially reduces the latency of coherence traffic significantly.
The fourth article, “Toward Holistic Soft-Error-Resilient Shared-Memory Multicores” by Qingchuan Shi and Omer Khan, presents an error-resilient shared-memory multicore architecture. Their distributed redundancy control mechanism operates in concert with the coherence protocol to enable a deterministic rolled program state for redundant execution at per-core granularity. Additionally, Shi and Khan combine their technique with a resilient cache coherence protocol to trade off performance and energy for soft-error coverage. 13
There are, of course, other challenges in the design and implementation of many-core systems in addition to hardware shared memory—off-chip bandwidth requirements and on-chip power budgets, to name two. Enabling many-core shared memory will go a long way toward easing programmer burden, so the focus can be on application optimization to reduce bandwidth and energy requirements.