
Robust on-chip communication

Pradip, IBM T.J. Watson Research Center


The balance between computation and communication is a fundamental issue in application performance. As the chip industry becomes entrenched in multicore architectures, this issue will grow progressively more important—not only in terms of chip-level performance, but also in terms of power, temperature, and reliability. Reliable communication is essential to overall computational robustness. As power constraints, coupled with standard technology scaling, drive down the supply voltage, various types of "noise" start to have severe effects on robust operation. Transient errors begin to creep into all levels of the compute, store, and communicate paradigm. The solution for keeping error rates within acceptable margins may be one or both of

  • pervasive use of error-correcting codes and/or parity-based detection with instruction retry (recovery) mechanisms built into the hardware, or
  • operation at higher supply voltages to ensure acceptable margins of signal-to-noise ratio—especially in communication modes.
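As a concrete illustration of the first option, the sketch below shows single-error correction in the classic Hamming(7,4) style: three parity bits protect four data bits, and a recomputed parity "syndrome" at the receiver directly names the position of any single flipped bit. The function names and bit layout here are illustrative only, not drawn from any particular hardware design.

```python
def hamming74_encode(d):
    """Encode 4 data bits (list of 0/1) into a 7-bit codeword.
    Parity bits sit at 1-based positions 1, 2, and 4; each covers the
    positions whose index has the corresponding binary digit set."""
    d3, d5, d6, d7 = d
    p1 = d3 ^ d5 ^ d7
    p2 = d3 ^ d6 ^ d7
    p4 = d5 ^ d6 ^ d7
    return [p1, p2, d3, p4, d5, d6, d7]

def hamming74_correct(c):
    """Recompute the three parity checks to form a syndrome; a nonzero
    syndrome is the 1-based position of a single flipped bit, which is
    corrected in place. Returns the recovered 4 data bits."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        c[syndrome - 1] ^= 1        # flip the erroneous bit back
    return [c[2], c[4], c[5], c[6]]  # extract the data bits

# Any single bit flipped in transit is transparently repaired.
data = [1, 0, 1, 1]
codeword = hamming74_encode(data)
codeword[4] ^= 1                     # single-bit "noise" on the link
assert hamming74_correct(codeword) == data
```

A real on-chip link would use wider codes (for example, SECDED variants that also detect double-bit errors), but the cost trade-off the text describes is already visible here: three extra bits, and extra logic, per four bits of payload.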

In both cases, the price paid is at least extra power, if not extra power and area. Supply voltages above the nominal point for a given technology generation, or higher power or temperature, imply higher dissipation cost and lower lifetime reliability. Many on-chip failure mechanisms—such as temperature-sensitive time-dependent dielectric breakdown, electromigration, and negative-bias temperature instability—are strong functions of voltage (field), temperature, or both. In other words, straightforward solutions that guard against transient errors may hasten the onset of permanent failures and cost more in power dissipation.

Thus, solutions for continued robust computing, storage, and communication at the chip level in the late (deep-submicron) CMOS technology era must be carefully architected. This requires appropriate trade-offs among cost, performance, reliability, and power. The nature of these trade-offs will, of course, depend on the target market for a particular chip.

The multicore chip era of high-performance computing will naturally impose increasingly high demands on data bandwidth. While most of the stress will fall on chip I/O bandwidth, intrachip (intercore) communication bottlenecks will also arise. Hence, in view of the projected decrease in communication reliability, the architecture and design of error- and fault-tolerant on-chip interconnects will likely become a focal point once the more basic (compute-and-store) chip-level design constraints are overcome. This focus will sharpen if, or when, power and thermal attention shifts from the core processing units to the communication network—a shift that is arguably already visible as vendors strive to put more cores on the same die.

This special issue is devoted to high-performance on-chip interconnects. The guest editors' introduction explains how they put the issue together, with papers selected, after a stringent review process, from those submitted. I hope the guest editors' efforts succeed in drawing attention to this important aspect of current and future chip design, and that these articles will inspire other researchers and practicing engineers to learn more about the latest developments in the field and to pursue new ideas and methods that advance the state of the art.
