Guest Editor's Introduction • Dejan Milojicic • September 2011
The race toward the fastest computer has become more global than ever, as witnessed by the list of the world's top 500 supercomputers. Only two years ago, the three top supercomputers were from the US Department of Energy (DoE). The list is updated every six months; over the following year and a half, Chinese computers were ranked 2nd, then 1st. In the most recent list, a Japanese supercomputer came in first, leaving Jaguar at Oak Ridge, ranked third, as the only US system in the top five. In addition to performance, sustainability or power efficiency is getting much more attention, and a "green list" is also maintained.
Aside from global competitiveness, the real drivers behind the race are applications that require more powerful hardware to address some key technical problems. Examples include combustion (see one of our feature papers below), extreme materials, and nuclear power. Both the US DARPA Ubiquitous High Performance Computing and the forthcoming US DoE Exascale Computing funding opportunities will help drive next-generation supercomputers. And, down the road, these achievements will also transfer to commodity computers. US national agencies have targeted exascale (10^18 flops) as the next big step, and they expect this to be achieved by 2020.
Although amassing enough computers to reach this level of computational power is already possible, doing so within a limited power budget and with sufficient reliability is the key challenge. To put things in perspective, today’s top computer has a peak of 8.8 Pflops, while burning slightly less than 10 megawatts (MW) of power. Simply scaling it to exascale size would push power consumption to over 1 gigawatt (GW). The targets that the DoE is putting up are closer to 20 MW. Similar reasoning applies to reliability.
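The scaling argument above can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming that power grows linearly with peak performance (the same simplification the text makes) and taking 10 MW as a round stand-in for "slightly less than 10 megawatts"; the function and variable names are illustrative only.

```python
# Back-of-the-envelope power scaling using the figures quoted in the text.
# Assumption: power grows linearly with peak performance (naive scaling),
# i.e. energy efficiency stays at today's level.

PEAK_FLOPS_TODAY = 8.8e15   # 8.8 Pflops, today's top system (from the text)
POWER_TODAY_W = 10e6        # ~10 MW; rounded up from "slightly less than 10 MW"
EXASCALE_FLOPS = 1e18       # exascale target, 10^18 flops
DOE_TARGET_W = 20e6         # ~20 MW DoE power target (from the text)


def naive_scaled_power(target_flops, base_flops, base_power_w):
    """Power in watts at target_flops if efficiency stays unchanged."""
    return base_power_w * (target_flops / base_flops)


# How many times larger an exascale machine is than today's top system.
scale_factor = EXASCALE_FLOPS / PEAK_FLOPS_TODAY

# Power that machine would draw at today's efficiency.
exascale_power_w = naive_scaled_power(
    EXASCALE_FLOPS, PEAK_FLOPS_TODAY, POWER_TODAY_W
)

# Efficiency improvement needed to fit the DoE power target.
efficiency_gap = exascale_power_w / DOE_TARGET_W

print(f"Scale factor: {scale_factor:.0f}x")
print(f"Naive exascale power: {exascale_power_w / 1e9:.2f} GW")
print(f"Efficiency improvement needed for 20 MW: {efficiency_gap:.0f}x")
```

Under these assumptions, the naive estimate lands a little above 1 GW, and meeting a 20 MW budget would require energy efficiency (flops per watt) to improve by more than a factor of 50.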
What kind of challenges does exascale computing bring to us computer scientists, and where can the IEEE Computer Society help? We’ll have to revisit some fundamental assumptions, including
- memory, including non-volatile,
- CPU designs,
- power and cooling,
- co-design of systems and applications, and so on.
This month's theme addresses some of these topics. In "The Reliability Wall for Exascale Supercomputing," Xuejun Yang and colleagues highlight the significance of achieving scalable performance using fault tolerance. They quantify the effects of reliability on the scalability of peta/exascale systems by introducing the concepts of reliability speedup and "costup." Finally, they show how to mitigate reliability wall effects in system design and in hardware and software.
In “HyperX: Topology, Routing, and Packaging of Efficient Large-Scale Networks,” Jung Ho Ahn and colleagues introduce an extension of the hypercube and flattened butterfly topologies, the HyperX. It takes advantage of high-radix switch components that integrated photonics will make available. HyperX is a good candidate for exascale architecture because of its performance, packaging, and cost.
In "From Microprocessors to Nanostores: Rethinking Data-Centric Systems," Partha Ranganathan describes an active memory and non-volatile random-access memory (NVRAM) approach to addressing large amounts of data. He also introduces data-centric workloads to model and benchmark contemporary and future applications. He strongly advocates reevaluating the implications of data-centric workloads for system architectures. By the exascale era, NVRAMs will be more widely deployed, and new ways of addressing them will become critical.
Mark Giampapa and colleagues present an operating system for supercomputers in “Experiences with a Lightweight Supercomputer Kernel: Lessons Learned from Blue Gene CNK.” In large-scale computing, the performance and reliability impacts of kernels on systems and applications are amplified significantly, hence the introduction of small, low-noise kernels, such as in IBM’s Blue Gene system. The authors demonstrate that such kernels can retain Linux compatibility without losing any of the low-noise and reliability aspects. This will be even more critical in exascale systems.
In "In Situ Visualization for Large Scale Combustion Simulations," Hongfeng Yu and colleagues discuss application visualization requirements in exascale systems. The typical approach of visualizing results offline, after a run completes, does not work for huge amounts of data, so new approaches are required that collect data during runs and that can be used either online or offline. This kind of approach enables capturing and understanding some highly intermittent transient phenomena in turbulent combustion.
There are numerous other resources available on this topic; see the Related Resources page for a few to start with.