In this message, I want to highlight the issue of workload characterization in the modern setting. Understanding the target workloads—from the perspective of performance, power, temperature, and even reliability—is an essential element of the overall design process. This is especially true at the early stages of microarchitecture definition.
Note, first of all, the emphasis on the qualifier "target." It is crucial for a design team to understand the application space of the product during the project's early stages. This is, admittedly, often a big problem because understanding the future user applications at the early stage is difficult. In the general-purpose microprocessor design space, a team might limit the benchmark suite to, say, SPEC 2000, only to see after chip tape-out that the very latest announced SPEC suite is quite different. Scaling today's workloads to anticipate the requirements of tomorrow's customers is a challenge; but this is an aspect that the team should definitely try to factor into its overall methodology.
Once the target workload suite (suitably scaled or not) is chosen, the performance modeling team often needs to adopt an intelligent and representative sampling policy to cut down on simulation time. The underlying research that drives such a sampling methodology must have a quantitative method of certifying the sampled (compressed) trace or workload as being adequately representative of the full application trace. Despite many past advances, this area of R&D remains challenging as chip microarchitectures evolve.
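To make the idea of a quantitative representativeness check concrete, here is a minimal sketch. It assumes a trace is simply a list of instruction-class labels, uses a fixed-period sampling policy, and compares the instruction mix of the sampled and full traces with the total variation distance. The function names, the sampling policy, and the choice of metric are all illustrative assumptions; a real methodology would compare many more characteristics than instruction mix alone.

```python
from collections import Counter

def mix(trace):
    """Normalized frequency of each instruction class in a trace."""
    counts = Counter(trace)
    total = sum(counts.values())
    return {op: n / total for op, n in counts.items()}

def total_variation(p, q):
    """Total variation distance between two discrete distributions.

    0.0 means identical distributions; 1.0 means disjoint support.
    """
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def systematic_sample(trace, interval_len, period):
    """Keep the first `interval_len` entries out of every `period` entries."""
    return [op for i, op in enumerate(trace) if i % period < interval_len]

# Hypothetical toy trace; a real one would come from an instrumented run.
full_trace = ["load", "alu", "alu", "branch", "store", "alu", "load", "alu"] * 500

sampled = systematic_sample(full_trace, interval_len=100, period=1000)
distance = total_variation(mix(full_trace), mix(sampled))
print(f"kept {len(sampled)} of {len(full_trace)} entries, mix distance = {distance:.4f}")
```

A team would accept the sampled trace only if the distance stays below a threshold it has chosen and justified for its own workloads.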
Whether it is for developing a representativeness check metric for sampled traces, or for simply understanding the key characteristics of the workload, you then must propose a set of measurements that, when applied to the workload, quantify its basic characteristics. This is not just a useful step; it is mandatory if the design team's goal is to come up with a well-balanced processor, with the right level of investment in the right microarchitectural features.
Good old Amdahl's law is, of course, the key implicit driver of such an application-driven design thought. For example, if a priori characterization of memory reference streams shows that the probability of bank conflicts for a proposed data cache interleaving scheme is quite small, then an initial design point that involves an interleaved cache (instead of a true dual-ported cache) might make sense. After all, the area and power advantage of an interleaved cache design (over a physically dual-ported design) is intuitively far greater than the expected performance loss in that case.
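The bank-conflict estimate in this example can be sketched in a few lines. The sketch assumes that consecutive references in the address stream are paired up as if they were issued in the same cycle, that banks are interleaved on 8-byte words, and that any two same-cycle accesses to the same bank count as a conflict. All three are simplifying assumptions made for illustration, not a description of any particular cache design.

```python
def bank_of(addr, num_banks, word_bytes):
    """Bank index under simple low-order word interleaving (assumed scheme)."""
    return (addr // word_bytes) % num_banks

def bank_conflict_rate(addresses, num_banks, word_bytes=8):
    """Fraction of same-cycle access pairs that map to the same bank.

    Consecutive references (0,1), (2,3), ... are treated as one dual-issue pair.
    """
    pairs = list(zip(addresses[0::2], addresses[1::2]))
    if not pairs:
        return 0.0
    conflicts = sum(
        1
        for a, b in pairs
        if bank_of(a, num_banks, word_bytes) == bank_of(b, num_banks, word_bytes)
    )
    return conflicts / len(pairs)

# Hypothetical reference stream (byte addresses).
refs = [0x1000, 0x1008, 0x1010, 0x1050, 0x2000, 0x2018, 0x3000, 0x3040]
rate = bank_conflict_rate(refs, num_banks=4)
print(f"bank conflict rate: {rate:.2f}")
```

Multiplying such a rate by the fraction of cycles that actually issue two memory accesses, and by the stall penalty per conflict, gives a first-order estimate of the performance loss to weigh against the area and power savings.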
Similarly, if dependence analysis shows that back-to-back dependences between issuable integer instructions are relatively rare, and supporting zero-bubble issue of back-to-back dependent integer instructions in a pipelined superscalar processor is an expensive proposition, then the initial design point should probably not strive for that capability.
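A dependence analysis of this kind can be sketched as follows. The sketch assumes each trace entry is a hypothetical `(dest, srcs)` tuple of register names and records, for every instruction, the distance in dynamic instructions to the nearest earlier producer of any of its sources. A distance of 1 is a back-to-back dependence. Only register dependences are tracked; memory dependences are ignored for brevity.

```python
from collections import Counter

def dependence_distances(trace):
    """Histogram of distance to the nearest producer of each instruction's sources."""
    last_writer = {}  # register name -> index of its most recent producer
    hist = Counter()
    for i, (dest, srcs) in enumerate(trace):
        dists = [i - last_writer[r] for r in srcs if r in last_writer]
        if dists:
            hist[min(dists)] += 1
        if dest is not None:
            last_writer[dest] = i
    return hist

# Hypothetical six-instruction trace: (destination register, source registers).
trace = [
    ("r1", ()),
    ("r2", ("r1",)),
    ("r3", ()),
    ("r4", ("r1", "r3")),
    ("r5", ("r2",)),
    (None, ("r5", "r4")),  # e.g. a store or branch with no register result
]

hist = dependence_distances(trace)
back_to_back = hist[1] / len(trace)
print(f"back-to-back dependent fraction: {back_to_back:.2f}")
```

If the back-to-back fraction measured over the real target workloads is small, the cost of eliminating the issue bubble is harder to justify.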
You can, of course, similarly argue about the need for characterizing the target workload suite from the point of view of power and temperature metrics, or inherent vulnerabilities to soft errors. The target applications might, for example, have a lot of built-in redundancies that result in significant masking of soft errors at the machine level. It would be good to know this before proposing an elaborate on-chip detection and recovery mechanism that consumes significant area and power. Selective protection of the most vulnerable resources, guided by workload characteristics, might be a design philosophy that yields a much more efficient end result.
You might ask: Why invest in elaborate characterizations of the workload in isolation? Why not just run it through the cycle-accurate processor simulator and do classical design trade-off analysis studies (performance, power, and so on) to prune the design space and come up with a plausible, efficient design point? The answer is at least twofold:
• The cycle-accurate simulator—with built-in power, thermal, and reliability models—is usually not available in believable form very early when the basic design point decisions are being made.
• Elaborate design trade-off experiments with a detailed processor simulator take lots of time. It is often hard to construct multidimensional trade-off analysis views for the design team to act upon quickly during the key decision period.
A project team might conduct workload characterization once and save the results in a database. That way, the team can later frame different, possibly multidimensional, queries about application behavior as needed when making very basic microarchitectural decisions. Fine-tuning details and perhaps revising a small subset of earlier decisions is always possible later, when a cycle-accurate simulator is available.
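As a rough illustration of such a characterization database, the sketch below stores per-workload metrics in an in-memory SQLite table and asks a two-dimensional question: which workloads have both a low bank-conflict rate and a low back-to-back dependence fraction? The table layout, metric names, workload names, and all numeric values are hypothetical placeholders, not measurements.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE workload_stats (workload TEXT, metric TEXT, value REAL)")

# Placeholder values for illustration only; real rows come from measurement.
rows = [
    ("app_a", "bank_conflict_rate", 0.02),
    ("app_a", "back_to_back_dep_fraction", 0.10),
    ("app_b", "bank_conflict_rate", 0.15),
    ("app_b", "back_to_back_dep_fraction", 0.08),
    ("app_c", "bank_conflict_rate", 0.03),
    ("app_c", "back_to_back_dep_fraction", 0.35),
]
conn.executemany("INSERT INTO workload_stats VALUES (?, ?, ?)", rows)

def low_on_both(conn, max_conflict, max_dep):
    """Workloads below both thresholds, found with a self-join on workload name."""
    cur = conn.execute(
        """
        SELECT a.workload
        FROM workload_stats AS a
        JOIN workload_stats AS b ON a.workload = b.workload
        WHERE a.metric = 'bank_conflict_rate' AND a.value < ?
          AND b.metric = 'back_to_back_dep_fraction' AND b.value < ?
        ORDER BY a.workload
        """,
        (max_conflict, max_dep),
    )
    return [row[0] for row in cur]

result = low_on_both(conn, max_conflict=0.05, max_dep=0.20)
print(result)
```

The point of the design is that the expensive measurement happens once, while each new design question becomes a cheap query.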
Some workload characteristics, such as dynamic-instruction frequency mix and dependence distances, are largely microarchitecture-independent. Even here, however, the exact microarchitecture might impose some distortion. For example, in a highly speculative processor (with aggressive branch prediction and prefetch), the dynamic-instruction frequency mix of dispatched instructions might differ somewhat from that of the final completed instructions. Yet, the machine-independent view usually captures much of the inherent characteristics relevant for a particular decision.
Designers can project the cache-miss characteristics of a workload using a multiconfiguration, single-pass cache simulation; so, in that sense, such analysis is also a machine-independent view. Other workload metrics are hard to define without considering the actual microarchitecture pipeline. In such cases, running the target workloads on prior-generation machines can yield relevant statistics; suitable scaling might be necessary to make the data useful for a future development project.
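One well-known way to get miss counts for many cache sizes from a single pass is LRU stack-distance analysis, in the style of Mattson's stack algorithm. The sketch below assumes fully associative LRU caches and a trace of block addresses; it illustrates the general technique and does not claim to be the specific method intended above. Set-associative configurations and other replacement policies need extensions that are not shown.

```python
def lru_stack_distances(block_trace):
    """LRU stack distance per reference (None marks a first touch).

    A distance of d means the block was the d-th most recently used
    distinct block at the time of the reference (1 = most recent).
    """
    stack = []  # least recently used first, most recently used last
    distances = []
    for blk in block_trace:
        if blk in stack:
            idx = stack.index(blk)
            distances.append(len(stack) - idx)
            stack.pop(idx)
        else:
            distances.append(None)
        stack.append(blk)
    return distances

def misses_for_sizes(block_trace, sizes):
    """Miss count for each fully associative LRU capacity (in blocks)."""
    distances = lru_stack_distances(block_trace)  # the single pass over the trace
    return {
        size: sum(1 for d in distances if d is None or d > size)
        for size in sizes
    }

# Hypothetical block-address trace.
blocks = [0, 1, 2, 0, 1, 3, 0, 2]
misses = misses_for_sizes(blocks, sizes=[2, 3, 4])
print(misses)
```

A reference hits in an LRU cache of capacity C exactly when its stack distance is at most C, which is why one list of distances answers the question for every size at once. The list-based stack here is quadratic in the worst case; production tools use tree-based structures.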
Then, of course, there are some very specific details and questions that should indeed wait for the availability of a cycle-accurate simulator. By the way, a good job of a priori workload characterization pays off immensely in interpreting and validating results obtained from the detailed simulator. That's another reason why investing in the workload analysis and measurement work is a good idea.
This view also forces the design team to think about the programming model, compiler optimizations, and application tuning before (or at least in conjunction with) actual product definition, instead of just as an afterthought. In the current era of multicore processor chip designs (quite possibly with heterogeneous computing, communication, and storage elements), proper attention to characterizing the target workloads has become even more important than before. I would like to see our community invest more in terms of R&D activity in this area, much more than what we have witnessed in recent years. Specifically, I would like to include workload characterization as a major part of the scope of IEEE Micro; and, as such, I would like to invite more articles in this area from potential authors. Both academic research and experience-based methodology or measurement-based submissions from industry are welcomed, of course.
This particular theme issue links to the well-known Hot Chips conference. This is one of the most popular issues, where readers are able to learn about many details of the very recently announced real microprocessor chip offerings. I hope you enjoy reading the articles, even though you may ask yourself about some of them: "Hmm … great chip, but how do I use it? Did these folks think about the target workloads and usage characteristics?"
Editor in Chief