2014 23rd International Conference on Parallel Architecture and Compilation (PACT) (2014)
Aug. 23, 2014 to Aug. 27, 2014
Ehsan Fatehi , Electrical and Computer Engineering, Texas A&M University
Paul V. Gratz , Electrical and Computer Engineering, Texas A&M University
With the breakdown of Dennard scaling, future processor designs will be at the mercy of power limits as Chip MultiProcessor (CMP) designs scale out to many-cores. It is critical, therefore, that future CMPs be optimally designed in terms of performance efficiency with respect to power. A characterization analysis of future workloads is imperative to ensure maximum returns of performance per Watt consumed. Hence, a detailed analysis of emerging workloads is necessary to understand their characteristics with respect to hardware in terms of power and performance tradeoffs. In this paper, we conduct a limit study simultaneously analyzing the two dominant forms of parallelism exploited by modern computer architectures: Instruction Level Parallelism (ILP) and Thread Level Parallelism (TLP). This study gives insights into the upper bounds of performance that future architectures can achieve. Furthermore it identifies the bottlenecks of emerging workloads. To the best of our knowledge, our work is the first study that combines the two forms of parallelism into one study with modern applications. We evaluate the PARSEC multithreaded benchmark suite using a specialized trace-driven simulator. We make several contributions describing the high-level behavior of next-generation applications. For example, we show these applications contain up to a factor of 929× more ILP than what is currently being extracted from real machines. We then show the effects of breaking the application into increasing numbers of threads (exploiting TLP), instruction window size, realistic branch prediction, realistic memory latency, and thread dependencies on exploitable ILP. Our examination shows that theses benchmarks differed vastly from one another. As a result, we expect no single, homogeneous, micro-architecture will work optimally for all, arguing for reconfigurable, heterogeneous designs.
Benchmark testing, Parallel processing, Multicore processing, Instruction sets, Upper bound, Transistors
E. Fatehi and P. V. Gratz, "ILP and TLP in shared memory applications: A limit study," 2014 23rd International Conference on Parallel Architecture and Compilation (PACT), Edmonton, Canada, 2014, pp. 113-125.