
Guest Editors' Introduction: Evaluating Servers with Commercial Workloads

Kimberly Keeton, Hewlett-Packard Laboratories
Russell Clapp, Fabric7 Systems, Inc.
Ashwini Nanda, IBM T.J. Watson Research Center

Pages: 29-32

Abstract—Commercial workloads, which differ markedly from scientific and technical workloads, can more accurately evaluate new server designs.

The vast majority of multiprocessor server systems shipped today run commercial workloads. These workloads include classic database applications such as online transaction processing (OLTP) and decision support systems (DSS), as well as newer workloads including Web server, e-mail server, and multitier e-commerce applications. However, much of the research that influenced the design of these servers used scientific and technical workloads, such as the Standard Performance Evaluation Corporation's SPECint and SPECfp benchmarks.

Several factors motivated these choices, such as research funding agencies' priorities, researchers' experience with technical workloads, and the difficulty of working with commercial workloads, including their large hardware requirements, the complexity of tuning their hardware and software, and the lack of access to commercial application source code. This trend has been changing in recent years, however.


With the maturing of commercial workload benchmarks for multiprocessor servers, described in the "Commercial Workload Benchmarks" sidebar, research studies have emerged showing that these workloads' behavior differs significantly from that of technical and scientific workloads. For example, these studies have generated several points of common wisdom for OLTP workloads. The many concurrent users in these applications lead to high multiprogramming levels and context-switch rates. As a result, OLTP workloads spend a nonnegligible percentage of their execution time in the operating system.

Typically short-lived, OLTP workload transactions access a small fraction of the overall database, resulting in random I/O access patterns that use relatively small disk request sizes. OLTP's cycles per instruction rating runs considerably higher than the SPEC integer benchmarks' CPI. The majority of these cycles consists of instruction- and data-cache miss stalls from OLTP's random data access patterns and nonlooping branch behavior. As a result, these applications are more sensitive to memory latencies.
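The CPI contrast above can be made concrete. The following sketch computes CPI and a stall-cycle fraction from hardware-counter readings; the counter names and values are hypothetical, chosen only to mirror the OLTP-versus-SPECint contrast described in the text, and do not come from any particular processor.

```python
# Illustrative CPI breakdown from hardware performance counters.
# All counter values are hypothetical.

def cpi_breakdown(cycles, instructions, icache_stall_cycles, dcache_stall_cycles):
    """Return overall CPI and the fraction of cycles lost to cache-miss stalls."""
    cpi = cycles / instructions
    stall_fraction = (icache_stall_cycles + dcache_stall_cycles) / cycles
    return cpi, stall_fraction

# Hypothetical OLTP run: high CPI, with most cycles spent in cache-miss stalls.
oltp_cpi, oltp_stalls = cpi_breakdown(
    cycles=4_000_000_000, instructions=1_000_000_000,
    icache_stall_cycles=1_200_000_000, dcache_stall_cycles=1_300_000_000)

# Hypothetical SPECint-like run: lower CPI, modest stall component.
spec_cpi, spec_stalls = cpi_breakdown(
    cycles=1_200_000_000, instructions=1_000_000_000,
    icache_stall_cycles=100_000_000, dcache_stall_cycles=200_000_000)

print(f"OLTP: CPI={oltp_cpi:.1f}, stall fraction={oltp_stalls:.0%}")
print(f"SPECint-like: CPI={spec_cpi:.1f}, stall fraction={spec_stalls:.0%}")
```

With these illustrative numbers, the OLTP run spends well over half of its cycles stalled on cache misses, which is why such workloads are more sensitive to memory latency.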

Researchers have also studied the impact of DSS workloads on server design. Often more complex and longer-running than OLTP queries, DSS queries typically scan large amounts of data, resulting in sequential disk I/O access patterns using large disk request sizes. Due to their sequential I/O patterns and lower multiprogramming levels, DSS workloads typically spend less time in the operating system. DSS queries also tend to have lower CPIs than OLTP, and relatively better cache hit ratios. With lower sensitivity to memory latency and high I/O bandwidth requirements, DSS workloads closely resemble some technical workloads.


Researchers have used several evaluation techniques to study commercial workloads. Processor and chipset performance counters can measure processor and memory system behavior without slowing down application execution. However, performance counters can only measure the underlying hardware as built; they permit little exploration of architectural alternatives.

Analytical models

To address this shortcoming, researchers have built analytical and simulation models of the underlying system. Analytically modeling commercial workload behavior presents a significantly greater challenge than modeling scientific and technical workloads. In many cases, scientific and technical workload performance can be computed as the throughput of the floating point units on the system's processors; the operating system accounts for an insignificant fraction of the execution time. In contrast, the combination of nonlooping branch behavior, frequent calls to the OS, and a large degree of multiprogramming makes analytically modeling multiuser commercial server workloads difficult.

Thus, developers base their models for analyzing how well new designs handle commercial workloads on data collected from existing systems that use processor and chipset performance counters. Although this approach has been used with some success, it does not capture the changes in the workload's execution based on the new system's environment.
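One common form of this counter-driven modeling splits measured CPI into a core component and a memory-stall component, then rescales the memory component for a proposed design. The sketch below illustrates the idea with entirely hypothetical parameters; as noted above, a real workload's miss rate and behavior may shift on the new system, which this simple model cannot capture.

```python
# Simple counter-driven analytical projection (illustrative parameters only).
# Measured CPI is decomposed into core cycles and memory-stall cycles;
# the memory component is then rescaled for a new latency.

def project_cpi(measured_cpi, misses_per_instr, old_latency, new_latency):
    """Project CPI under a new memory latency, assuming stall cycles scale
    linearly with latency and the miss rate stays unchanged."""
    memory_cpi = misses_per_instr * old_latency   # stall cycles per instruction
    core_cpi = measured_cpi - memory_cpi          # non-memory cycles per instruction
    return core_cpi + misses_per_instr * new_latency

# Hypothetical OLTP measurement: CPI of 4.0, 0.02 misses per instruction,
# 150-cycle memory latency on the existing system.
new_cpi = project_cpi(measured_cpi=4.0, misses_per_instr=0.02,
                      old_latency=150, new_latency=100)
print(f"Projected CPI at 100-cycle latency: {new_cpi:.1f}")
```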

Full-system simulation

Full-system simulation offers another approach to modeling new designs. Developers created simulators such as SimOS and SimICS to permit simulation of applications, the operating system, and architectural components. These simulators provide more accurate characterization of commercial workloads and enable exploration of architectural alternatives to better support those workloads. However, they can simulate only scaled-down versions of these workloads due to the exorbitant simulation time and space required to run real-life database sizes.

To address this limitation, a few studies have proposed rules of thumb for scaling back OLTP benchmarks to reduce the disk-space requirements. Others have proposed simplified database microbenchmarks to approximate the behavior of OLTP and DSS workloads.


The ongoing evolution of workloads further complicates the continuing challenge in modeling commercial workloads for architectural evaluation. The growth of networking has led to application deployments distributed across heterogeneous systems connected by high-speed networks. These multitier workloads result in systems that have a specialized function in each tier, including a Web server front-end tier, an application server middle tier, and a database server back-end tier. Researchers have only begun to characterize the behavior of these workloads' different components. Recently developed benchmarks that aim to represent their behavior include the TPC-W, SPECweb, SPECjbb, and ECPerf benchmarks.


In this special issue, we seek to provide computing practitioners with an overview of commercial workload characteristics and how these workloads exercise computer systems. Specifically, the articles in this issue address the characteristics of multitier benchmarks and new approaches to the problem of accurately and cost-effectively modeling complex commercial workloads.

"Benchmarking Internet Servers on Superscalar Machines," by Yue Luo and colleagues, evaluates middle-tier Internet server application behavior on three different microarchitectural platforms, employing built-in processor hardware counters. These authors find that some of the same trends observed for multiuser database back-end workloads also apply to middle-tier applications.

In "TPC-W E-Commerce Benchmark Evaluation," Daniel F. García and Javier García present a detailed characterization of the TPC-W multitier e-commerce benchmark. This analysis includes data on the sensitivity of benchmark performance to different hardware and software configuration options for a given testbed environment. These authors also discuss the appropriateness of the benchmark for representing different e-commerce server-usage models.

"Simulating a $2M Commercial Server on a $2K PC," by Alaa Alameldeen and colleagues, describes techniques for scaling back and tuning large commercial workloads so that they can be simulated on the less-expensive, less-powerful machines available to most researchers. The authors describe the Wisconsin Commercial Workload Suite, comprising four scaled-down benchmarks that approximate workloads for an OLTP database, Java middleware, and static and dynamic Web servers.

In "Queuing Simulation Model for Multiprocessor Systems," Thin-Fong Tsuei and Wayne Yamamoto present a processor-queuing model that projects the performance characteristics of commercial workloads without requiring the complexity of execution- or trace-driven simulation. Given the workload complexity, they use a hybrid analytical model approach that lets event rates drive the simulation. When compared to traditional analytical models, their approach offers the advantage that it can capture the burstiness in time of different events to more accurately model their effect on performance.
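The trade-off they exploit can be illustrated with a minimal example (this is not Tsuei and Yamamoto's model). A short discrete-event simulation of a single service center reproduces the closed-form M/M/1 response time; the same simulation loop, unlike the closed-form result, could be fed bursty, non-exponential event streams, which is exactly the effect their hybrid approach captures.

```python
import random

def simulate_mm1(arrival_rate, service_rate, n_jobs, seed=1):
    """Discrete-event simulation of an M/M/1 queue; returns mean response time."""
    rng = random.Random(seed)
    arrival, last_departure, total_response = 0.0, 0.0, 0.0
    for _ in range(n_jobs):
        arrival += rng.expovariate(arrival_rate)      # next Poisson arrival
        start = max(arrival, last_departure)          # wait if server is busy
        last_departure = start + rng.expovariate(service_rate)
        total_response += last_departure - arrival
    return total_response / n_jobs

sim = simulate_mm1(arrival_rate=0.5, service_rate=1.0, n_jobs=200_000)
analytic = 1.0 / (1.0 - 0.5)   # M/M/1 mean response time: 1/(mu - lambda)
print(f"simulated {sim:.2f} vs analytic {analytic:.2f}")
```

To model burstiness, one would replace the exponential interarrival draws with a correlated or heavy-tailed event stream; the simulation loop itself is unchanged.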

In "Designing Computer Architecture Research Workloads," Lieven Eeckhout and colleagues present a methodology for designing a short-running workload that behaves similarly to a long-running one, through the application of principal-component-analysis techniques. Using this approach, they show how the behavior of reduced or sampled versions of long-running benchmark suites can be validated against the full suite they aim to model. This approach can also be used to compare the detailed execution characteristics of different workloads.
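The core idea of comparing workloads in a principal-component space can be sketched in a few lines of numpy (this is a simplified illustration, not Eeckhout and colleagues' methodology, and the metric vectors are invented): workloads whose execution characteristics match land close together after projection.

```python
import numpy as np

# Hypothetical per-workload metric vectors (rows = workloads, columns =
# characteristics such as CPI, I-cache miss rate, D-cache miss rate,
# branch misprediction rate). Values are invented for illustration.
metrics = np.array([
    [4.0, 0.020, 0.015, 0.12],   # full OLTP run
    [3.8, 0.019, 0.014, 0.11],   # sampled/reduced OLTP run
    [1.2, 0.002, 0.001, 0.04],   # SPECint-like run
])

# Standardize each column, then project onto principal components via SVD.
z = (metrics - metrics.mean(axis=0)) / metrics.std(axis=0)
_, _, vt = np.linalg.svd(z, full_matrices=False)
scores = z @ vt[:2].T            # keep the first two principal components

# Workloads that behave alike land close together in PC space.
d_sampled = np.linalg.norm(scores[0] - scores[1])
d_spec = np.linalg.norm(scores[0] - scores[2])
print(f"full-vs-sampled distance {d_sampled:.2f}, full-vs-SPEC {d_spec:.2f}")
```

A reduced workload validates against its full version when its distance in this space is small relative to the distances between genuinely different workloads.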


Although much research progress has been made, many questions remain. Despite the importance of commercial workloads, over the past 10 years less than 15 percent of all evaluations in major computer architecture conferences used them. Further, workloads will continue to evolve, and new methodological questions will arise. For instance, what is the best way to fully simulate larger-scale multitier systems? What is the right way to set application configuration parameters and scale data sets?

Looking forward, we expect new developments in computer system instrumentation and simulation environments that will let researchers and architects better evaluate their designs before implementation. We also anticipate continued progress in using commercial workloads to evaluate server designs.

Commercial Workload Benchmarks

Over the years, industry consortia have developed standardized benchmarks for various classes of commercial workloads. The Transaction Processing Performance Council (TPC) has developed the most popular database benchmarks. The TPC strives to provide representative workloads for commercial server evaluation.

The TPC-C benchmark currently represents OLTP workloads, modeling an order-entry environment with transactions such as entering and delivering orders, recording payments, checking orders, and monitoring warehouse stock levels. The benchmark emulates a client-server system in which many concurrent users access and modify the database. TPC-C measures performance as the number of new order transactions per minute that satisfy a response time constraint.
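The throughput-under-a-response-time-constraint idea can be sketched as follows. This is a deliberately simplified illustration, not the official tpmC computation (which, among other things, applies the constraint to a response-time percentile); the transaction log and threshold are invented.

```python
# Simplified sketch of a TPC-C-style metric: count only New-Order
# transactions meeting a response-time threshold, per minute of
# measurement. All data here is synthetic and illustrative.

RESPONSE_LIMIT = 5.0   # seconds; illustrative threshold

# (transaction type, response time in seconds) over a 2-minute window
log = [("new_order", 0.8), ("payment", 0.3), ("new_order", 6.2),
       ("new_order", 1.1), ("delivery", 2.0), ("new_order", 0.9)]

window_minutes = 2.0
qualifying = sum(1 for kind, rt in log
                 if kind == "new_order" and rt <= RESPONSE_LIMIT)
tpm = qualifying / window_minutes
print(f"{tpm:.1f} qualifying New-Order transactions per minute")
```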

TPC-H and TPC-R currently represent decision-support-system workloads. TPC-H supports business-oriented ad hoc queries and concurrent data modifications. TPC-R, similar to TPC-H, allows additional optimizations based on advance knowledge of the queries. Both benchmarks use two refresh functions and 22 read-only queries, executed in single-user mode via the power test and in multiuser mode via the throughput test. A composite performance metric expresses the number of queries the system can perform in an hour, giving equal weight to single- and multiuser modes.
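Giving equal weight to the two modes amounts to taking a geometric mean of the power and throughput results. The sketch below shows that composite step only, with hypothetical component values; the full TPC-H metric derivation (query timings, scale factors) is omitted.

```python
import math

# Simplified sketch of a TPC-H-style composite metric: the geometric mean
# of the single-user (power) and multiuser (throughput) results, which
# weights the two modes equally. Component values are hypothetical.

def composite_qph(power_qph, throughput_qph):
    """Geometric mean of power- and throughput-test results."""
    return math.sqrt(power_qph * throughput_qph)

qphh = composite_qph(power_qph=1600.0, throughput_qph=900.0)
print(f"Composite: {qphh:.0f} queries per hour")
```

Because it is a geometric mean, a system cannot inflate the composite by excelling in only one mode: halving either component cuts the composite by about 29 percent regardless of the other.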

TPC-W and the Standard Performance Evaluation Corporation's SPECjbb2000 and SPECweb99 benchmarks represent Internet-driven workloads. TPC-W, a multitier e-commerce benchmark, simulates the activities of a business-oriented transactional Web server. It models different user access patterns using profiles that vary the ratio of browsing to buying, and expresses performance in the number of Web interactions per second completed by the server.

SPECjbb2000 focuses on the evaluation of server-side Java workloads running on middle-tier application servers. To isolate the application server tier, the benchmark's structure differs from that of the TPC benchmarks in that driver threads emulate clients and it stores data in binary trees outside a database. The SPECjbb2000 ops/second performance metric is a composite throughput measurement that represents the averaged throughput over a range of points. SPEC has also defined SPECweb99, which simulates accesses to a Web server that hosts Web pages for several different organizations. The performance metric is the number of simultaneous connections the Web server can support using a predefined workload, while still maintaining specific throughput and error-rate requirements.

Finally, several benchmarks emulate e-mail server environments, including SPECmail2001 and the Storage Performance Council's SPC-1. SPECmail2001 characterizes a mail server's throughput and response time in an ISP environment that supports 10,000 to 1,000,000 users. Its performance metric is messages per minute.

The SPC-1 benchmark evaluates storage area network performance for the random I/O operations typified by e-mail server and OLTP environments. It measures multiuser system performance, using as its metric the maximum I/Os per second the storage system can provide while maintaining acceptable response time.


Interest in this special issue grew out of the Workshop on Computer Architecture Evaluation using Commercial Workloads (CAECW), which we, along with Josep Torrellas at the University of Illinois at Urbana-Champaign, have organized for the past six years. We thank the workshop participants for many stimulating discussions and presentations; the authors of the articles in this special issue for providing such interesting studies to feature; and the anonymous reviewers.

About the Authors

Kimberly Keeton is a research scientist in the Storage Systems Department at Hewlett-Packard Laboratories, where her research focuses on workload characterization, data dependability, and self-managing storage systems. Keeton received a PhD in computer science from the University of California, Berkeley. She is a member of the IEEE, the ACM, and Usenix.
Russell Clapp is a principal engineer at Fabric7 Systems. His research interests include systems architecture and performance analysis. Clapp received a PhD in computer science and engineering from the University of Michigan. He is a member of the IEEE and the ACM.
Ashwini Nanda is a research staff member at the IBM T.J. Watson Research Center. His research interests include computer systems architecture, performance, and rich media systems. He currently serves on the Editorial Board of IEEE Transactions on Parallel and Distributed Systems.