Issue No. 07 - July (2005 vol. 38)
ISSN: 0018-9162
pp: 36-40
Wayne Wolf, Princeton University
Hannu Tenhunen, Royal Institute of Technology, Stockholm
Multiprocessor systems-on-chips are one of the key applications of VLSI technology today. MPSoCs embody complex systems and enable large markets that leverage the large investments required for advanced VLSI fabrication lines.
Many applications, such as mobile systems, require single-chip implementations to meet the application's size and power consumption requirements. Other applications leverage large chips to reduce system cost. Systems-on-chips provide single-chip solutions in all these cases. SoCs are often customized to the application to improve their power/performance ratio or their cost.
Some of today's complex applications may inherently require programmability, as in the case of Java-enabled devices. Often, the only way to make the system work in a reasonable amount of time is to build programmable components. While single processors may be sufficient for low-performance applications that are typical of early microcontrollers, an increasing number of applications require multiprocessors to meet their performance goals.
MPSoCs, therefore, are increasingly used to build complex integrated systems. A multiprocessor system-on-chip is more than just a rack of processors shrunk down to a single chip. Both application requirements and implementation constraints push developers to build custom, heterogeneous architectures.
The applications that SoC designs target exhibit a punishing combination of constraints:

• not simply high computation rates, but real-time performance that meets deadlines;

• low power or energy consumption; and

• low cost.

Each of these constraints is difficult in itself, but the combination is extremely challenging. And, of course, while meeting these requirements, we can't break the laws of physics.
MPSoCs balance these competing constraints by adapting the system's architecture to the application's requirements. Putting computational power where it is needed meets performance constraints; removing unnecessary elements reduces both energy consumption and cost.
MPSoCs are not chip multiprocessors. Chip multiprocessors are components that take advantage of increased transistor densities to put more processors on a single chip, but they don't try to leverage application needs. MPSoCs, in contrast, are custom architectures that balance the constraints of VLSI technology with an application's needs.
Applications drive systems
SoCs are often enabled by standards. Complex chips usually need large markets to be economically viable. Designing a large chip typically costs $10 million to $20 million; expensive masks add to the nonrecurring costs of large chips. Large markets spread these fixed costs over more chips. Standards—multimedia, networking, communications—provide large markets with relatively stable requirements.
Standards committees often provide reference implementations. They may write these implementations in C or in a more abstract form such as Matlab. Running the reference implementation helps system designers understand the standard's nature.
Developers can use the reference implementation as a starting point. However, many reference implementations were written with flexibility, not performance or efficiency, in mind. Developers often must extensively modify a reference implementation before using it as the basis for a system design.
Standards generally provide some freedom of implementation. They usually specify input and output characteristics rather than the algorithms used to generate the data. This allows designers to differentiate their system by improving its quality or lowering its power consumption.
This also means that designers might develop algorithms at the same time as the architecture. Algorithm designers need good estimates of implementation properties, such as power consumption and performance, to guide their decisions. System architects need to provide enough flexibility in the architecture design—programmability, bandwidth, and so on—to accommodate last-minute changes to algorithms.
Specialized processing elements
Many applications can take advantage of specialized CPU instructions. A specialized instruction can speed up an operation while still providing the flexibility of a programmable processor. But if designing a custom CPU is hard, designing the compiler and other software support that goes with it is even harder.
A configurable processor is designed to be extensible in a number of ways: instruction set, word width, cache size, and so on. Given a set of refinements to the architecture, tools generate both the CPU hardware design and the compiler, debugger, and so on to go with that processor.
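To make this concrete, consider a hypothetical extension of the kind a signal-processing team might add to a configurable processor: a saturating multiply-accumulate instruction. The C function below is purely illustrative (the name and semantics are not taken from any vendor's toolkit); it gives the reference behavior that the generated compiler would map to a single instruction on the extended CPU:

```c
#include <stdint.h>

/* Hypothetical custom instruction: saturating multiply-accumulate.
 * On the extended processor, the tool-generated compiler would emit
 * one instruction for this operation; this portable C version defines
 * the reference semantics used for simulation and non-extended builds. */
static int32_t satmac(int32_t acc, int16_t a, int16_t b)
{
    int64_t wide = (int64_t)acc + (int64_t)a * (int64_t)b;
    if (wide > INT32_MAX) return INT32_MAX;   /* clamp on overflow  */
    if (wide < INT32_MIN) return INT32_MIN;   /* clamp on underflow */
    return (int32_t)wide;
}
```

On a processor without the extension, the function simply executes in software, which also serves as the golden model when verifying the generated hardware.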
Hardwired accelerators are an alternative, complementary way to efficiently execute operations. Some functions are used regularly but in larger chunks than instructions. For example, the 8 × 8 discrete cosine transform is widely used in a number of image and video applications. An accelerator could perform the entire function without intervention.
Standards often provide primitive operations that can be performed relatively independently and do not require the flexibility of full software implementations. For such functions, accelerators are often the lowest power implementation available.
Memory systems
MPSoCs often have heterogeneous memory systems. Some blocks of memory may be accessible by only one or a few processors. There can be several specialized memory blocks in a system, along with some more widely accessible memory. Heterogeneous memory systems are harder to program because the programmer must keep in mind what processors can access what memory blocks. But irregular memory structures are often necessary in MPSoCs.
One reason that designers resort to specialized memory is to support real-time performance. A block of memory that only some of the processors can access makes programming more difficult, but it makes it easier to figure out how the processors can interfere as they access the memory. Unpredictable memory access times can make it impossible for processing elements to finish tasks by their deadlines.
Having fewer processing elements with access to a block of memory reduces interference with time-critical accesses. Sharing many data and program elements in a large common memory makes analyzing memory system performance more difficult.
Although tools for memory system analysis are improving, system architects often build in predictability with custom memories. Custom memory architectures can reduce energy consumption, and smaller memory blocks consume less power. Smaller buses and interconnect structures, spanning fewer processing elements, also consume less power.
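One low-level mechanism for such custom placement, sketched below under the assumption of a GCC-style toolchain and an ELF target, is to pin a time-critical buffer into a named linker section that the board's linker script maps onto on-chip scratchpad RAM. The section name `.dspram` is invented for illustration:

```c
/* Hypothetical placement of a time-critical buffer into a dedicated
 * on-chip memory. The section name, and the idea that the linker
 * script maps .dspram onto scratchpad RAM, are assumptions for
 * illustration; on a hosted build the section lands in ordinary RAM. */
static int filter_state[64] __attribute__((section(".dspram")));

int accumulate(int sample)
{
    /* Shift in the new sample. With the buffer in scratchpad that no
       other processor contends for, this loop's worst-case access
       time is fixed and analyzable. */
    int sum = 0;
    for (int i = 63; i > 0; i--) {
        filter_state[i] = filter_state[i - 1];
        sum += filter_state[i];
    }
    filter_state[0] = sample;
    return sum + sample;
}
```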
The complex applications that run on MPSoCs show wide variations in data loads during execution. The design and usage of the intersubsystem communication network are important factors that determine the performance and cost of the resulting design. Although considerable research effort has been directed at interconnects, the problem is still far from solved for specific multiprocessor architectures.
Existing work comes from two different communities:

• Classical computer architecture, with its system bus concept. Much work has been done on bus architectures and arbitration strategies, generally tightly coupled with the memory architecture. The key problem with the bus architecture is scaling it to a large number of processors.

• Networking, which generated the concept of networks on chips (NoCs). Key NoC concepts include distributing the communication structure and using multiple routes for data transfer, allowing flexible, programmable, and even reconfigurable networks. NoCs let developers target MPSoC platforms at a much wider variety of products.

Challenges and Opportunities
Many MPSoCs have already been designed, but advances in Moore's law present continuing challenges. MPSoCs combine the difficulties of building complex hardware systems and complex software systems.
Methodology is critical to MPSoC design. Designers of general-purpose processors do not rely as much on explicit methodologies, but the time pressures under which MPSoCs are designed often demand a more structured approach. And because developers design MPSoCs more frequently than new microprocessors are designed, they gain enough experience to develop and refine their design methodologies.
Methodologies that work offer many advantages. They decrease the time it takes to design a system; they also make it easier to predict how long the design will take and how many resources it will require. Methodologies also codify techniques for improving performance and power consumption that developers can apply to many different designs.
MPSoC methodologies will necessarily be a moving target for the next decade, and they must be constantly tweaked: New technologies change the underlying components; new tools support new approaches to design.
MPSoC hardware architectures present challenges in all aspects of the multiprocessor: processing elements, memory, and interconnects.
Configurable processors with customized instruction sets are one way to improve the characteristics of processing elements; hardware/software co-design of accelerators is another technique.
Researchers have been studying memory systems for symmetric multiprocessors for many years, and MPSoCs often use heterogeneous memory systems to improve real-time performance and power consumption. However, designers often make memory system design choices based on insufficient data and inadequate design space exploration.
As designs move from buses to more general networks, developers must navigate a much larger design space. NoCs provide one promising way to modularize the interconnect design in very large chips.
A critical question in MPSoC architectures is the balance between programmability and efficiency. It's more difficult to program multiprocessors than uniprocessors, and it's more difficult to program heterogeneous multiprocessors than homogeneous multiprocessors.
MPSoCs are heterogeneous in just about every way possible: multiple instruction sets, hardwired function units, heterogeneous memory systems and address spaces, interconnects of varying bandwidth and with less than fully connected topology.
The choice of programmability features in an MPSoC is an art—and a not very well understood one at that. A more principled approach to the choice of programmable structures would be a welcome help to many MPSoC designers.
Modern ASIC design relies heavily on synthesis, both at the logic and physical levels. MPSoC design today relies more on simulation and hand analysis than ASIC design does.
We do not have tools that will automatically turn a large application into a complete design. Even within a fairly narrow application, developing these complex tools is difficult, and the market for such tools is too limited to make them economically feasible. Furthermore, MPSoC-enabled applications increasingly run the gamut from cryptography to video, making it difficult to narrow the domain in which a tool must work.
Because designers have fewer synthesis tools, they need excellent simulation tools, but not enough good MPSoC simulators exist. Most multiprocessor simulators developed for general-purpose computer design can simulate only homogeneous multiprocessors. It is often extremely difficult to modify such simulators to handle heterogeneous processing elements, memory, and interconnects.
MPSoC designers need simulators that can help them characterize their systems at multiple levels of abstraction: functional simulators for software analysis and debugging; cycle-accurate simulators for detailed performance analysis; and power simulators for energy analysis.
Methodologies must also take into account the system specification's source. Standards committees often provide reference implementations of their standards. The good news is that these are executable specifications that researchers can run and analyze. The bad news is that they are often programmed in a style that is not well suited to SoC implementations.
Reference implementations often use dynamic memory management (such as malloc in Unix) in ways that simplify the life of the reference implementation's programmer but cause performance problems in SoCs. They also often are not optimized for performance or for power consumption. Thus, the SoC design team often must spend a considerable amount of time optimizing the code even before they target it for their particular MPSoC.
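A typical first optimization is to replace general-purpose dynamic allocation with a static, fixed-size pool whose timing and memory footprint are predictable and analyzable. The sketch below is illustrative only; the names and sizes are invented, not drawn from any particular reference implementation:

```c
#include <stddef.h>

/* Illustrative static block pool replacing malloc when porting a
 * reference implementation to an SoC. Pool geometry is hypothetical. */
#define POOL_BLOCKS 8
#define BLOCK_BYTES 64

static unsigned char pool[POOL_BLOCKS][BLOCK_BYTES];
static unsigned char in_use[POOL_BLOCKS];

void *pool_alloc(void)        /* bounded search, no fragmentation */
{
    for (int i = 0; i < POOL_BLOCKS; i++)
        if (!in_use[i]) { in_use[i] = 1; return pool[i]; }
    return NULL;              /* pool exhausted: a hard, analyzable limit */
}

void pool_free(void *p)
{
    for (int i = 0; i < POOL_BLOCKS; i++)
        if (p == pool[i]) { in_use[i] = 0; return; }
}
```

Unlike malloc, the pool's worst-case allocation time is a short, fixed loop, and exhaustion is an explicit, testable condition rather than a heap failure deep inside library code.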
The Breakthrough: HW/SW interface codesign
The key issue when integrating an MPSoC's parts is creating a continuum between embedded software and hardware. This requires new technologies allowing HW/SW interface codesign to integrate embedded SW and HW components. The discipline of HW/SW interface codesign requires using specific programming models that can hide sophisticated HW/SW interfaces.
Most conventional parallel programming models used in general-purpose designs—such as the message passing interface (MPI), OpenMP, BSP, LogP, and so on—primarily target single-processor subsystems and homogeneous architectures such as symmetric multiprocessors. MPSoCs require more complex HW/SW interface descriptions to capture the interactions between heterogeneous subsystems. MPSoC application programming interfaces must specify application-specific design constraints, such as quality of service (QoS) in terms of energy consumption, runtime, cost, and reliability. The challenge will be abstracting complex heterogeneous multiprocessor platforms.
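To illustrate what such an API might look like, the following sketch attaches QoS constraints to a communication request. All of the types, fields, and the admission check are hypothetical, invented to show the shape of the idea rather than any existing standard:

```c
/* Hypothetical sketch of a QoS-annotated channel request, showing how
 * an MPSoC API might carry application-level constraints down to the
 * HW/SW interface. Names and fields are invented for illustration. */
typedef struct {
    unsigned bandwidth_kbps;   /* sustained throughput required   */
    unsigned max_latency_us;   /* worst-case end-to-end delay     */
    unsigned energy_budget_uj; /* per-transfer energy allowance   */
} qos_req;

/* Returns 1 if a (hypothetical) platform link can satisfy the request,
 * 0 otherwise; real middleware would map the request onto a NoC route. */
int qos_admissible(const qos_req *req,
                   unsigned link_kbps, unsigned link_latency_us)
{
    return req->bandwidth_kbps <= link_kbps &&
           req->max_latency_us >= link_latency_us;
}
```

The point of the sketch is that the constraint travels with the request: middleware can reject, reroute, or reconfigure at design time or runtime, instead of the application discovering a missed deadline after deployment.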
Mastering HW/SW interfaces codesign opens new vistas that will bring fundamental improvements to the design process:

• concurrent design of both hardware and embedded software, leading to a shorter time-to-market;

• modular design of hardware and software components, leading to clearer design choices when building complex systems; and

• easier global validation of embedded systems including hardware and embedded software, leading to increased reliability and improved quality of service.

Additionally, HW/SW interface codesign makes several technical challenges more tractable:
HW/SW interface codesign ensures embedded software quality. Embedded software relies on the hardware platform to support complex QoS requirements and high reliability. Current practice uses an existing OS or middleware to validate nonfunctional properties of application software. These validation methods generally support real-time and delay requirements. Unfortunately, they do not support other nonfunctional properties such as intersubsystem communication bandwidth, jitter, and reliable communication.
Currently, developers validate embedded software QoS requirements only when the physical hardware prototype becomes available. This does not allow monitoring to guarantee the required QoS in a systematic manner. In such a scenario, it is even difficult to guarantee the reliability of HW/SW interface design itself. Overcoming this challenge requires a QoS-aware HW/SW interface abstraction.
Abstract HW/SW interfaces allow verification early in the design process. Verifying the HW/SW interface itself imposes new challenges. It is not sufficient to verify the interface independent of its context—the interface must be verified relative to a given hardware platform. Of course, designers can't wait until the hardware prototype is available to carry out this verification. An abstract HW/SW interface lets them verify the correctness of the interface design without a physical prototype.
Abstract HW/SW interfaces allow using standards. A fairly general HW/SW interface model can be useful and applicable in a variety of applications. The advantage of such a general model is that developers can reuse application software, hardware components, and platform and middleware modules across different products, product families, and even application domains.
The drawback that accompanies generality is inefficiency. For applications that require only a small subset of the complete HW/SW interface functionality, a generic model carries a tremendous overhead that cost-sensitive applications can't tolerate. A highly configurable and parameterized abstract HW/SW interface architecture mitigates this problem, allowing an instance of the interface to be optimized and streamlined to a given application's particular needs. This is a central advantage of the HW/SW interface concept: Without an efficient method to configure and optimize the HW/SW interface, embedded system design cannot be mastered.
Abstract HW/SW interfaces facilitate interoperability. Abstract HW/SW interfaces enable dialogues between separate teams that can belong to different companies or even different market sectors. For example, a car maker could use a standard HW/SW interface API to develop the vehicle's software while delaying selection of the hardware platform as long as possible.
In This Issue
This issue contains four interesting articles on MPSoC design.
In "Parallelism and the ARM Instruction Set Architecture," John Goodacre and Andrew N. Sloss trace the development of the ARM architecture. Their article shows how the original RISC architecture was tuned over several generations to provide features important for embedded applications. It also describes how developers will use new multiprocessors in future generations of high-performance, low-power systems.
"Configurable Processors: A New Era in Chip Design," by Steve Leibson and James Kim, looks at the configurable processor approach to SoC design, which leverages the benefits of nanometer silicon lithography with relatively little manual effort. SoC designers can customize configurable processors to connect multiprocessors, providing even more computing power.
In "An Open Platform for Developing Multiprocessor SoCs," Mario Diaz Nava and coauthors describe a low-cost modular approach that uses emulation as an alternative to software simulation for the design and verification of complex MPSoC designs.
"Evaluating Digital Entertainment System Performance," by Marcus Levy, describes the process of benchmark design for systems-on-chips. Because of the complex nature of modern applications, benchmark design is a critical problem for both MPSoC designers and customers.
The Embedded Systems column in this issue also contributes to this look at MPSoCs. "Absolutely Positively on Time: What Would It Take?" by Edward A. Lee of UC Berkeley challenges us to develop design methodologies that will ensure that our systems operate on time.
Conclusion
The industrial perspective of the articles in this issue shows the rapid advances that are being made in this field for deployment in products. This issue focuses more on hardware than on software. We could fill an entire issue just on software challenges in SoCs.
Ahmed Jerraya is research director at TIMA Laboratory, Grenoble, France. His research interests focus on multiprocessor system-on-chip specification, validation, and design. Jerraya received a PhD in computer sciences from the University of Grenoble. He is a member of the IEEE, SIGDA, and EDAA. Contact him at ahmed.jerraya@imag.fr.
Hannu Tenhunen is a professor of electronic system design at the Royal Institute of Technology, Stockholm, Sweden. His research interests include interconnect-centric design, mixed-signal system integration, and nanoelectronic systems. He received a PhD in electrical engineering from Cornell University. He also received an honorary doctor degree (Dr.h.c.) from Tallinn Technical University, Estonia, and holds honorary professor appointments from Fudan University and Beijing Jiaotong University, China. Contact him at hannu@imit.kth.se.
Wayne Wolf is a professor in the Department of Electrical Engineering at Princeton University. His research interests include embedded computing, multimedia systems, VLSI, and computer-aided design. Wolf received a PhD in electrical engineering from Stanford University. He is a Fellow of the IEEE and the ACM. Contact him at wolf@princeton.edu.