Issue No.03 - March (2004 vol.37)
Published by the IEEE Computer Society
Meeting emerging computer design industry challenges requires finding a balance between continuing to apply old technology beyond the point where it is workable and prematurely deploying new technology without knowing its limitations.
We hardware engineers believe in some things that aren't quite true. We call ourselves electrical engineers because we think we understand electrical current—the flow of electrons in conductors, semiconductors, and other materials, under interesting external conditions such as electric or magnetic fields and temperature gradients. The reality is that we have a practical working understanding of what large numbers of electrons will do en masse. But for single electrons, we can't even say where they are, and physicists would chide us for asking the question. Our knowledge is statistical, not absolute. We stack the odds in our favor by employing very large numbers of electrons in our circuits.
As always in engineering, there are limits to what we know. We're used to this—it feels natural. But to physicists, defining the border between what is known and what is unknown is irresistible. They may fret over whether Schrödinger's cat is alive, dead, or both; engineers will look at the airtight box, then at their watches, wait, and then confidently say, "Dead. What's the next item on the agenda?"
Exceeding the Safe Zone
Our pragmatism can work against us. Because our collective knowledge is bounded—yet we rightly place great value on our experiences (successes and failures)—we are constantly in danger of accidentally going beyond that safe zone. One way to hurt ourselves is to prematurely deploy new implementation technology before we've comprehended its unique limitations; the other is to continue applying old technology to new problems beyond the point where it is workable.
Universities teach students to practice a conservative approach to engineering called "worst-case design." A bridge designer can tell you the heaviest load a bridge has been designed to carry; a locomotive designer can tell you the gross weight of a railroad engine. Knowing those two things, you can quickly determine whether a train with three engines, crossing a bridge that spans three engine lengths, is likely to reach its intended destination or take a fast vertical detour. Those engines could be lighter than specified, or the bridge designer may actually have designed the bridge to two or three times the actual rated load. You don't necessarily know those safety margins, but you do know that if they tell you a worst-case number, you can reasonably expect to use that number successfully. You should also check the assumption that the engine is the heaviest part of the train and ensure that the bridge isn't covered with heavy ice.
Better Than Worst-Case Design
Digital design engineers are used to reading data books or cell libraries that tell them the worst-case behavior of the components they are designing into a system. Knowing how fast or slow a given component will be under normal operating conditions, the designer can stack these numbers end to end to get the minimum and maximum propagation delay times through a logic chain.
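The end-to-end stacking described above can be sketched in a few lines of Python. The gate names, delay figures, and the `meets_timing` check are hypothetical and chosen only for illustration; they are not taken from any real data book or cell library.

```python
from typing import NamedTuple, Sequence, Tuple


class Gate(NamedTuple):
    """One component as a data book might list it (hypothetical values)."""
    name: str
    t_min_ns: float  # fastest specified propagation delay
    t_max_ns: float  # slowest specified propagation delay


def chain_delay(chain: Sequence[Gate]) -> Tuple[float, float]:
    """Return (minimum, maximum) total delay for gates connected in series."""
    return (sum(g.t_min_ns for g in chain), sum(g.t_max_ns for g in chain))


def meets_timing(chain: Sequence[Gate], clock_period_ns: float, setup_ns: float) -> bool:
    """Worst-case check: the slowest path plus setup time must fit in one clock period."""
    _, worst_ns = chain_delay(chain)
    return worst_ns + setup_ns <= clock_period_ns


# Hypothetical three-gate logic chain.
chain = [
    Gate("nand2", 0.5, 1.5),
    Gate("xor2", 1.0, 2.5),
    Gate("mux2", 0.5, 2.0),
]

best_ns, worst_ns = chain_delay(chain)
print(f"min {best_ns} ns, max {worst_ns} ns")
print("fits a 7 ns clock:", meets_timing(chain, clock_period_ns=7.0, setup_ns=0.5))
```

The check uses only the maximum figures, which is what makes the approach conservative: every gate in the chain is assumed to be as slow as its specification allows, all at the same time.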
But is that really possible? Silicon chips are made in a chemical/mechanical fabrication process, tested, and assigned to various "speed bins." These chips don't all run at the same speed, despite the best attempts by designers and production engineers to make them do so. Instead, a distribution of speeds occurs, with many chips running at a middling clock rate, a few much faster, and some much more slowly—or not at all, which is yet another statistical distribution governing the process.
So when we talk about the worst-case design numbers for a given chip, we're really referring to some propagation delay at which the chip's manufacturer hopes to achieve enough yield for a profitable part. Most manufacturers do not test every chip across all axes (temperature, clock rate, loading, and the full test vector set) because that is too expensive, and experience suggests it's unnecessary. Instead, they do statistical sampling, using a small number of real chips to predict the behavior of all the chips. Then they add a safety factor to cover what they don't know, but they don't tell you what that safety factor is. In fact, the manufacturers themselves don't know precisely what that margin is; that is the nature of a safety margin, and its existence is a universal constant throughout all engineering.
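As a rough illustration of sampling plus a safety factor, the Python sketch below derives a published "worst-case" delay from a handful of measured parts. The function name, the three-sigma estimate, and the 10 percent guard band are all assumptions made for this example; the article does not say how any manufacturer actually computes its figures.

```python
import statistics
from typing import Sequence


def rated_delay_ns(sample_delays_ns: Sequence[float],
                   sigmas: float = 3.0,
                   guard_band: float = 0.10) -> float:
    """Estimate a data-book delay from a small sample of measured chips.

    Hypothetical recipe: take the sample mean, add `sigmas` sample standard
    deviations to cover the slow tail, then pad the result by `guard_band`
    as the undisclosed safety factor.
    """
    mean_ns = statistics.mean(sample_delays_ns)
    spread_ns = statistics.stdev(sample_delays_ns)  # needs at least two samples
    statistical_estimate_ns = mean_ns + sigmas * spread_ns
    return statistical_estimate_ns * (1.0 + guard_band)


# Hypothetical measurements from five sampled parts.
measured_ns = [4.1, 4.3, 3.9, 4.0, 4.2]
print(f"rated worst-case delay: {rated_delay_ns(measured_ns):.2f} ns")
print(f"slowest part actually measured: {max(measured_ns):.2f} ns")
```

The gap between the rated figure and the slowest measured part is the kind of margin that a designer working from the data book never gets to see.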
We computer designers have been living well for the past 35 years. Except for the advanced Schottky debacle of the late 1980s, chips have behaved as their worst-case parameters suggested, and designs incorporating those chips and observing those parameters have been likely to work as intended. So we've designed incredibly complicated microprocessors with a hundred million transistors, most of which must work correctly every single time or the processor will make a nonrecoverable error.
There are clouds gathering on the horizon, though.
In This Issue
As Naresh R. Shanbhag points out in "Reliable and Efficient System-on-Chip Design," virtually all of the technology development trends today point in the wrong direction: Thermal and leakage power are growing exponentially; system noise sources are increasing, while voltage output is decreasing (and noise margins along with it); yet there is still a strong desire to improve system performance.
Microprocessor-circuit engineers have been grappling with noise for at least a decade, but so far architects and microarchitects have been able to ignore it. While designing in the presence of noise may be novel to us in the computer field, it's a staple item in the communications field, whose practitioners may have useful techniques for us to consider. Shanbhag's solution to the problem is to employ information theory to determine achievable bounds on energy/throughput efficiency and to develop noise-tolerance techniques at the circuit and algorithmic levels to approach these bounds.
In "Going Beyond Worst-Case Specs with TEAtime," Gus Uht proposes an idea whose time may be here. TEAtime suggests that if critical paths in a design were shadowed by checker circuits, carefully engineered to be slightly slower than the critical path itself and designed to detect failures in the checker circuit, the resulting machine could run substantially faster or at lower supply voltages.
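The shadowing idea can be modeled very loosely in Python. This is a sketch under stated assumptions, not Uht's design: the function names, the picosecond figures, and the simple step-down search are all hypothetical. The model assumes the checker fails exactly when the clock period drops below its delay, and that the checker is slower than the real critical path by a fixed margin.

```python
def checker_fails(period_ps: int, checker_delay_ps: int) -> bool:
    """The checker misses timing when the clock period is shorter than its delay."""
    return period_ps < checker_delay_ps


def tune_clock_period(critical_delay_ps: int,
                      checker_margin_ps: int,
                      start_period_ps: int,
                      step_ps: int) -> int:
    """Shorten the clock period in steps until one more step would trip the checker."""
    if step_ps <= 0:
        raise ValueError("step_ps must be positive")
    # The checker is deliberately a little slower than the path it shadows.
    checker_delay_ps = critical_delay_ps + checker_margin_ps
    if checker_fails(start_period_ps, checker_delay_ps):
        raise ValueError("starting period is already too short for the checker")
    period_ps = start_period_ps
    while not checker_fails(period_ps - step_ps, checker_delay_ps):
        period_ps -= step_ps
    return period_ps


# Hypothetical part: rated for a 6000 ps clock, but this chip's real critical path is 4000 ps.
tuned_ps = tune_clock_period(critical_delay_ps=4000, checker_margin_ps=200,
                             start_period_ps=6000, step_ps=100)
print(f"tuned period: {tuned_ps} ps (worst-case spec: 6000 ps)")
```

Because the checker always gives out before the real path does, the tuned period stays above the true critical delay, so the speedup in this model comes without an error on the real path.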
While Uht shows a way to shave operating margins while still maintaining error-free operation, in "Making Typical Silicon Matter with Razor," Todd Austin and colleagues propose that "if you aren't failing some of the time, you aren't trying hard enough," a sentiment that I have seldom seen (purposely) applied to an engineering endeavor! The Razor design incorporates self-checking circuits at the flip-flop level to permit pushing clock frequency and supply voltages beyond normal worst-case levels. Razor's premise is that monitoring a microprocessor's real-time operation and recovering from detected errors would effectively subtract out the accumulated baggage of most of the safety margins implicit in all levels of the machine.
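A toy Python model of flip-flop-level self-checking follows. It assumes the commonly described Razor arrangement in which a main flip-flop samples at the clock edge and a shadow latch samples the same signal slightly later; that detail, along with every name and number here, is an assumption for illustration and is not spelled out in this article.

```python
from typing import Tuple


def razor_sample(arrival_ps: int,
                 period_ps: int,
                 shadow_delay_ps: int,
                 old_value: int,
                 new_value: int) -> Tuple[int, bool]:
    """Return (value used by the pipeline, error flag) for one clock cycle.

    The main flip-flop captures new_value only if it arrives by the clock
    edge; the shadow latch gets shadow_delay_ps of extra time. A mismatch
    between the two flags a timing error, and the shadow value is used
    to recover.
    """
    main = new_value if arrival_ps <= period_ps else old_value
    shadow = new_value if arrival_ps <= period_ps + shadow_delay_ps else old_value
    error = main != shadow
    return (shadow if error else main), error


# Data arrives in time: no error, the main flip-flop value is used.
print(razor_sample(arrival_ps=900, period_ps=1000, shadow_delay_ps=300,
                   old_value=0, new_value=1))
# Data arrives late but inside the shadow window: error flagged, value recovered.
print(razor_sample(arrival_ps=1100, period_ps=1000, shadow_delay_ps=300,
                   old_value=0, new_value=1))
```

In this model, a signal that arrives later than the shadow window is missed by both samplers and goes undetected, so the clock can be pushed only as far as the window allows.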
"Speeding Up Processing with Approximation Circuits" by Shih-Lien Lu addresses the general question of how to design circuits and functions to accomplish their tasks without the burden of worst-case design.
Some serious challenges are emerging in the computer design industry. These articles provide a tantalizing and sometimes scary look at a possible shape of things to come. You've heard of thinking outside the box? You can't even see the box from here.
Bob Colwell, Intel's chief IA32 architect through the Pentium II, III, and 4 microprocessors, is now an independent consultant. Contact him at firstname.lastname@example.org.