Issue No.06 - November/December (2001 vol.18)
Published by the IEEE Computer Society
Silicon debugging is becoming more difficult with each generation of microprocessors. The size, speed, and density of recent designs contribute to this complexity. One major problem is failure latency. An error must propagate to an observable point and then be observed; this may happen hundreds of cycles after the failure occurs. The circuit's state can change significantly between a failure's occurrence and observation, and evidence regarding the cause of failure can disappear.
Debugging features such as observation registers and shadow scan can, upon failure detection, dump the circuit's state. However, even this scenario has latency. Large, pipelined designs usually require several cycles before the clock can be stopped and the state dumped. This means that, even in the best cases, the state will change before it's dumped. One solution is to rerun the test and stop before the point of failure, but doing so is often impractical. Many failures are due to rare combinations of events and might not happen again for hours or even days. This problem is getting worse as system size and complexity increase.
Physical chemistry provides us with a solution to this problem. Asimov described thiotimoline, a compound that, when properly purified, dissolves 1.12 seconds before you add water. 1
We have used thiotimoline to build a silicon debugging platform that works as follows. We apply a functional test to two units under test (UUTs) running in lockstep. When the test system detects an error in unit A, a signal alerts special equipment to add water to a thiotimoline sample. Exactly 1.12 seconds before the water is added, the thiotimoline dissolves. The dissolution triggers a signal, which travels to unit B and stops its clock after a programmable number of cycles. The 1.12 seconds between the thiotimoline's dissolution and the addition of water is far longer than the error latency, so unit B stops well before the failure occurs. Once the clock stops, the contents of the observation registers and shadow scan are dumped. Execution then resumes, and the state is dumped periodically until the error is observed in unit B.
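The timing of this procedure can be sketched as simple cycle arithmetic. The following is a minimal illustration, not the authors' test system: the function names, the clock frequency, and the assumption that the watering equipment and signal paths add no delay of their own are all hypothetical. Only the 1.12-second interval comes from the text.

```python
from fractions import Fraction

# 1.12 s: how long dissolution precedes the addition of water (from the text).
LOOKAHEAD_S = Fraction(112, 100)


def lookahead_cycles(clock_hz: int) -> int:
    """Whole clock cycles that fit into the 1.12 s lookahead.

    clock_hz is an assumed UUT clock frequency; the article does not give one.
    """
    return int(LOOKAHEAD_S * clock_hz)


def unit_b_stop_cycle(detect_cycle: int, clock_hz: int, stop_delay_cycles: int) -> int:
    """Cycle at which unit B's clock stops.

    detect_cycle: cycle at which the error is observed in unit A (water is
        assumed to be added at that instant, with no equipment delay).
    stop_delay_cycles: the programmable number of cycles between the arrival
        of the signal at unit B and the clock actually stopping.
    """
    # The signal reaches unit B 1.12 s before detection; it cannot arrive
    # before the test has started, so clamp at cycle 0.
    signal_cycle = max(0, detect_cycle - lookahead_cycles(clock_hz))
    return signal_cycle + stop_delay_cycles


def stops_before_failure(detect_cycle: int, error_latency_cycles: int,
                         clock_hz: int, stop_delay_cycles: int) -> bool:
    """True if unit B is stopped at or before the cycle where the failure occurs."""
    failure_cycle = detect_cycle - error_latency_cycles
    return unit_b_stop_cycle(detect_cycle, clock_hz, stop_delay_cycles) <= failure_cycle
```

With an assumed 1 GHz clock, the lookahead spans over a billion cycles, so a failure latency of a few hundred cycles leaves ample room for the programmable stop delay and the periodic state dumps that follow.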
This test requires two units because, if we used a single unit, its clock would be stopped before the error occurs, and the water would not be added until perhaps much later, if ever. Unfortunately, the need for two UUTs makes the system incapable of diagnosing defects, or errors caused by events such as cosmic-ray strikes, that affect only one of the UUTs.
Although a single application of water dissolves thiotimoline only 1.12 s early, the effect can be chained. If the dissolution of one sample triggers the addition of water to another sample, and the dissolution of that sample triggers the addition of water to yet another, you can build a system that signals an event that will occur much further in the future. However, we have suspended further experiments with thiotimoline because our entire supply went away and is reported to be headed toward Wall Street.
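The arithmetic of such a chain is straightforward. As a rough sketch, assuming each stage contributes exactly 1.12 s and that the equipment linking one sample to the next adds no delay (neither assumption is stated in the text, and the function names are hypothetical):

```python
import math
from fractions import Fraction

# 1.12 s of advance notice per sample (from the text).
LOOKAHEAD_S = Fraction(112, 100)


def chain_lookahead(n_samples: int) -> Fraction:
    """Total advance notice, in seconds, from n_samples chained samples."""
    if n_samples < 0:
        raise ValueError("n_samples must be non-negative")
    return n_samples * LOOKAHEAD_S


def samples_needed(target_s) -> int:
    """Smallest number of chained samples giving at least target_s seconds of notice."""
    return math.ceil(Fraction(target_s) / LOOKAHEAD_S)
```

Under these assumptions, one minute of advance notice would take 54 samples, since 53 samples provide only 59.36 s.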