Issue No.01 - January/February (2010 vol.12)
Published by the IEEE Computer Society
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/MCSE.2010.10
It seems to me that this is an excellent time for research in theory and algorithms for detection and correction of soft errors.
The word "ephemera" derives from ancient Greek, originally meaning things lasting no more than a day. The connection to computational science and engineering will become clear soon, certainly in under a day.
One of the less interesting lectures I've ever heard was on the topic of error detection and correction. This wasn't because the subject is dull. In fact, detecting and correcting errors in digital data involves a fascinating set of ideas and techniques, without which computers and indeed all telecommunications would be impossible. The talk I heard was less than great because the speaker decided that the way to introduce an audience of nonspecialists to the subject was by reciting the excruciating details of a particular computation. "If this number is a one and this number is a one then we add them and get a zero. And then we look at the next bit …" As many CiSE readers know, the idea is to send extra bits and then do a short computation on the received data that will tell you if what was received is what was sent. How exactly this is done and why it works is a deep subject dating back at least to the work of Claude Shannon.
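For readers who would rather see the "extra bits plus a short computation" idea than hear it recited, here is a minimal sketch using the textbook Hamming(7,4) code. It is only one illustration of the general idea, not the scheme the lecture covered, and the function names `hamming74_encode` and `hamming74_decode` are invented for this example. Four data bits are sent with three parity bits; on receipt, three XOR sums (addition where one plus one gives zero) point to the position of any single flipped bit.

```python
def hamming74_encode(data):
    """Encode 4 data bits (a list of 0/1 values) into a 7-bit codeword.

    Positions are numbered 1..7; parity bits sit at positions 1, 2 and 4.
    """
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # checks positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # checks positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # checks positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(word):
    """Return (data_bits, error_position) for a received 7-bit word.

    error_position is 0 when no error is seen; otherwise it is the
    1-based position of the single bit that was corrected.
    """
    w = list(word)
    s1 = w[0] ^ w[2] ^ w[4] ^ w[6]
    s2 = w[1] ^ w[2] ^ w[5] ^ w[6]
    s4 = w[3] ^ w[4] ^ w[5] ^ w[6]
    position = s1 + 2 * s2 + 4 * s4  # the syndrome, read as a binary number
    if position:
        w[position - 1] ^= 1  # flip the offending bit back
    return [w[2], w[4], w[5], w[6]], position


if __name__ == "__main__":
    sent = hamming74_encode([1, 0, 1, 1])
    received = list(sent)
    received[5] ^= 1  # one bit is damaged in transit
    data, position = hamming74_decode(received)
    print("recovered data:", data, "corrected position:", position)
```

The sketch corrects one flipped bit per seven-bit word. Two flips in the same word would be miscorrected, which is why practical systems use stronger codes.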
Shannon's techniques are robust and quite reliable. By the time computers had been around for 15 years, machine errors were extremely rare. A good friend tells me that in 1965 he got into a very heated argument with a fellow graduate student over whether a certain program was buggy or a true machine error had occurred. It got so bad that the guy who had the "bug infested" program finally stomped off in a huff. The argument was never settled because the next time the program was run—without any changes—it ran correctly. No error was found in either the program or the hardware. Possibly this was an example of a "soft" error.
According to Wikipedia, a soft error is data that's wrong, but not because of a programming mistake or hardware failure.
After a soft error, there's no implication that the system is any less reliable than before. If the data is rewritten, the circuit will work perfectly again. Soft errors involve changes to data—the electrons in a storage circuit, for example—but not changes to the physical circuit itself, the atoms. A bit is flipped and the cause of the flip vanishes. The error is ephemeral.
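A toy simulation makes the point concrete. The sketch below is purely illustrative: the buffer contents, the helper name `flip_bit`, and the choice of a CRC-32 checksum (from Python's standard `zlib` module) are assumptions made for the example, not a description of how any real memory system works. One bit of stored data is flipped, the checksum notices, and rewriting the data leaves everything as good as before.

```python
import zlib


def flip_bit(buf, bit_index):
    """Flip one bit of a bytearray in place, imitating a transient upset."""
    buf[bit_index // 8] ^= 1 << (bit_index % 8)


original = b"soft errors are ephemeral"
stored = bytearray(original)        # the "memory cell" contents
checksum = zlib.crc32(original)     # recorded when the data was written

flip_bit(stored, 42)                # the soft error: one bit changes
detected = zlib.crc32(bytes(stored)) != checksum

stored[:] = original                # rewrite the data; the "circuit" is unharmed
recovered = zlib.crc32(bytes(stored)) == checksum

print("error detected:", detected, "- clean after rewrite:", recovered)
```

A checksum like this only detects the flip; recovering without a clean copy to rewrite from needs an error-correcting code.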
Soft errors are thought to be caused by an alpha particle passing through exactly the right part of the circuit at exactly the right time. Soft errors used to be extremely rare events. But now that terabytes of data are accessed and feature sizes on chips are well below 100 nanometers, soft errors seem to be cropping up more often. They're certainly being offered more and more as a reason for program execution failure.
It seems to me that this is an excellent time for research in theory and algorithms for detection and correction of soft errors. I suspect this will be much harder than the classical theory. Let's call it the hard science of soft errors.
Yesterday upon the stair
I saw a man who wasn't there
He wasn't there again today
Oh how I wish he'd go away.
This verse, "Antigonish," was written by William Hughes Mearns in 1899.
I'm grateful to Francis Sullivan for his interesting comments and insight on this subject.
Selected articles and columns from IEEE Computer Society publications are also available for free at http://ComputingNow.computer.org.