Proceedings of 37th Conference on Foundations of Computer Science (1996)
Oct. 14, 1996 to Oct. 16, 1996
D.A. Spielman, Dept. of Math., MIT, Cambridge, MA, USA
We re-introduce the coded model of fault-tolerant computation in which the input and output of a computational device are treated as words in an error-correcting code. A computational device correctly computes a function in the coded model if its input and output, once decoded, are a valid input and output of the function. In the coded model, it is reasonable to hope to simulate all computational devices by devices whose size is greater by a constant factor but which are exponentially reliable even if each of their components can fail with some constant probability. We consider fine-grained parallel computations in which each processor has a constant probability of producing the wrong output at each time step. We show that any parallel computation that runs for time $t$ on $w$ processors can be performed reliably on a faulty machine in the coded model using $w \log^{O(1)} w$ processors and time $t \log^{O(1)} w$. The failure probability of the computation will be at most $t \cdot \exp(-w^{1/4})$. The codes used to communicate with our fault-tolerant machines are generalized Reed-Solomon codes and can thus be encoded and decoded in $O(n \log^{O(1)} n)$ sequential time and are independent of the machine they are used to communicate with. We also show how coded computation can be used to self-correct many linear functions in parallel with arbitrarily small overhead.
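To make the correctness condition of the coded model concrete, the following is a minimal Python sketch under simplifying assumptions. It uses a toy repetition code with majority decoding as a stand-in for the generalized Reed-Solomon codes named in the abstract, and a hypothetical device that negates bits but produces wrong outputs at chosen positions. All function names are illustrative and do not come from the paper.

```python
from collections import Counter

R = 5  # assumed repetition factor (odd, so majority votes never tie)

def encode(bits, r=R):
    """Toy repetition code: repeat each bit r times (stand-in, not Reed-Solomon)."""
    return [b for b in bits for _ in range(r)]

def decode(word, r=R):
    """Decode by majority vote over each block of r symbols."""
    return [Counter(word[i:i + r]).most_common(1)[0][0]
            for i in range(0, len(word), r)]

def logical_not(bits):
    """The function the device is supposed to compute on decoded data."""
    return [1 - b for b in bits]

def faulty_not(word, fault_positions):
    """Hypothetical faulty device: negates every symbol of the codeword,
    except that the outputs at fault_positions come out wrong."""
    out = [1 - s for s in word]
    for p in fault_positions:
        out[p] = 1 - out[p]
    return out

def correct_in_coded_model(f, coded_input, coded_output, r=R):
    """Coded-model check: the decoded output must equal f(decoded input)."""
    return decode(coded_output, r) == f(decode(coded_input, r))

x = [1, 0, 1]
cx = encode(x)

# Two faults in different blocks are absorbed by the code.
few_faults = faulty_not(cx, fault_positions={0, 7})
print(correct_in_coded_model(logical_not, cx, few_faults))

# Three faults in one block of five defeat the majority vote.
many_faults = faulty_not(cx, fault_positions={0, 1, 2})
print(correct_in_coded_model(logical_not, cx, many_faults))
```

The device is judged only on what its input and output decode to, so scattered component faults that the code can correct do not count as errors. The repetition code here has large overhead and is chosen only for readability.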
fault tolerant computing; fault-tolerant parallel computation; fault-tolerant computation; error-correcting code; coded model; parallel computation; failure probability; generalized Reed-Solomon codes; coded computation
D. Spielman, "Highly fault-tolerant parallel computation," Proceedings of 37th Conference on Foundations of Computer Science (FOCS), Burlington, VT, 1996, pp. 154.