<p>The US public switched telephone network (PSTN), a portion of what is possibly the largest distributed system in the world, is also among the most reliable. But why? The author studied outage records maintained by the US Federal Communications Commission (FCC) to find out. The FCC data tally the number and duration of outages as well as the number of customers affected. Although each of these measures is interesting on its own, it became clear that a meaningful comparison required combining them. The author used the concept of customer minutes (the number of affected customers multiplied by the length of the outage in minutes) to provide this comparison. The author also used a classification technique that is general enough to allow comparison with other large distributed systems. Major categories of failure include human error, acts of nature, hardware and software failures, overloads, and vandalism.</p> <p>Analysis of the FCC records shows that although human error causes almost half the outages, such outages are relatively short. Overloads, which are outages accepted as a cost-performance trade-off for the telephone system, affect the most customers for the most minutes.</p> <p>The capability for human intervention is one key to the PSTN's reliability. The system also achieves an excellent trade-off between loose coupling and interaction simplicity, showing that the balance between these characteristics is an important consideration in any distributed-system design. Telephone switch manufacturers also produce extremely reliable software.</p>
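<p>The customer-minutes measure described above can be illustrated with a minimal sketch. The outage records, the function names, and the tuple layout below are all assumptions made for illustration; the numbers are invented and are not taken from the FCC data or from the article.</p>

```python
from collections import defaultdict

# Hypothetical outage records, invented for illustration only;
# these are NOT figures from the FCC data.
# Each record: (failure category, customers affected, duration in minutes)
outages = [
    ("human error", 50_000, 30),
    ("human error", 20_000, 45),
    ("overload", 400_000, 120),
    ("hardware failure", 80_000, 90),
]


def customer_minutes(customers: int, duration_min: int) -> int:
    """Customers affected multiplied by outage length in minutes."""
    return customers * duration_min


def totals_by_category(records):
    """Count outages and sum customer minutes for each failure category."""
    totals = defaultdict(lambda: {"outages": 0, "customer_minutes": 0})
    for category, customers, duration_min in records:
        totals[category]["outages"] += 1
        totals[category]["customer_minutes"] += customer_minutes(
            customers, duration_min
        )
    return dict(totals)


summary = totals_by_category(outages)
for category, stats in summary.items():
    print(category, stats["outages"], stats["customer_minutes"])
```

<p>In this invented sample, one category can account for the most outages while another accounts for the most customer minutes, which is why a combined measure may be more informative than outage counts alone.</p>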

D. R. Kuhn, "Sources of Failure in the Public Switched Telephone Network," in Computer, vol. 30, no. , pp. 31-36, 1997.
