Fault Detection for Byzantine Quorum Systems
September 2001 (vol. 12 no. 9)
pp. 996-1007

Abstract: In this paper, we explore techniques to detect Byzantine server failures in asynchronous replicated data services. Our goal is to detect arbitrary failures of data servers in a system where each client accesses the replicated data at only a subset (quorum) of servers in each operation. In such a system, some correct servers can be out of date after a write and can therefore return values other than the most up-to-date value in response to a client's read request, which complicates the task of determining the number of faulty servers in the system at any point in time. We initiate the study of detecting server failures in this context and propose two statistical approaches for estimating the risk posed by faulty servers based on responses to read requests.

[1] L. Alvisi, D. Malkhi, E. Pierce, and M. Reiter, “Fault Detection for Byzantine Quorum Systems,” Proc. Seventh IFIP Int'l Working Conf. Dependable Computing for Critical Applications, pp. 357-371, Jan. 1999.
[2] E. Amoroso, Intrusion Detection: An Introduction to Internet Surveillance, Correlation, Trace Back, Traps, and Response. Intrusion.Net Books, 1999.
[3] R. Bazzi, “Synchronous Byzantine Quorum Systems,” Proc. 16th ACM Symp. Principles of Distributed Computing, pp. 259-266, 1997.
[4] R.W. Buskens and R.P. Bianchini, Jr., “Distributed On-Line Diagnosis in the Presence of Arbitrary Faults,” Proc. 23rd Int'l Symp. Fault-Tolerant Computing, pp. 470-479, June 1993.
[5] T.D. Chandra and S. Toueg, “Unreliable Failure Detectors for Reliable Distributed Systems,” J. ACM, vol. 43, no. 2, pp. 225-267, 1996.
[6] G. Chokler, D. Malkhi, and M. Reiter, “Backoff Protocols for Distributed Mutual Exclusion and Ordering,” Proc. 21st Int'l Conf. Distributed Computing Systems, Apr. 2001.
[7] S.L. Hakimi and K.-Y. Chwa, “Schemes for Fault-Tolerant Computing: A Comparison of Modularly Redundant and $t$-diagnosable Systems,” Information and Control, vol. 49, pp. 212-238, June 1981.
[8] A. Doudou and A. Schiper, Muteness Detectors for Consensus with Byzantine Processes, Technical Report TR97-230, Dept. of Computer Science, École Polytechnique Fédérale de Lausanne, Oct. 1997.
[9] K.P. Kihlstrom, L.E. Moser, and P.M. Melliar-Smith, “Solving Consensus in a Byzantine Environment Using an Unreliable Failure Detector,” Proc. Int'l Conf. Principles of Distributed Systems, pp. 61-75, Dec. 1997.
[10] L. Lamport, “On Interprocess Communication (Part II: Algorithms),” Distributed Computing, vol. 1, pp. 86-101, 1986.
[11] M.J. Lin, A. Ricciardi, and K. Marzullo, “On the Resilience of Multicasting Strategies in a Failure-Propagating Environment,” Technical Report TR-1998-003, Univ. of Texas at Austin, 1998.
[12] M. Maekawa, “A $\sqrt{N}$ Algorithm for Mutual Exclusion in Decentralized Systems,” ACM Trans. Computer Systems, vol. 3, no. 2, pp. 145-159, May 1985.
[13] J. Maeng and M. Malek, “A Comparison Connection Assignment for Self-Diagnosis of Multiprocessor Systems,” Proc. 11th Int'l Symp. Fault-Tolerant Computing, pp. 173-175, 1981.
[14] M. Malek, “A Comparison Connection Assignment for Diagnosis of Multiprocessor Systems,” Proc. Seventh Int'l Symp. Computer Architecture, pp. 31-35, 1980.
[15] D. Malkhi and M. Reiter, “An Architecture for Survivable Coordination in Large Distributed Systems,” IEEE Trans. Knowledge and Data Eng., vol. 12, no. 2, pp. 187-202, Mar./Apr. 2000.
[16] D. Malkhi and M.K. Reiter, “Byzantine Quorum Systems,” Distributed Computing, vol. 11, no. 4, pp. 203-213, 1998.
[17] D. Malkhi and M.K. Reiter, “Unreliable Intrusion Detection in Distributed Computation,” Proc. 10th IEEE Computer Security Foundations Workshop, pp. 116-124, June 1997.
[18] D. Malkhi, M.K. Reiter, and A. Wool, “The Load and Availability of Byzantine Quorum Systems,” SIAM J. Computing, vol. 29, no. 6, 2000.
[19] D. Malkhi, M. Reiter, A. Wool, and R. Wright, “Probabilistic Quorum Systems,” submitted for publication; brief announcement appears in Proc. 17th ACM Symp. Principles of Distributed Computing, p. 321, June 1998.
[20] R. Motwani and P. Raghavan, Randomized Algorithms. Cambridge Univ. Press, 1995.
[21] K. Shin and P. Ramanathan, “Diagnosis of Processors with Byzantine Faults in a Distributed Computing System,” Proc. 17th Int'l Symp. Fault Tolerant Computing, pp. 55-60, 1987.

Index Terms:
Byzantine fault tolerance, replicated data, quorum systems, fault detection.
Lorenzo Alvisi, Dahlia Malkhi, Evelyn Pierce, Michael K. Reiter, "Fault Detection for Byzantine Quorum Systems," IEEE Transactions on Parallel and Distributed Systems, vol. 12, no. 9, pp. 996-1007, Sept. 2001, doi:10.1109/71.954640