Fault Tolerance via Diversity for Off-the-Shelf Products: A Study with SQL Database Servers
October-December 2007 (vol. 4 no. 4)
pp. 280-294
If an off-the-shelf software product exhibits poor dependability due to design faults, software fault tolerance is often the only way available to users and system integrators to alleviate the problem. Thanks to low acquisition costs, even using multiple versions of software in a parallel architecture, a scheme formerly reserved for a few highly critical applications, may become viable for many applications. We have studied the potential dependability gains from these solutions for off-the-shelf database servers. We based the study on the bug reports available for four off-the-shelf SQL servers, plus later releases of two of them. We found that many of these faults cause systematic, non-crash failures, a category ignored by most studies and standard implementations of fault tolerance for databases. Our observations suggest that diverse redundancy would be effective for tolerating design faults in this category of products. Only in very few cases would demands that triggered a bug in one server cause failures in another one, and there were no coincident failures in more than two of the servers. Using different releases of the same product would also tolerate a significant fraction of the faults. We report our results and discuss their implications, the architectural options available for exploiting them, and the difficulties that they may present.

Index Terms:
Fault tolerance; reliability, availability, and serviceability; relational databases; error processing; design diversity; COTS software; fault records; non-crash failures; database availability; experimental results
Citation:
Ilir Gashi, Peter Popov, Lorenzo Strigini, "Fault Tolerance via Diversity for Off-the-Shelf Products: A Study with SQL Database Servers," IEEE Transactions on Dependable and Secure Computing, vol. 4, no. 4, pp. 280-294, Oct.-Dec. 2007, doi:10.1109/TDSC.2007.70208