Quantifying the Performability of Cluster-Based Services
May 2005 (vol. 16 no. 5)
pp. 456-467

Abstract: In this paper, we propose a two-phase methodology for systematically evaluating the performability (performance and availability) of cluster-based Internet services. In the first phase, evaluators use a fault-injection infrastructure to characterize the service's behavior in the presence of faults. In the second phase, evaluators use an analytical model to combine an expected fault load with measurements from the first phase to assess the service's performability. Using this model, evaluators can study the service's sensitivity to different design decisions, fault rates, and other environmental factors. To demonstrate our methodology, we study the performability of a multitier Internet service. In particular, we evaluate the performance and availability of three soft-state maintenance strategies for an online bookstore service in the presence of seven classes of faults. Among other interesting results, we clearly isolate the effect of different faults, showing that the tier of Web servers is responsible for an often dominant fraction of the service unavailability. Our results also demonstrate that storing the soft state in a database achieves better performability than storing it in main memory (even when the state is efficiently replicated) when we weight performance and availability equally. Based on our results, we conclude that service designers may want an unbalanced system in which they heavily load highly available components and leave more spare capacity for components that are likely to fail more often.
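
The second phase described above can be illustrated with a minimal sketch. This is not the paper's model: it assumes that faults never overlap, that each fault class keeps the service degraded for a fraction of time approximated by MTTR/MTTF (reasonable only when MTTR is much smaller than MTTF), and that phase one yields a single average throughput per fault class. The names `FaultClass` and `evaluate` and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FaultClass:
    name: str
    mttf_s: float         # assumed mean time to failure, in seconds
    mttr_s: float         # assumed time the service stays degraded per fault
    degraded_tput: float  # average throughput (req/s) measured under the fault


def evaluate(normal_tput, offered_load, faults):
    """Combine a fault load with phase-one measurements.

    Returns (average throughput, availability, availability lost per fault
    class). Availability is taken here as served load over offered load.
    """
    avg_tput = 0.0
    degraded_time = 0.0
    lost = {}
    for f in faults:
        frac = f.mttr_s / f.mttf_s  # fraction of time spent under this fault
        degraded_time += frac
        avg_tput += frac * f.degraded_tput
        lost[f.name] = frac * (normal_tput - f.degraded_tput) / offered_load
    # The remaining time is fault-free and served at normal throughput.
    avg_tput += (1.0 - degraded_time) * normal_tput
    return avg_tput, avg_tput / offered_load, lost


# Made-up fault load, for illustration only.
faults = [
    FaultClass("web-server crash", mttf_s=1000.0, mttr_s=100.0, degraded_tput=50.0),
    FaultClass("database crash", mttf_s=10000.0, mttr_s=100.0, degraded_tput=0.0),
]
avg, avail, lost = evaluate(normal_tput=100.0, offered_load=100.0, faults=faults)
worst = max(lost, key=lost.get)  # fault class costing the most availability
```

Breaking the loss down per fault class is what lets an evaluator attribute unavailability to a particular tier, and changing `mttf_s` or `degraded_tput` shows how sensitive the result is to a fault rate or a design decision.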

Index Terms:
Performance, availability, fault tolerance, Internet services.
Kiran Nagaraja, Gustavo Gama, Ricardo Bianchini, Richard P. Martin, Wagner Meira Jr., Thu D. Nguyen, "Quantifying the Performability of Cluster-Based Services," IEEE Transactions on Parallel and Distributed Systems, vol. 16, no. 5, pp. 456-467, May 2005, doi:10.1109/TPDS.2005.61