ROC-1: Hardware Support for Recovery-Oriented Computing
February 2002 (vol. 51 no. 2)
pp. 100-107

We introduce the ROC-1 hardware platform, a large-scale cluster system designed to provide high availability for Internet service applications. The ROC-1 prototype embodies our philosophy of Recovery-Oriented Computing (ROC) by emphasizing detection and recovery from the failures that inevitably occur in Internet service environments, rather than simple avoidance of such failures. ROC-1 promises greater availability than existing server systems by incorporating four techniques applied from the ground up to both hardware and software: redundancy and isolation, online self-testing and verification, support for problem diagnosis, and concern for human interaction with the system.

Index Terms:
Availability, fault tolerance, fault diagnosis, Internet, network servers, computer network management.
Citation:
D. Oppenheimer, A. Brown, J. Beck, D. Hettena, J. Kuroda, N. Treuhaft, D.A. Patterson, K. Yelick, "ROC-1: Hardware Support for Recovery-Oriented Computing," IEEE Transactions on Computers, vol. 51, no. 2, pp. 100-107, Feb. 2002, doi:10.1109/12.980002