37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07)
RAS by the Yard
Edinburgh, UK
June 25-June 28
ISBN: 0-7695-2855-4
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/DSN.2007.80
Different applications require different levels of fault tolerance. Therefore, it is important to create a flexible architecture that allows a customer to choose the appropriate amount of fault tolerance, a concept we call "RAS by the yard." In this paper, we describe a next-generation supercomputer and the design flexibility that allows us to offer a range of alternatives for RAS (reliability, availability, serviceability). In particular, we explain how checkpointing can provide an availability continuum. Design alternatives that improve RAS may be expensive, so it is important to do cost/benefit studies of the alternatives. For a fixed budget and specified system balance ratios, such as Bytes/FLOPS, we analyze the system performance impact of alternative RAS strategies. We show how to optimize the amount of RAS purchased by using a performability measure.
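The abstract does not give the paper's model, so the following is only an illustrative sketch of the kind of trade-off it describes, not the authors' method. It assumes Young's well-known first-order approximation for the checkpoint interval and a simple waste model, and it defines performability as purchased peak performance times the fraction of time spent on useful work. The function names, the option table, and every number (budget, RAS costs, MTBF values, checkpoint and restart times) are hypothetical.

```python
import math


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Young's first-order approximation: tau = sqrt(2 * C * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def useful_fraction(interval: float, checkpoint_cost: float,
                    restart_cost: float, mtbf: float) -> float:
    """First-order estimate of the fraction of time doing useful work.

    Waste = checkpoint overhead (C / tau) plus the expected loss per
    failure (half an interval of rework plus restart), amortized over MTBF.
    """
    waste = checkpoint_cost / interval + (interval / 2.0 + restart_cost) / mtbf
    return max(0.0, 1.0 - waste)


def performability(budget: float, ras_cost: float, flops_per_dollar: float,
                   checkpoint_cost: float, restart_cost: float,
                   mtbf: float) -> float:
    """Peak performance bought with the non-RAS budget, scaled by useful time."""
    tau = young_interval(checkpoint_cost, mtbf)
    peak = (budget - ras_cost) * flops_per_dollar
    return peak * useful_fraction(tau, checkpoint_cost, restart_cost, mtbf)


# Hypothetical RAS options: name -> (RAS cost, resulting system MTBF in seconds).
OPTIONS = {
    "minimal": (0.0, 2_000.0),
    "moderate": (10.0, 10_000.0),
    "maximal": (40.0, 50_000.0),
}


def best_option(budget: float = 100.0, flops_per_dollar: float = 1.0,
                checkpoint_cost: float = 50.0,
                restart_cost: float = 100.0) -> str:
    """Return the option name with the highest performability."""
    return max(
        OPTIONS,
        key=lambda name: performability(
            budget, OPTIONS[name][0], flops_per_dollar,
            checkpoint_cost, restart_cost, OPTIONS[name][1],
        ),
    )


if __name__ == "__main__":
    for name, (cost, mtbf) in OPTIONS.items():
        score = performability(100.0, cost, 1.0, 50.0, 100.0, mtbf)
        print(f"{name}: performability = {score:.1f}")
    print("best:", best_option())
```

Under these made-up numbers, spending nothing on RAS loses too much time to rework, while the most expensive option leaves too little budget for compute, so an intermediate amount of RAS scores highest. That is the sense in which RAS can be bought "by the yard" against a performability measure.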
Citation:
Alan Wood, Swami Nathan, "RAS by the Yard," dsn, pp. 606-611, 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07), 2007