The Community for Technology Leaders
Traditional cloud stacks are designed to tolerate server or rack-level failures, that are unpredictable and uncorrelated. Such stacks successfully deliver highly-available cloud services at global scale. The increasing criticality of cloud services to the overall world economy is causing concern about the impact of power outages, cyber-attacks, configuration errors, or other causes of datacenter or larger-scale failures on cloud availability. Recent experience shows that these events can trigger cascading failures and global-scale service outages. We study the impact of correlated, datacenter resource failures, exploring distributed protocols (widely-used in Cassandra) across varied configurations and resource availability. Our study reveals that using such protocols to achieve high availability on resources with large-scale, correlated outages are costly in storage and update traffic, requiring replication factors of 10 or more. Further analysis reveals that this limitation arises from from inflexible replication and quorum.
(Ver 3.3 (11022016))