Cloud Reliability—Call for Papers


Submission Deadline: 3 September 2017
Publication Date: May/June 2018

Cloud computing allows us to secure and release vast computing resources almost anywhere in the world, near instantaneously. With API-driven infrastructure and platform services, the cloud brings fresh opportunity for building and operating reliable systems. Modern deployed systems are expected not only to be available but also rapidly updatable in place with high quality. The novel capabilities in cloud computing bring new ideas in reliability to the table along the full spectrum of system development and operations.

For systems architecture, there are a number of API-accessible building blocks, from infrastructure and software separators such as “clusters,” “zones,” and “regions” to Paxos-based database and storage systems spanning many data centers.

In networking, cloud-provided software-defined networks have given birth to new service containerization models, advances in load distribution, and innovation in fault handling. Entirely new topologies have emerged that add disposable networks to disposable virtual machines.

With the advent of infrastructure as software, power has shifted to software developers to build faster, operate better, and stay close to customers. This has given rise to “DevOps,” where entire services are owned and operated by engineers who used to be confined to writing code and throwing it over the wall for others to implement and operate. Whole new toolchains have emerged to support a world where a developer can learn something about customer needs in the morning, safely deploy code before lunch, watch customers using it by dinner, and then react quickly to unexpected events (ideally containing the impact before customers notice).

The speed of change in the cloud and its vast array of distributed services create new challenges in testing and monitoring. Chaos Monkeys periodically disrupt production infrastructure under the premise that the only way to assure fault tolerance is to continually survive faults. Testing performance is both easier and harder: easier because the resources to simulate load can be spun up in seconds, and harder because workloads are less predictable in a world that is becoming machine-invoking-machine. Monitoring also enjoys new opportunities in the cloud as the number of things that can be measured vastly increases, but it faces new challenges as the sheer volume of data becomes daunting.

Finally, AI and machine learning offer great promise in upping the availability game in the cloud. Anomaly detection, capacity forecasting, failure prediction, and many other applications harness the rich data available, new engines such as MXNet and TensorFlow, and the native processing capability of specialized processors to meet cloud-scale challenges in managing reliability.

The goal of this special issue is to seek original articles that examine the state of the art and highlight best practices for achieving high availability of online services at internet scale. Topics of interest include but are not limited to:

  • “Shifting the curve”: enabling rapid innovation without sacrificing reliability
  • Using redundancy and graceful degradation to improve availability
  • Isolation: reducing the “blast radius” of an outage (by microservice, zone, region, and so on)
  • DevOps lifecycle: outage detection, mitigation, and prevention
  • Chaos Monkey/Simian Army: running production drills to ensure service availability
  • Effective crisis response: best practices
  • Using automation to improve reliability
  • Using advanced statistical methods and machine learning models to detect and prevent service outages
  • Riding the wave of compute abstraction: ensuring high availability for container and function-as-a-service (Lambda) workloads
  • Measuring the availability of distributed services: holistic approaches and metrics

Important Dates

Paper submission due: 1 September 2017
First review notification: 16 October 2017
Revision submission: 30 November 2017
Second review notification: 28 December 2017
Acceptance notification: 15 January 2018

Special Issue Guest Editors

Yury Izrailevsky, Netflix
Charlie Bell, Amazon

Submission Information

Submissions should be 3,000 to 5,000 words long, with a maximum of 15 references, and should follow the magazine’s guidelines on style and presentation (see­review/magazines for full author guidelines). All submissions will be subject to single-blind, anonymous review in accordance with normal practice for scientific publications.

Authors should not assume that the audience will have specialized experience in a particular subfield. All accepted articles will be edited according to the IEEE Computer Society style guide (

Submit your papers through Manuscript Central at