Companies that thrive in today’s mobile and web-based ecosystems are forever searching for better ways to improve the user experience in their apps. Such companies may already know the benefits of A/B testing, or experimentation, but they may not be aware of newer concepts like chaos testing, a method that introduces controlled failures into their systems to simulate real-world failure conditions. Companies that want to remain on the cutting edge and care about how their brand is perceived can reap dividends by investing in experimentation frameworks.
The normal process of experimentation can be broken down into six steps:
Facilitating this process of experimentation is a laudable goal for companies that want to perfect user experiences. Netflix, Uber, and Grab have well-documented in-house experimentation platforms that automate many of the standard processes. Automation adds guard rails to experiments and allows for a greater degree of control if something goes wrong. Any solution should be lightweight and developer-friendly, should make experiments easy to create, and should allow even non-technical users to schedule experiments and see metrics. Educating the team about these tools and the benefits they bring is also a great way to create a culture that embraces experimentation in a company.
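One way to picture an automated guard rail is a check that compares live metrics against limits and flags the experiment for a halt when a limit is breached. The sketch below is only an illustration: the `Guardrail` class, the `breached_guardrails` function, and the metric names are hypothetical and do not come from any of the platforms named above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Guardrail:
    """Hypothetical guard rail: a metric name and the highest value tolerated."""
    metric: str
    max_value: float


def breached_guardrails(observed: Dict[str, float],
                        guardrails: List[Guardrail]) -> List[str]:
    """Return the names of metrics whose observed value exceeds its limit.

    A metric that is missing from `observed` is treated as not breached.
    """
    return [
        g.metric
        for g in guardrails
        if g.metric in observed and observed[g.metric] > g.max_value
    ]


# Illustrative limits; real thresholds depend on the product and the experiment.
limits = [Guardrail("error_rate", 0.05), Guardrail("p95_latency_ms", 800.0)]
breaches = breached_guardrails({"error_rate": 0.08, "p95_latency_ms": 400.0}, limits)
if breaches:
    print("halt experiment, breached:", breaches)
```

An experimentation platform could run a check like this on a schedule and stop or roll back the experiment whenever the returned list is not empty.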
So, what does chaos testing have to do with experimentation? The short answer is that the two share some goals, and chaos tests often run on the same platform used for experimentation. Chaos testing is the introduction of targeted software or system failures that mimic not just system and hardware issues but also application errors that might lead to a poor experience for end-users.
Examples include increasing latency to a subset of the machines where a service runs or sending bad data to an application.
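A minimal sketch of these two kinds of fault, assuming a hypothetical `FaultInjector` wrapper placed around a service call, might look like the following. The class name, the rates, and the choice of `None` as a stand-in for bad data are all assumptions made for illustration; real chaos tools usually inject faults at the network or infrastructure level rather than in application code.

```python
import random
import time
from typing import Any, Callable, Optional


class FaultInjector:
    """Hypothetical wrapper that adds latency or corrupts results for some calls."""

    def __init__(self,
                 latency_rate: float,
                 corrupt_rate: float,
                 added_latency_s: float = 0.2,
                 rng: Optional[random.Random] = None,
                 sleep: Callable[[float], Any] = time.sleep) -> None:
        self.latency_rate = latency_rate      # fraction of calls that are delayed
        self.corrupt_rate = corrupt_rate      # fraction of calls that return bad data
        self.added_latency_s = added_latency_s
        self.rng = rng or random.Random()
        self.sleep = sleep                    # injectable so tests need not wait

    def call(self, func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        # Fault 1: delay the call for a random subset of requests.
        if self.rng.random() < self.latency_rate:
            self.sleep(self.added_latency_s)
        result = func(*args, **kwargs)
        # Fault 2: replace the real result with a stand-in for bad data.
        if self.rng.random() < self.corrupt_rate:
            return None
        return result
```

Wrapping only a fraction of calls, or only the calls made from selected machines, keeps the blast radius of the experiment small.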
The primary benefit is that chaos testing provides a programmatic way of inducing these failures in an environment, allowing experimenters to test different hypotheses. For example, is an Amazon order flow affected if a non-essential service is unavailable? Chaos testing allows experiments like this to be conducted so that engineers can develop solutions that limit the damage when unforeseen circumstances arise.
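The hypothesis in that example can be written down as a small experiment: run the order flow once normally and once with the non-essential service disabled, then compare outcomes. The sketch below uses a made-up order flow and a made-up recommendations service; none of the names describe any real company's system.

```python
from typing import Dict, List


class ServiceUnavailable(Exception):
    """Raised by the hypothetical non-essential service when chaos is enabled."""


def fetch_recommendations(user_id: str, chaos_enabled: bool = False) -> List[str]:
    # The chaos flag simulates the non-essential service being down.
    if chaos_enabled:
        raise ServiceUnavailable("recommendations disabled by chaos experiment")
    return ["item-1", "item-2"]


def place_order(user_id: str, cart: List[str], chaos_enabled: bool = False) -> Dict:
    try:
        recommendations = fetch_recommendations(user_id, chaos_enabled)
    except ServiceUnavailable:
        recommendations = []  # degrade gracefully: the order must not fail
    return {"status": "confirmed", "items": list(cart),
            "recommendations": recommendations}


def order_flow_survives_outage() -> bool:
    """Hypothesis: the order is confirmed with or without recommendations."""
    control = place_order("user-1", ["book"], chaos_enabled=False)
    treatment = place_order("user-1", ["book"], chaos_enabled=True)
    return control["status"] == "confirmed" and treatment["status"] == "confirmed"
```

If `place_order` did not catch the exception, the treatment run would fail, and the experiment would have exposed a hidden hard dependency on a service that was supposed to be optional.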
Chaos testing has additional benefits too. It helps teams discover unknown couplings and dependencies in a microservices architecture. Sometimes a developer will use libraries or code with hidden dependencies that went unnoticed before testing. Code regressions can also be revealed during chaos testing. Chaos testing can show whether the retry logic or circuit breakers were properly configured, and whether there are any unnecessary dependencies between services. It can answer questions such as: Can some of the data be cached rather than fetched from these dependencies every time? Can the service fall back to an older version of the data if a dependency is down?
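The circuit-breaker and cached-fallback ideas can be sketched together. The code below is a simplified illustration under stated assumptions, not a production pattern: `CircuitBreaker`, `get_with_fallback`, the failure threshold, and the reset interval are all hypothetical, and a real service would more likely rely on an established resilience library.

```python
import time
from typing import Any, Callable, Dict, Optional


class CircuitBreaker:
    """Hypothetical breaker: opens after N consecutive failures, retries later."""

    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True  # closed: calls go through
        # Open: allow a trial call only once the reset interval has passed.
        return self.clock() - self.opened_at >= self.reset_after_s

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


def get_with_fallback(fetch: Callable[[str], Any], breaker: CircuitBreaker,
                      cache: Dict[str, Any], key: str) -> Any:
    """Try the dependency; on failure or open breaker, serve the cached value."""
    if breaker.allow():
        try:
            value = fetch(key)
        except Exception:
            breaker.record_failure()
        else:
            breaker.record_success()
            cache[key] = value  # refresh the fallback copy
            return value
    if key in cache:
        return cache[key]  # possibly stale, but the caller still gets data
    raise LookupError(f"no live or cached value for {key!r}")
```

A chaos experiment that takes the dependency down would then check two things: that callers keep receiving the cached value, and that the breaker stops sending traffic to the failed dependency once the threshold is reached.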
For those already on board with chaos testing, the focus should be on making chaos experiments easier to run. It also helps to educate others about chaos testing’s primary goal, product resiliency, which is the main difference between it and traditional experimentation.
There are some best practices to consider when introducing experimentation and chaos testing into a company’s processes.
Chaos testing works best within an experimentation infrastructure, so for the price of one investment, companies can reap the benefits of two tools. Experimentation and chaos testing are powerful tools for gaining insight into a company’s user base and for designing user experiences that hold users’ interest and lead to the intended outcomes. These tools can help a company develop a fanbase that is devoted to its app or service. The goal is to provide experiences that both enable users to accomplish their tasks quickly and keep critical functions available when some supporting systems have failed. This is what product resiliency is all about. A company without such tools is missing out on critical data that can help it make more reliable decisions.
Rohan Tiwari is a software engineer manager at a top tech company. He has been a computer technology specialist for 10 years, heading up engineering departments in the United States, India, and Singapore. For more information, please send emails to rohan_tiwari1@hotmail.com.