Failing Gracefully and How Chaos Testing Boosts the Bottom Line
Chaos may be the kind of turmoil that most people want to avoid, but when it comes to testing the large-scale systems that power computers, software, and apps of all kinds, a little prepared pandemonium on the testing block goes a long way toward ensuring product integrity and consumer satisfaction.
To ferret out bugs that may have gone undetected in the initial testing of a product, software engineers must deliberately introduce trouble into a system. Why? Because nothing kills bottom-line sales and the reputation of a company faster than a nasty glitch that renders an app inoperable.
Chaos testing is a crucial layer of product development that requires time, money, and commitment from tech teams. Consider the fact that today’s large-scale consumer apps operate on millions of machines that handle many millions of user requests every day. The code behind the microservices that power these apps evolves with every test before consumers put it through the ultimate, practical trial of everyday use.
Chaos testing at its best is achieved when a system fails gracefully. Such an elegant flop in real-world applications means that the user would still be able to use the critical function of an application even if some of the bugs have not been identified or fixed beforehand. Chaos testing is, therefore, a critical step to ensure dependability in software and hardware applications. Specialists in this area can test whether the critical functionality of an app works even when dependencies fail and whether reliability controls like retries and circuit breakers are configured correctly.
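To make graceful failure concrete, here is a minimal Python sketch of a circuit breaker with a fallback. It is an illustration under stated assumptions, not a production implementation: the class, the `flaky_recommendations` dependency, and the `render_home_page` function are hypothetical names, and a real breaker would also reset itself after a cool-down period.

```python
class CircuitBreaker:
    """Opens after `max_failures` consecutive failures.

    While open, calls skip the dependency and go straight to the fallback.
    Simplification: there is no timed reset (half-open state).
    """

    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failures = 0

    @property
    def is_open(self):
        return self.failures >= self.max_failures

    def call(self, func, fallback):
        if self.is_open:
            return fallback()
        try:
            result = func()
        except Exception:
            self.failures += 1
            return fallback()
        self.failures = 0  # a success clears the failure streak
        return result


def flaky_recommendations():
    # Hypothetical non-critical dependency that is down during the chaos test.
    raise ConnectionError("recommendation service unavailable")


def render_home_page(breaker):
    # The critical function (returning a page) survives the failure;
    # only the recommendations degrade to an empty list.
    recs = breaker.call(flaky_recommendations, fallback=lambda: [])
    return {"status": "ok", "recommendations": recs}
```

A chaos test against this sketch would confirm two things: the page still renders while the dependency is failing, and the breaker opens after the configured number of failures so the broken service stops being called.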
How is chaos testing performed?
The first step in chaos testing is to define a chaos experiment. The system should be divided into two groups: a “control” group that is left untouched, and a “treatment” group where the engineer selectively introduces chaos, or bugs, into the system.
The engineer starts with a working hypothesis, which is built on a definition of the system’s steady state: what normal behavior looks like when nothing is going wrong. Once chaos has been introduced, the critical business functionality is tested to see whether that steady state still holds.
To run chaos tests, an engineer defines a chaos experiment that introduces controlled failures into the system within an environment of choice for a specific amount of time. The definition of the experiment should describe how far the system can deviate from its Service Level Objective (SLO) and still be usable. Internally, the chaos experiment introduces chaos into the system via fault injection. Fault injection, the key tooling used to perform chaos testing, starts with a specific fault that may cause further faults in the system, leading to chaos.
A bottom-line truth is that if a company offers apps that are complicated and involve millions of users and hundreds of microservices, chaos testing matters. Improving resiliency in these systems is important, but the trick is to find unknown dependencies between microservices before they affect the app in the hands of the consumer.
Another complication is that hundreds of software engineering teams are working on hundreds of moving pieces at once. To analyze results, software engineers must determine what changes were made to the systems and in what order.
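A steady-state hypothesis can be written down as a simple comparison between the control and treatment groups. The Python sketch below assumes the steady state is measured as a request success rate and that a drop of up to five percentage points is acceptable; both the metric and the threshold are illustrative assumptions.

```python
def success_rate(outcomes):
    """Fraction of requests that succeeded; `outcomes` is a list of booleans."""
    return sum(outcomes) / len(outcomes)


def hypothesis_holds(control, treatment, max_drop=0.05):
    """True if the treatment group stays within `max_drop` of the control group."""
    return success_rate(control) - success_rate(treatment) <= max_drop


# Example: the control group sees no failures, the treatment group loses 3 of 100.
control = [True] * 100
treatment = [True] * 97 + [False] * 3
print(hypothesis_holds(control, treatment))
```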
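The pieces of an experiment definition described above (fault, environment, duration, and tolerated SLO deviation) might be captured roughly as follows. This is a sketch, not the format of any particular chaos tool; the field names and the `inject_faults` wrapper are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class ChaosExperiment:
    name: str
    environment: str          # e.g. "staging"
    duration_seconds: int     # how long the fault stays active
    failure_rate: float       # fraction of calls that should fail
    slo_success_rate: float   # the objective under normal conditions
    allowed_deviation: float  # how far below the SLO is still "usable"

    def within_tolerance(self, observed_success_rate):
        return observed_success_rate >= self.slo_success_rate - self.allowed_deviation


def inject_faults(func, failure_rate, rng):
    """Wrap `func` so that a fraction of calls raise instead of running."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise RuntimeError("injected fault")
        return func(*args, **kwargs)
    return wrapper


experiment = ChaosExperiment(
    name="lookup-failures",
    environment="staging",
    duration_seconds=300,
    failure_rate=0.1,
    slo_success_rate=0.999,
    allowed_deviation=0.02,
)
# A seeded generator keeps the injected failures reproducible between runs.
faulty_lookup = inject_faults(lambda key: key.upper(), experiment.failure_rate, random.Random(42))
```

Passing in a seeded random generator is a deliberate choice: it lets the team replay the same sequence of failures when analyzing a result.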
Inserting bugs to see how systems operate under stressful conditions is chaos testing at its core, but discovering unknown dependencies and performance bottlenecks is equally crucial. This is the point where graceful failing becomes apparent. Teams should consider investing in platforms and libraries that enable running chaos tests in a controlled manner. Chaos practitioners recommend a platform that makes it easy to introduce chaos into the system, test outcomes, and monitor results. Software libraries allow tech professionals to introduce chaos programmatically, write tests, and standardize the process enough that people use it in the same or similar ways.
Specialist teams must develop testing tools that are not only easy to use but also easy to stop in the middle of a test. Being able to stop a test quickly is essential, since nobody wants to break every component of a system just to prove that it works. A kill switch that automatically terminates experiments is a must.
Many companies choose off-the-shelf chaos testing tools to avoid spending the time, money, and personnel needed to build their own. What they gain in convenience may be lost in the customization needed to suit their specific needs.
Determining whether chaos testing affects users is among the most important aspects of any experiment. Developers must start in their internal environment and make sure that the code under test is separate from the code serving live users. A good rule of thumb is to run chaos tests in live systems only after users have been divided into cohorts, which mitigates risk and limits the “blast radius” to a single cohort in case something goes wrong.
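An automatic kill switch can be as simple as checking a health metric between experiment steps and aborting when it crosses a limit. The Python sketch below assumes each step can report an error rate; the function name, the step format, and the threshold are hypothetical.

```python
def run_with_kill_switch(steps, max_error_rate):
    """Run experiment steps in order and stop at the first unhealthy one.

    `steps` is a list of (name, measure_error_rate) pairs, where
    `measure_error_rate` is a callable returning the error rate observed
    while that step's fault was active.
    """
    completed = []
    for name, measure_error_rate in steps:
        rate = measure_error_rate()
        if rate > max_error_rate:
            # Kill switch: abort before any later, more disruptive steps run.
            return {"aborted": True, "aborted_at": name, "completed": completed}
        completed.append(name)
    return {"aborted": False, "aborted_at": None, "completed": completed}


steps = [
    ("add_latency", lambda: 0.01),
    ("drop_cache", lambda: 0.20),
    ("kill_database_replica", lambda: 0.02),
]
result = run_with_kill_switch(steps, max_error_rate=0.05)
```

In this example the run stops at `drop_cache`, so the final step never executes.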
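One common way to divide users into cohorts is to hash a stable user identifier, so that each user always lands in the same cohort. The sketch below assumes string user IDs and 100 cohorts; the function names are hypothetical and the cohort count is an arbitrary choice.

```python
import hashlib


def cohort_for(user_id, num_cohorts=100):
    """Map a user ID to a stable cohort number in [0, num_cohorts)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_cohorts


def in_blast_radius(user_id, chaos_cohorts, num_cohorts=100):
    """True only for users whose cohort was selected for the experiment."""
    return cohort_for(user_id, num_cohorts) in chaos_cohorts


# Expose only cohort 7 (roughly 1% of users) to the experiment.
chaos_cohorts = {7}
affected = [u for u in ("alice", "bob", "carol") if in_blast_radius(u, chaos_cohorts)]
```

Because the assignment depends only on the user ID, a user who sees degraded behavior during an experiment can be traced back to the cohort that was under test.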
In a Netflix Tech Blog post, the unnamed authors asserted that chaos testing with a tool called Chaos Monkey proved invaluable. “Since we knew that server failures are guaranteed to happen, we wanted those failures to happen during business hours when we were on hand to fix any fallout,” the writers said. “We knew that we could rely on engineers to build resilient solutions if we have the context to expect servers to fail.”
Mistakes companies make
The biggest mistake that companies make during chaos testing is doing too much, too soon. If corners are cut in favor of speed, the time to analyze results may be compromised. Start with a known database of failure modes so that engineers know in advance what the expected failures and expected outcomes are. Introduce chaos into the system incrementally: analyze how the system behaved under each chaos experiment, and then slowly introduce more and more bugs to improve the system overall.
Another issue that many companies encounter is that they do not conduct chaos testing on an appropriate scale. If a live system has 100 million users, the environment used to introduce chaos should not have 100 users. The test environment needs to be on a scale comparable to the real user count, or the chaos testing will not be relevant. Decent chaos testing coverage is also required to make sure all critical business functions work properly.
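The incremental approach can be sketched as a loop over a catalog of known failure modes, injecting one fault at a time and pausing as soon as the system behaves differently than expected. The catalog entries and the `observe` callback below are hypothetical placeholders for a team's own failure database and monitoring.

```python
FAILURE_MODES = [
    {"fault": "cache_unavailable", "expected": "served_from_database"},
    {"fault": "recommendations_timeout", "expected": "empty_recommendations"},
    {"fault": "payment_gateway_down", "expected": "order_queued"},
]


def run_incrementally(failure_modes, observe):
    """Inject one fault at a time; stop at the first unexpected outcome.

    `observe` is a callable that injects the named fault and returns the
    behavior that was actually seen.
    """
    results = []
    for mode in failure_modes:
        observed = observe(mode["fault"])
        matched = observed == mode["expected"]
        results.append((mode["fault"], matched))
        if not matched:
            break  # analyze this surprise before adding more chaos
    return results


# Stand-in for real monitoring: the second fault produces a surprise.
observed_behavior = {
    "cache_unavailable": "served_from_database",
    "recommendations_timeout": "page_error",
    "payment_gateway_down": "order_queued",
}
results = run_incrementally(FAILURE_MODES, observed_behavior.get)
```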
Chaos testing is necessary to identify potential problems in a system before live users discover that their app is giving them a poor experience, or worse, is completely unusable.
Although time and money must be spent on the front end of development, the investment pays off in a reputation for reliability: existing customers become loyal customers and, through word of mouth, the business flourishes.
About the Author:
Rohan Tiwari is a software engineering manager at a top tech company. He has been a computer technology specialist for more than 10 years, heading up engineering departments in the United States, India, and Singapore. For more information, please send an email to firstname.lastname@example.org.