Conquering chaos to drive digital resiliency: How to avoid the top 5 pitfalls

"The battlefield is a scene of constant chaos. The winner will be he who controls that chaos, his own and that of the enemy."

This quote by Napoleon Bonaparte, the famous French statesman and military leader, aptly summarizes why controlling chaos is vital to winning.

For a team of developers, managing a complex, distributed system is like fighting a battle. If you have ever dealt with an unexpected failure in your production system, you know how valuable it is to identify breakdowns before they become outages. This is why you need chaos engineering.

The method to the madness: How does chaos engineering work?

The best way to begin is by creating a well-planned chaos experiment to validate the expected behavior of an application. Start by analyzing what could go wrong with the system. What will happen when a third-party service fails? Based on this, create a hypothesis and run a controlled experiment, simulating the failure to measure the impact. Broadly, these are the "dos" in chaos engineering.
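As a rough illustration of that loop, the sketch below simulates a third-party failure and checks one hypothesis: the application still serves every request by falling back to a cached value. Everything here is an assumption made for the example, including the function names (`fetch_exchange_rate`, `price_in_eur`), the rates, and the idea of a cached fallback; a real experiment would inject the fault into a live dependency rather than a stub.

```python
class ThirdPartyDown(Exception):
    """Raised by the stand-in dependency while the injected fault is active."""


def fetch_exchange_rate(fail):
    # Hypothetical third-party call; `fail` simulates the injected outage.
    if fail:
        raise ThirdPartyDown("rate service unavailable")
    return 1.10


def price_in_eur(amount_usd, fail, cached_rate=1.08):
    # System under test: falls back to a cached rate if the dependency fails.
    try:
        rate = fetch_exchange_rate(fail)
    except ThirdPartyDown:
        rate = cached_rate
    return round(amount_usd / rate, 2)


def run_experiment(requests=100):
    # Hypothesis: with the dependency down, every request still gets a price.
    served = 0
    for _ in range(requests):
        try:
            price_in_eur(50.0, fail=True)
            served += 1
        except Exception:
            pass  # an unserved request counts against the hypothesis
    success_rate = served / requests
    return {"success_rate": success_rate, "hypothesis_holds": success_rate == 1.0}


if __name__ == "__main__":
    print(run_experiment())
```

If the fallback were missing, the success rate would drop and the hypothesis would be rejected, which is exactly the kind of weakness the experiment is meant to surface.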

Knowing what not to do is often as important as knowing what to do. So, before you run a chaos experiment, it's essential to be prepared for the challenges you might face in your journey.

The challenges

You must avoid these five major pitfalls to implement the chaos engineering framework successfully:

Choosing the wrong tool

Chaos engineering experiments must begin with understanding the system's steady state. After that, you articulate a hypothesis, then verify it and learn from the experiments to improve the system's resiliency. While it may seem easy to deploy a chaos engineering tool that runs automated chaos experiments and simulates real-world failures, the real question is which tool to choose, and how. Selecting the right tool is time-consuming because the available tools differ widely in features and functionality.

Picture this: You choose Chaos Monkey, one of the most frequently used tools. While it enables you to carry out random instance failures, does it cover a broad spectrum of experimentation and failure injection? After all, chaos engineering is about looking into an application's future and anticipating unpredictable failures that go beyond the loss of a single AWS instance.

Solution

There are several open-source and commercial chaos engineering tools. Selecting the most appropriate one (considering its features, ease of use, system platform support, and extensibility) starts with knowing your system and its architecture well. A good practice is to explore and compare a suite of tools and techniques to induce failures and build a resilient system.

For example, if your application is hosted on Amazon Web Services (AWS), tools like Gremlin and Litmus offer excellent experiments. On the other hand, if you are looking at security-specific experiments, ChaoSlingr and Infection Monkey may be your best bet.

Another important factor: because these tools depend heavily on observability and monitoring, make sure you can fulfill their dependencies before selecting one. Gremlin offers private network integrations that enable users to leverage Status Checks and Webhooks within the private network. This means you don't have to expose internal endpoints to the public Internet. With the Webhooks feature, teams can easily push data from Gremlin to their observability and incident management tools.

Are you planning to run a chaos engineering experiment on a Kubernetes cluster? Check out our article for a step-by-step guide that walks you through the process.

Beginning with an incorrect blast radius

Blast radius refers to the extent of an artificial outage introduced into the system as part of a chaos experiment. One of the key pitfalls of chaos engineering is keeping the same blast radius for every experiment. While a small blast radius may give you a false sense of robust resilience, a large blast radius may bring down the entire system – causing more harm than good!

Solution

In a chaos experiment carried out by a SaaS customer, the team injected latency into a service call. This led to a disaster: multiple components went down during the experiment, causing the entire system to shut down. All this could easily have been prevented if the team had limited the blast radius.

Minimizing the blast radius lets you identify system weaknesses without accidentally breaking the system. Ask yourself these questions:

What will be the impact on the front-end and back-end?
Will the customers be impacted? If yes, how?
Which functionality will be impaired?

To reduce the risk of failure, start with a small blast radius. You can then scale the attack by increasing the blast radius and magnitude step by step, applying the mitigation techniques you learn along the way to make the system more resilient.
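One way to picture this ramp-up is the sketch below, which widens the set of targeted hosts in stages and stops as soon as the observed error rate crosses a limit. The host names, the stage fractions, the 5% abort threshold, and the `error_rate_for` callback are all assumptions for illustration; in practice the error rate would come from your monitoring system.

```python
import random


def pick_targets(hosts, fraction, seed=0):
    # Choose a reproducible subset of hosts; always at least one.
    count = max(1, int(len(hosts) * fraction))
    return random.Random(seed).sample(hosts, count)


def ramp_blast_radius(hosts, steps, error_rate_for, abort_threshold=0.05):
    # Widen the attack stage by stage; stop once the error rate is too high.
    history = []
    for fraction in steps:
        targets = pick_targets(hosts, fraction)
        error_rate = error_rate_for(targets)  # hypothetical measurement hook
        history.append((fraction, len(targets), error_rate))
        if error_rate > abort_threshold:
            break  # halt here and fix the weakness before going wider
    return history


if __name__ == "__main__":
    hosts = [f"host-{i}" for i in range(20)]
    # Stand-in measurement: pretend errors grow with the number of hosts hit.
    for stage in ramp_blast_radius(hosts, [0.05, 0.25, 0.5, 1.0],
                                   lambda targets: len(targets) / 100):
        print(stage)
```

With the stand-in measurement above, the ramp stops at the third stage and never reaches the full fleet.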

Not having a well-defined Standard Operating Procedure (SOP)

One of our enterprise customers planned a GameDay to test and prepare their system. However, the incident management systems critical to the recovery process failed due to unknown dependencies triggered by the injected fault.

In such a scenario, it would have helped if the client had a well-defined SOP with processes to recover from a failure without losing critical sprint hours.

Solution

Irrespective of how easy or complicated the chaos experiment is, it is highly recommended to have an SOP with pre-defined steps to follow if the experiment does not go as planned.

Some of the key elements of a well-rounded SOP include:

test background and purpose
roles and responsibilities of the team members
testing process
quality assurance/quality control
troubleshooting procedure

These factors reduce the risk if anything goes wrong and help teams follow a procedure to restore everything to normal.
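Part of an SOP can also be encoded directly in the experiment harness. The sketch below is a minimal example, assuming the recovery steps can be expressed as plain Python callables: a context manager that always runs the rollback steps, even when the experiment fails midway. The name `chaos_experiment` and the log format are hypothetical.

```python
from contextlib import contextmanager


@contextmanager
def chaos_experiment(name, rollback_steps, log):
    # Run the experiment body and always execute the rollback steps afterwards.
    log.append(f"start: {name}")
    try:
        yield
    except Exception as exc:
        # Record the failure instead of crashing, so recovery still proceeds.
        log.append(f"aborted: {exc}")
    finally:
        for step in rollback_steps:
            step()
        log.append("rollback complete")


if __name__ == "__main__":
    state = {"latency_ms": 0}
    log = []

    def restore_latency():
        state["latency_ms"] = 0

    with chaos_experiment("inject latency", [restore_latency], log):
        state["latency_ms"] = 500  # the injected fault
        raise RuntimeError("incident tooling unreachable")

    print(state, log)
```

This does not replace the written procedure (roles, escalation, and troubleshooting still need people), but it keeps the "restore everything to normal" step from depending on someone remembering it under pressure.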

Not recording the experiment observations

How will you put effort into fixing a problem you do not even know exists? With no visibility into your system, you can't understand how it performs, troubleshoot an incident, or make informed operational decisions.

Solution

Observability and chaos engineering go hand in hand. Observability gives you a holistic view of your systems, and dashboards that graph availability and performance in real time help you understand the impact of chaos.

So, whenever you experiment, consider these points:

start by identifying the metrics that will help you confirm or refute your hypothesis
record measurements before testing to establish a baseline
record them during the experiment to observe the expected and unexpected changes
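The points above can be sketched as a simple baseline comparison. The example assumes latency samples in milliseconds and flags the experiment when the mean shifts by more than three baseline standard deviations; both the metric and the threshold are illustrative choices, not a recommendation.

```python
from statistics import mean, stdev


def compare_to_baseline(baseline, during, tolerance=3.0):
    # Flag a deviation if the mean moves more than `tolerance` baseline stdevs.
    baseline_mean = mean(baseline)
    experiment_mean = mean(during)
    shift = experiment_mean - baseline_mean
    limit = tolerance * stdev(baseline)
    return {
        "baseline_mean": baseline_mean,
        "experiment_mean": experiment_mean,
        "shift": shift,
        "deviates": abs(shift) > limit,
    }


if __name__ == "__main__":
    before = [100, 102, 98, 101, 99]   # recorded before the experiment
    during = [150, 160, 155]           # recorded while the fault is active
    print(compare_to_baseline(before, during))
```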

Inability to replicate the same experiment to validate the hypothesis

You experiment and record the results. But how can you replicate the same experiment to validate the hypothesis and show developers the bug they need to fix? Site Reliability Engineers (SREs) face these challenges if they do not understand the application's infrastructure and do not know how to bring the system back to its steady state (its normal, expected operating condition).

Solution

Okay, so your chaos experiment was successful. Now you need to make a case for engineers to log a bug in their backlog, which means re-running the same experiment and showing the same results. Hence, it's highly recommended that you collect all the required data by taking screenshots from the first step to the last step of the experiment, and that you have a complete data backup ready to restore the system to its original state.
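Alongside screenshots, it can help to store the experiment definition itself so that a re-run uses exactly the same parameters. The sketch below assumes an experiment can be described by a handful of fields and a random seed; the `ExperimentSpec` fields and the JSON layout are hypothetical, not a format used by any particular tool.

```python
import json
import random
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ExperimentSpec:
    name: str
    fault: str
    target_fraction: float
    seed: int  # fixes the random choice of targets across re-runs


def select_targets(spec, hosts):
    # The same spec and host list always produce the same targets.
    count = max(1, int(len(hosts) * spec.target_fraction))
    return sorted(random.Random(spec.seed).sample(hosts, count))


def save_spec(spec):
    return json.dumps(asdict(spec), sort_keys=True)


def load_spec(text):
    return ExperimentSpec(**json.loads(text))


if __name__ == "__main__":
    hosts = [f"host-{i}" for i in range(8)]
    spec = ExperimentSpec("checkout-latency", "latency_300ms", 0.25, seed=42)
    replay = load_spec(save_spec(spec))
    print(select_targets(spec, hosts) == select_targets(replay, hosts))
```

Attaching the saved spec to the bug report gives developers something they can run themselves instead of relying on a description of what happened.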

Conclusion

If the pandemic has taught us anything, it is to be prepared for chaos. Chaos engineering helps you prepare for the worst-case scenarios and unexpected outages that your infrastructure, applications, and systems might face. Watch out for these pitfalls to implement chaos engineering experiments successfully. To learn more and gain from our experience, connect with our experts!

Article 2 Feb 2022 6 min read

Neharika Gianchandani

Tarun Khosla