Shift Right with reliability testing: An approach to test confidently for a resilient system

As application owners, we all aim to deliver flawless functionality, an incredible user experience, and uninterrupted services to our customers. Even though we put in 100% effort to ensure that the product’s quality is top-notch, it is bound to have some glitches. The only solution is to perform rigorous testing to make the software more reliable.

Ever since Agile and DevOps methodologies were introduced in software development, ‘shift left’ and ‘shift right’ have emerged as core testing concepts. While shift left has been a popular trend in continuous testing practices for a while, shift right is now one of the rapidly emerging trends in software testing.

Shift left vs. shift right

Shift right testing does not replace shift left; rather, the two complement each other.

Shift left testing executes quick, automated, repetitive tests to identify bugs and possible risks at critical phases of software development.

On the other hand, shift right methodology focuses on monitoring user behavior, usage patterns, performance metrics, and security indicators to ensure the smooth operation of the software.

The team can promptly identify any overlooked bug or optimization possibility by employing shift right testing. Guided by user feedback, the team can also address disruptions in the user experience and rectify any anomalies or performance gaps, ensuring that users do not have to go through a bad UX.

This blog will explore the what, why, and how of shift-right and chaos engineering/reliability testing.

What is shift right in software testing?

Shift right is an approach that involves conducting testing, quality assessments, and performance evaluations in a production environment under real-world conditions. This approach ensures that applications running in production can handle actual user loads while maintaining high levels of quality.

By implementing shift right practices, DevOps teams thoroughly test the applications' functionality, resilience, and reliability in a live environment. One of the objectives is to identify and resolve issues that may be challenging to anticipate in the pre-production environment.

One such approach is chaos testing, which involves deliberately introducing controlled failures into a system to observe how it behaves under stressful or unexpected conditions. It helps in proactively identifying system vulnerabilities, enabling teams to mitigate risks and enhance the system's reliability and resilience.
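To make the idea of a controlled failure concrete, here is a minimal sketch in Python. It is not taken from any chaos tool: the names with_fault_injection, InjectedFault, fetch_price, and fetch_price_with_fallback are hypothetical, and the "service" is just a local function. The wrapper fails a configurable share of calls on purpose so that you can observe whether the caller degrades gracefully.

```python
import random


class InjectedFault(Exception):
    """Raised in place of a real failure so it is easy to spot in logs."""


def with_fault_injection(func, failure_rate, rng):
    """Wrap func so that a share of calls fail on purpose.

    failure_rate is a probability between 0.0 and 1.0. rng is a
    random.Random instance, which keeps runs repeatable.
    """
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0.0 and 1.0")

    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)

    return wrapper


def fetch_price(item):
    # Hypothetical downstream call standing in for a real service.
    return {"item": item, "price": 10}


def fetch_price_with_fallback(fetch, item):
    # The behavior under observation: does the caller degrade gracefully?
    try:
        return fetch(item)
    except InjectedFault:
        return {"item": item, "price": None}


flaky_fetch = with_fault_injection(fetch_price, failure_rate=0.3, rng=random.Random(7))
results = [fetch_price_with_fallback(flaky_fetch, "book") for _ in range(100)]
degraded = sum(1 for r in results if r["price"] is None)
print(f"{degraded} of {len(results)} calls hit the injected fault and fell back")
```

Keeping failure_rate as an explicit parameter is one simple way to keep the failure controlled: you can start near zero and raise it only once the fallback path has proven itself.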

Why is shift right testing a must?

With the advent of large-scale, mission-critical applications with complex architectures and the slicing and dicing of so many features and services into manageable chunks, organizations must overcome the fear of chaos/reliability testing in production.

Here’s why shift right testing is important:

Test in the environment that really matters: Lower environments typically vary from the actual production environment in terms of configuration, infrastructure, component setup, and deployment. It may not always be feasible to create a lower environment that is an exact replica of production. Since systems may behave differently depending on environment and traffic patterns, shift-right testing done in a controlled manner enables teams to test in the real-world production environment. This, in turn, also helps optimize the user experience by addressing issues that users might otherwise face.
Test against real user journeys/behaviors: Testing in lower environments is often performed along with load generation to simulate traffic, so it may not be possible to cover different user journeys and mimic user behavior. Setting up synthetic testing can help overcome this to some extent, but it is far more challenging to simulate the exact user journey or behavior.
Stay prepared for the worst-case scenario: If you plan to test your application in production, you will prepare thoroughly beforehand so that the application doesn’t fail. This leads you to push the boundaries and identify scenarios that you may not have even thought of before; after all, you cannot risk the production environment going down. This is the effect that shift right testing has: it makes you prepare for the worst-case scenario so that failures don’t occur in the first place.

When you perform shift right testing, it helps you in:

Testing the known knowns: These are the things you are aware of and understand.
Checking unknown knowns: These are the things you understand but are not aware of.
Experimenting with known unknowns: These are the things you are aware of but don’t fully understand.
Discovering unknown unknowns: These are the things you are neither aware of nor fully understand.

So in the production environment, you should start by testing the known knowns, rather than by checking unknown knowns or experimenting with known unknowns.

How to do chaos testing in production

Chaos testing is a technique used to intentionally introduce failure scenarios in a production environment to test the resilience and robustness of a system. It is a significant practice to identify potential weaknesses and improve the overall reliability of a system.

Several tools are available in the market, each with its own strengths in terms of features, supported attacks, automation support, environment support, usability, enterprise support, etc. Based on these strengths, you can select a tool (or a combination of tools) that helps you test resiliency in depth at each level of the environment. You can learn more about these tools here.

As you begin with chaos testing in production, we advise you to prepare well in advance to run the tests successfully. Below is a basic checklist that will help you prepare from an operational standpoint:

Notify the relevant teams: monitoring, infrastructure, and recovery/incident response.
Ensure recovery SOP is in place and is up to date.
Authorize a team member to run the test.
Validate the test in a non-production environment repeatedly.
Schedule time for the test, preferably during non-peak hours.
Conduct impact analysis and ensure the blast radius is within the acceptable risk threshold.
Ensure that operational teams are ready in case of any unintended outage.
Ensure that SLIs/SLOs are enabled and constantly monitored.
Ensure that the system is restored to its steady state after the test.
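One way to keep this checklist from being skipped under time pressure is to encode it as a gate that must pass before anyone runs the test. The sketch below is only an illustration: the Preflight class, its field names, and the percentage-based blast radius are assumptions of ours, not part of any chaos tool.

```python
from dataclasses import dataclass, replace


@dataclass
class Preflight:
    """Answers to the operational checklist, collected before a production run."""
    teams_notified: bool
    recovery_sop_current: bool
    operator_authorized: bool
    validated_in_non_prod: bool
    off_peak_window: bool
    ops_on_standby: bool
    slo_monitoring_enabled: bool
    blast_radius_pct: float      # estimated share of users or hosts affected
    max_blast_radius_pct: float  # agreed risk threshold


def blocking_items(p):
    """Return the checklist items that still block the experiment."""
    checks = {
        "relevant teams notified": p.teams_notified,
        "recovery SOP up to date": p.recovery_sop_current,
        "operator authorized": p.operator_authorized,
        "validated in non-production": p.validated_in_non_prod,
        "scheduled in off-peak window": p.off_peak_window,
        "operational teams on standby": p.ops_on_standby,
        "SLIs/SLOs monitored": p.slo_monitoring_enabled,
        "blast radius within threshold": p.blast_radius_pct <= p.max_blast_radius_pct,
    }
    return [name for name, ok in checks.items() if not ok]


ready = Preflight(
    teams_notified=True, recovery_sop_current=True, operator_authorized=True,
    validated_in_non_prod=True, off_peak_window=True, ops_on_standby=True,
    slo_monitoring_enabled=True, blast_radius_pct=5.0, max_blast_radius_pct=10.0,
)
# Same plan, but nobody was told and the blast radius grew past the threshold.
risky = replace(ready, teams_notified=False, blast_radius_pct=25.0)

print(blocking_items(ready))
print(blocking_items(risky))
```

An empty list means the run may proceed; anything else names exactly what is still missing.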

You must also consider the following steps when doing chaos testing in production:

Plan: Before conducting chaos testing in production, start by determining the scope as well as goals. You must also define the scenarios and parameters to test the application in this stage.
Start small: Begin by testing one component or service at a time, and gradually increase the scope and complexity of the tests. This will help minimize the impact of any unexpected failures and allow you to fine-tune the testing approach before scaling up.
Monitor and measure: Use monitoring tools to track the performance and behavior of the system during testing and measure the impact of the failures on key performance indicators (KPIs) such as response time, throughput, and error rates. This will help you identify any bottlenecks or issues that may arise and determine the overall effectiveness of the testing.
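To illustrate the monitor-and-measure step, the sketch below summarizes request samples into the three KPIs mentioned above and compares a baseline window with a window recorded during fault injection. The Sample class, the summarize and delta helpers, and the numbers are all made up for illustration; in practice these figures would come from your monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    latency_ms: float
    ok: bool


def summarize(samples, window_seconds):
    """Reduce raw request samples to response time, throughput, and error rate."""
    if not samples or window_seconds <= 0:
        raise ValueError("need at least one sample and a positive window")
    latencies = sorted(s.latency_ms for s in samples)
    # Nearest-rank 95th percentile, using integer math for ceil(0.95 * n).
    rank = (95 * len(latencies) + 99) // 100
    errors = sum(1 for s in samples if not s.ok)
    return {
        "p95_latency_ms": latencies[rank - 1],
        "throughput_rps": len(samples) / window_seconds,
        "error_rate": errors / len(samples),
    }


def delta(baseline, during):
    """Show how far each KPI moved while the fault was active."""
    return {name: during[name] - baseline[name] for name in baseline}


# Made-up data: 20 requests in a 10-second window, before and during the fault.
baseline_samples = [Sample(100.0, True) for _ in range(20)]
chaos_samples = ([Sample(100.0, True) for _ in range(16)]
                 + [Sample(900.0, False) for _ in range(4)])

baseline = summarize(baseline_samples, window_seconds=10)
during = summarize(chaos_samples, window_seconds=10)
print(delta(baseline, during))
```

Comparing against a baseline, rather than looking at absolute numbers alone, makes it easier to attribute a change to the injected failure instead of to normal traffic variation.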

Learn more about best practices to follow for chaos testing in production.

Key imperatives for running chaos experiments

We recommend the following measures to keep in mind while running chaos experiments:

Clearly define the prerequisites: Start by defining the prerequisites, such as which ports need to be open, whether connectivity is available, which credentials are required, etc.
Blast radius: Start with a small blast radius.
Ensure observability: Make sure that you set observability metrics in place to measure the success and failure of the experiments.
Write the probes proficiently: A probe is a way of observing a particular set of conditions in the system that is undergoing experimentation. There are three primary use cases of the probes:
a. Start of test: Before starting, check whether the service is healthy. If it is not, there is no point in injecting a fault: a component that is already down cannot be made any more unavailable.
b. End of test: This probe is just as important, because a fault is injected only for a certain duration, and you want to confirm that the service is back online once that duration has passed.
c. During a chaos test: This probe collects information from the system while the experiment is running.
Ensure load generation: Load testing in lower environments confirms that the system can handle the required number of users and still operate at a high level of performance. It also shows how much traffic the application can survive.
Log your applications: Without logging, it becomes impossible to monitor issues during an experiment and gauge their impact. Logging allows Google’s ‘four golden signals’ to be continuously monitored:
a. Latency, or the time taken to service a request.
b. Errors, or the rate of requests that fail explicitly.
c. Saturation, or how full the most constrained resources of your system are.
d. Traffic, or the demand being placed on the system.
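The three probe use cases described above can be sketched as a small experiment runner. This is a simplified illustration rather than the API of any specific chaos tool: run_experiment, FakeService, and the inject/rollback callables are hypothetical names, and the "fault" simply flips a flag on an in-memory object.

```python
class FakeService:
    """Hypothetical in-memory stand-in for a real service."""

    def __init__(self):
        self.healthy = True

    def is_healthy(self):
        return self.healthy


def run_experiment(is_healthy, inject, rollback, observe, ticks):
    """Run one chaos experiment guarded by start, during, and end probes."""
    # Start-of-test probe: do not inject a fault into an unhealthy target.
    if not is_healthy():
        return {"status": "skipped", "observations": []}

    observations = []
    inject()
    try:
        # During-test probe: collect information while the fault is active.
        for _ in range(ticks):
            observations.append(observe())
    finally:
        # Always remove the fault, even if a probe raises an error.
        rollback()

    # End-of-test probe: confirm that the service is back online.
    status = "recovered" if is_healthy() else "not_recovered"
    return {"status": status, "observations": observations}


service = FakeService()


def inject():
    service.healthy = False


def rollback():
    service.healthy = True


result = run_experiment(service.is_healthy, inject, rollback,
                        observe=service.is_healthy, ticks=3)
print(result["status"], result["observations"])
```

Putting the rollback in a finally block is a deliberate choice in this sketch: the fault is removed even if a probe fails midway, which keeps the blast radius small.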

Learn the step-by-step approach to perform resilience testing using chaos engineering.

Our recommendation

Chaos testing in production is a new paradigm for shift right in software testing. We recommend performing chaos tests during non-peak hours and testing one component or service at a time. You can also start with GameDays to test the resilience of systems by putting them under stress through regular simulations.

And while you are at it, make sure that you monitor continuously by keeping an eye on critical user performance metrics. This will help not only in running the tests successfully but also in ensuring that the user experience is not impacted.

Article 13 Jun 2023 7 min read