Resilience Engineering: Frequently Asked Questions

All you need to know about resilience engineering – The what, why, when, and how

Imagine you are in the middle of an online transaction, such as booking a flight ticket or withdrawing cash at an ATM, when the network drops or latency crosses a threshold and you are unable to complete the transaction.

Failures are inevitable; what differentiates a system is how quickly it recovers from them. That is where resilience engineering comes in.

Let’s explore some of the most common questions and misconceptions around resilience engineering.

What is Resilience Engineering?

Keeping Your Software Up and Running, No Matter What!

Resilience engineering is the discipline of building software that recovers quickly from unexpected conditions, so that the user experience remains uninterrupted and the business continues to receive an acceptable level of service. It is an approach to:

Build resilient systems by identifying and fixing failure modes before they can cause any real damage to the system.
Develop resilient software that can handle a massive scale of online transactions reliably, delivering a consistent user experience.

To achieve the above, teams need to focus on technology areas like building well-architected frameworks with resiliency patterns, monitoring systems with observability and APM tools, testing resiliency and reliability using chaos engineering, maintaining systems through SRE, automating using DevOps, and testing performance and security.

Who needs resilience engineering?

Every company these days is either technology-led or builds new technology.

Hence, any organization, big or small, that runs mission-critical software or software that provides digital convenience needs resilience engineering to make its applications reliable and resilient. This, in turn, helps deliver an improved customer experience and protect brand reputation.

Why do organizations need resilience engineering?

Because with resilience engineering, you can not only prevent things from going wrong but also ensure that things go right. The goal of resilience engineering is to develop systems that can effectively respond to and recover from any unexpected disruptions.

You require resilience engineering to:

Move your systems toward five-nines (99.999%) availability.
Reduce the cost and time of recovery after an unforeseen event.
Continuously monitor and observe your systems.
Improve incident management.
Deliver a consistent user experience and strengthen your brand image.
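
To make the availability target concrete, the short calculation below converts an availability percentage into a yearly downtime budget. It is only an illustrative sketch: the function name `allowed_downtime_minutes` is hypothetical, and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


# Compare three-, four-, and five-nines targets.
for target in (99.9, 99.99, 99.999):
    print(f"{target}% -> {allowed_downtime_minutes(target):.2f} minutes per year")
```

Under these assumptions, five nines leaves a budget of roughly 5.26 minutes of downtime per year, which is why recovery usually has to be automated rather than manual.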

How do resilience strategies help combat some of the system failures?

Let’s discuss some examples of system failures and resilience strategies suitable for implementation:

Failure of a hardware component such as a server, storage, or network device.

Build redundancy into the applications by deploying high-availability infrastructure across different zones/regions.

A sudden spike in incoming requests causes longer delays and dropped transactions, preventing the application from servicing requests.

Load balance across instances to handle spikes in usage. Monitor load balancer performance and load factor to fine-tune or add capacity.

Requests between various components fail intermittently. End user requests will fail when this isn’t handled in application design.

Retry transient failures. Many client software development kits (SDKs) implement automatic retries in a way that’s transparent to the caller.
Example: a dropped network connection, or a database connection drop or timeout.
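
Where an SDK does not retry for you, the pattern can be sketched by hand. The example below is a minimal illustration, not a reference implementation: `TransientError`, `call_with_retry`, and `flaky_query` are hypothetical names, and the retry counts and delays are arbitrary assumptions.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. a dropped connection)."""


def call_with_retry(operation, max_attempts=4, base_delay=0.1, sleep=time.sleep):
    """Call operation(), retrying TransientError with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # retries exhausted: surface the failure to the caller
            # The delay doubles on each attempt; jitter spreads out simultaneous retries.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay))


# Usage: a simulated query that fails twice before succeeding.
attempts = {"count": 0}


def flaky_query():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientError("connection dropped")
    return "row data"


# The sleep function is replaced with a no-op so the example runs instantly.
result = call_with_retry(flaky_query, sleep=lambda seconds: None)
```

Only errors known to be transient are retried here; retrying every exception could repeat operations that are not safe to repeat.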

A service that the application depends on is not functioning correctly.

Degrade gracefully if a service fails without a failover path, providing an acceptable user experience and avoiding cascading failures.
Example: Display an error message or queue the requests. Other functions of the application should continue to operate.
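
As a rough sketch of graceful degradation, the example below queues an order and returns a friendly message when a dependency is down, instead of letting the error propagate. All names here (`ServiceUnavailable`, `submit_order`, `pending_requests`, `payment_service_down`) are hypothetical and chosen only for illustration.

```python
from collections import deque


class ServiceUnavailable(Exception):
    """Hypothetical error raised when a downstream dependency is down."""


pending_requests = deque()  # requests held for later replay


def submit_order(order, payment_service):
    """Process an order, degrading to 'queued' if the payment service is unavailable."""
    try:
        receipt = payment_service(order)
        return {"status": "confirmed", "receipt": receipt}
    except ServiceUnavailable:
        # Keep the request for later and tell the user what happened,
        # so the rest of the application keeps working.
        pending_requests.append(order)
        return {"status": "queued", "message": "Payment is temporarily unavailable."}


# Usage: simulate an outage of the payment dependency.
def payment_service_down(order):
    raise ServiceUnavailable("payment gateway unreachable")


response = submit_order({"id": 42, "amount": 19.99}, payment_service_down)
```

A real system would also need a worker that replays the queued requests once the dependency recovers; that part is omitted here.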

A user mistakenly deletes critical data or data has been corrupted due to unforeseen reasons.

Back up data as frequently as business needs require so it can be restored after any deletion or corruption. Know how long it would take to restore a backup.
Regular backups also help you comply with regulatory guidelines for operational resilience.

A failure while updating a production application deployment.

Automate deployments with a rollback plan.

Data center (DC) failure due to a power grid outage.

Build redundancy into the applications with a secondary data center.

Which framework can help implement resilience engineering?

Nagarro’s Resilience Engineering Framework (NREF) helps teams learn, measure, and build reliable solutions that deliver a consistent experience. It is intended to support large daily transaction volumes with system availability of up to four or five nines (99.99% to 99.999%).

What is the ideal approach to resilience testing?

Resiliency testing verifies that a system can absorb impacts and recover quickly while maintaining acceptable service for users. Teams must understand the architecture, design, and infrastructure of their systems and build test strategies by:

Conducting failure mode analysis

Identify all potential failures and timeouts at every point in the system and validate how each one is handled.

Validating application and data resiliency

Have an SOP (Standard Operating Procedure) to validate application and data availability if the host system breaks down.

Configuring and testing health probes

Design and test health probes for load balancing and traffic management.
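
A health probe is usually a lightweight check that a load balancer or traffic manager polls to decide whether an instance should receive traffic. The sketch below shows one possible shape for the logic behind such a probe; the `health_check` function and the dependency names are hypothetical, and the HTTP wiring is left out.

```python
def health_check(checks):
    """Run each named dependency check and return an HTTP-style (status, body) pair."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "failing"
        except Exception as exc:  # a crashing check counts as unhealthy
            results[name] = f"error: {exc}"
    healthy = all(state == "ok" for state in results.values())
    # 200 tells the load balancer to keep routing traffic; 503 tells it to stop.
    return (200 if healthy else 503), results


# Usage: one healthy and one failing dependency.
status, body = health_check({
    "database": lambda: True,
    "cache": lambda: False,
})
```

Testing the probe itself matters as much as designing it: a probe that always reports healthy hides exactly the failures it is meant to expose.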

Conducting fault injection tests for every application

Examples include deleting data sources, shutting down interfacing systems, exhausting system resources, and deleting certificates.
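
At the code level, a fault injection test can be as simple as wrapping a dependency so that it fails on demand and then checking that the caller copes. The following is a small sketch under that assumption; `InjectedFault`, `with_fault_injection`, `lookup_price`, and `price_or_default` are hypothetical names, and dedicated chaos engineering tools inject faults at the infrastructure level instead.

```python
import random


class InjectedFault(Exception):
    """Raised by the wrapper to simulate a dependency failure."""


def with_fault_injection(func, failure_rate, rng=random.random):
    """Wrap func so that a fraction of calls raise InjectedFault instead of running."""
    def wrapper(*args, **kwargs):
        if rng() < failure_rate:
            raise InjectedFault(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)
    return wrapper


def lookup_price(item):
    """Stand-in for a call to a real dependency."""
    return {"apple": 1.25}.get(item)


def price_or_default(lookup, item, default=0.0):
    """Code under test: it should survive a failing lookup."""
    try:
        return lookup(item)
    except InjectedFault:
        return default


# failure_rate=1.0 fails every call; failure_rate=0.0 never fails.
always_failing = with_fault_injection(lookup_price, failure_rate=1.0)
never_failing = with_fault_injection(lookup_price, failure_rate=0.0)
```

With intermediate failure rates, the same wrapper can approximate intermittent faults.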

Validating network availability

Ensure that no data is lost due to network latency.

Carrying out critical tests in production

Plan critical tests and automate roll-forward/rollback for production code.

How to test the resiliency of systems across the enterprise?

An ideal approach to resilience engineering

To help businesses expedite their journey toward resilient and reliable applications, we have built a ‘Continuous Resiliency Testing (CRT) Accelerator’ in line with Nagarro’s Resilience Engineering Framework (NREF). It is compatible with the leading chaos engineering tools, such as LitmusChaos, Gremlin, and the Harness Chaos Engineering Tool, as well as other platforms (observability, code repository, and others), and helps automate resiliency and reliability testing directly via your CI/CD pipeline.

With this accelerator:

Teams can execute resiliency testing in every stage of the development cycle.

Engineers can automate reliability testing by integrating with the CI/CD pipeline, making it easy to design and execute more complex custom tests for specific chaos testing use cases.

How to test implementations designed for resiliency and reliability?

Chaos engineering is an approach to testing a system's resilience by injecting faults. It helps identify potential failures and fix them before they cause an outage or disruption.

How is chaos engineering different from resilience engineering?

Chaos engineering is an approach to test system resiliency. While the objective of resilience engineering is to design systems to adapt in the event of failure, chaos engineering is a way to test the resiliency of the system by injecting faults into the system proactively.

How to choose a chaos engineering tool?

There are several chaos engineering tools available, and each has its own strengths in terms of features, available attacks, automation support, environment support, usability, enterprise support, and so on. Based on these strengths, you can select a tool that lets you test resiliency in depth at each level of your environment.

When to run chaos engineering tests?

Chaos engineering tests are run to proactively identify and address any vulnerabilities and weaknesses in the system's resilience. Chaos engineering tests can be run:

During development and testing: Early in the development lifecycle, you can start by running chaos experiments in non-production environments, such as staging or testing environments. This helps identify and address issues before they reach production.
As part of Continuous Integration/Continuous Deployment (CI/CD) pipelines: Integrate chaos engineering into your CI/CD pipelines to automatically run experiments whenever there are changes to the application code or infrastructure.
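
One way a pipeline stage might act on the outcome of a chaos experiment is to compare the error rate observed during the experiment with an agreed budget and fail the stage when it is exceeded. The sketch below assumes the request counts come from your observability tooling; the function name `pipeline_exit_code` and the 1% default budget are illustrative assumptions, not part of any specific tool.

```python
def pipeline_exit_code(total_requests, failed_requests, max_error_rate=0.01):
    """Return 0 (pass) if the observed error rate is within budget, else 1 (fail)."""
    if total_requests == 0:
        # No traffic was observed, so the experiment is inconclusive: fail the stage.
        return 1
    error_rate = failed_requests / total_requests
    return 0 if error_rate <= max_error_rate else 1


# Usage: 5 failures in 1,000 requests is within a 1% budget; 50 failures is not.
within_budget = pipeline_exit_code(1000, 5)
over_budget = pipeline_exit_code(1000, 50)
```

A CI job could pass this value to `sys.exit()` so that a non-zero result blocks the deployment.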

Business Continuity, Disaster Recovery, and Resilience Engineering - What’s the difference?

Business Continuity, Disaster Recovery, and Resilience Engineering are related concepts but serve distinct purposes and have different focuses within the broader context of ensuring the continuity of business operations in the face of disruptions.

Business Continuity

Business continuity focuses on maintaining essential business functions and services during and after a disruption. It includes making proactive and reactive plans to help businesses avoid crises and return to 'business as usual' should an unplanned event such as a security breach, natural disaster, or power outage occur.

Disaster Recovery (DR)

Disaster recovery focuses specifically on the IT infrastructure and data recovery aspects of business operations. Common practices include data backup, data replication, system failover procedures, and the testing of recovery plans.
