
Peeyush Girdhar
Chaos engineering is the practice of deliberately creating chaos within a system at different levels to test the resiliency of the complete stack, thereby identifying weaknesses in it. With the rapid adoption of cloud providers and their attendant services, microservices, and other tools and technologies, there is an increased need to test the resiliency of these systems so that the application remains uninterrupted. Even when the cloud, tooling, or application suffers a disruption, the system needs to prevent a complete blackout of the application.
Ideally, testing the resiliency of the system should be part of the design itself. Once integrated with design, it should be automated within the DevOps framework to continuously check for resiliency loopholes and continuously fix them. (Refer to our blog on chaos engineering orchestration with DevOps for more details.)
Chaos engineering orchestration with DevOps
To execute the resilience strategy discussed above, DevOps tools and technologies can help automate the complete lifecycle of setting up the chaos components and infrastructure. Terraform can provision the chaos infrastructure, and configuration management tools such as Ansible can create chaos on the fly. The complete process, including reporting results to a wide audience, can be integrated centrally with Jenkins to ensure smooth, continuous execution within CI/CD pipelines.
Continuous chaos then becomes part of the CI/CD pipeline: the resilience strategy is executed against an environment during deployment itself, so that environment issues can be found and rectified at that point.
Multiple chaos engineering tools are available, and each has its own strengths with respect to features, attacks, automation support, environment support, usability, enterprise support, and so on. Based on these strengths, we should select a tool that serves our purpose of testing resiliency in depth at each level of the environment.
Here we look at the following chaos engineering tools, their practical usability, and the environments they support.
1. Simian Army
- Chaos Monkey
- Janitor Monkey (Now Swabbie)
- Conformity Monkey
- Security Monkey
2. Gremlin
3. Chaos Blade
4. Chaos Mesh
5. LitmusChaos
6. Custom solutions
1. Simian Army:
Simian Army is a suite of chaos engineering tools designed to induce different types of attacks on environments. Netflix initially developed ten of these tools, but some of them have since been deprecated. Here we discuss the Simian Army tools that are currently available and their respective functionalities.
a) Chaos Monkey:
- Developed by: Netflix
- Platform(s): Spinnaker
Chaos Monkey was the first chaos engineering tool Netflix developed when it moved from on-premises servers to the AWS cloud. Netflix built Chaos Monkey to prevent application disruption when AWS EC2 instances are terminated or accidentally stopped. This helped the team proactively check for failures and bottlenecks in the system and mitigate them before they impacted the production environment. It is an open-source tool with no licensing cost.
Chaos Monkey requires Spinnaker and MySQL as pre-requisites for installation and can only support deployments that are managed by Spinnaker. By default, Chaos Monkey does not come with any UI support and it relies on Spinnaker for configuring attacks and their frequencies using its UI.
Currently, it supports only one type of attack: randomly terminating instances to test how resilient the system is at the infra level. It works with multiple public clouds, such as AWS, Azure, and GCP.
When to use it:
Chaos Monkey is ideal for testing the high availability of applications at infra level. If you are unsure about how an application will behave with a limited number of instances, try using Chaos Monkey to kill them randomly and check on the application/infra-availability. Chaos Monkey works best with AWS cloud for specifically testing the availability of applications, based on the termination of EC2 instances.
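The idea of killing instances at random and then checking availability can be illustrated with a small simulation. This is only a sketch under stated assumptions: it does not call Chaos Monkey, Spinnaker, or any cloud API, and the `InstanceGroup` class, the `min_healthy` threshold, and the instance IDs are hypothetical names invented for the example.

```python
import random
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass
class InstanceGroup:
    """Hypothetical group of instances serving one application."""
    name: str
    instances: Set[str]
    min_healthy: int  # assumed minimum instance count needed to keep serving traffic
    terminated: List[str] = field(default_factory=list)

    def is_available(self) -> bool:
        # Simplified availability check: enough instances are still running.
        return len(self.instances) >= self.min_healthy


def terminate_random_instance(group: InstanceGroup, rng: random.Random) -> Optional[str]:
    """Remove one randomly chosen instance, mimicking a random termination."""
    if not group.instances:
        return None
    victim = rng.choice(sorted(group.instances))  # sorted so a fixed seed is reproducible
    group.instances.remove(victim)
    group.terminated.append(victim)
    return victim


def run_experiment(group: InstanceGroup, kills: int, seed: int = 0) -> List[Tuple[str, bool]]:
    """Terminate up to `kills` instances and record availability after each one."""
    rng = random.Random(seed)
    results = []
    for _ in range(kills):
        victim = terminate_random_instance(group, rng)
        if victim is None:
            break
        results.append((victim, group.is_available()))
    return results


if __name__ == "__main__":
    web = InstanceGroup("web", {"i-1", "i-2", "i-3", "i-4"}, min_healthy=2)
    for victim, available in run_experiment(web, kills=3):
        print(f"terminated {victim}: available={available}")
```

In a real environment the availability check would be a health probe against the application rather than a simple instance count.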
Pros:
- Open-source tool with no license cost
- Easy to configure
Cons:
- Requires Spinnaker & MySQL
- Supports only one type of attack
- No reporting capabilities
b) Swabbie (Formerly Janitor Monkey):
- Developed by: Netflix
- Platform(s): AWS, GCP, and other clouds
Swabbie, or as it was then known, Janitor Monkey, was initially introduced for cleaning up unused resources from AWS, thereby saving the cost of those unused resources. Swabbie supports other cloud providers and their respective resources. Cleanup can be scheduled based on the criticality and usability of the application. With the evolution of Janitor Monkey into Swabbie, it now supports applications deployed using Spinnaker.
Swabbie is an open-source tool with configurable cleanup rules, which can be tuned further to maximize cost savings.
Swabbie requires Spinnaker as a prerequisite and works on the rule of “Mark, Notify, Delete”: it first marks unused resources, then notifies resource owners about them, and finally deletes them. It also allows resource owners to define exceptions for any of the resources. Resource owners receive a notification a day before cleanup, which gives them time to set an exception flag that saves the resource from cleanup.
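The “Mark, Notify, Delete” flow can be sketched as three small functions. This is an illustrative model only, not Swabbie's actual code or configuration format: the `Resource` fields, the 30-day idle threshold, and the function names are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Dict, List, Optional, Tuple


@dataclass
class Resource:
    resource_id: str
    owner: str
    last_used: date
    marked_on: Optional[date] = None  # set when the resource is marked as unused
    exempt: bool = False              # exception flag set by the resource owner


def mark(resources: List[Resource], today: date, unused_after_days: int) -> List[Resource]:
    """Mark resources idle for longer than the threshold; return the newly marked ones."""
    newly_marked = []
    for r in resources:
        idle_days = (today - r.last_used).days
        if idle_days > unused_after_days and r.marked_on is None and not r.exempt:
            r.marked_on = today
            newly_marked.append(r)
    return newly_marked


def notify(marked: List[Resource]) -> Dict[str, List[str]]:
    """Group marked resource IDs by owner; a real system would send these out."""
    messages: Dict[str, List[str]] = {}
    for r in marked:
        messages.setdefault(r.owner, []).append(r.resource_id)
    return messages


def delete(resources: List[Resource], today: date, grace_days: int) -> Tuple[List[Resource], List[Resource]]:
    """Split resources into (kept, deleted); exempt resources are always kept."""
    kept, deleted = [], []
    for r in resources:
        due = r.marked_on is not None and (today - r.marked_on).days >= grace_days
        (deleted if due and not r.exempt else kept).append(r)
    return kept, deleted


if __name__ == "__main__":
    today = date(2024, 1, 10)
    inventory = [
        Resource("vol-a", "alice", date(2023, 12, 1)),
        Resource("vol-b", "bob", date(2023, 12, 1)),
        Resource("vol-c", "alice", date(2024, 1, 9)),
    ]
    print(notify(mark(inventory, today, unused_after_days=30)))
    inventory[1].exempt = True  # bob sets the exception flag after being notified
    kept, deleted = delete(inventory, today + timedelta(days=1), grace_days=1)
    print([r.resource_id for r in deleted])
```

The one-day grace period mirrors the notification window described above; the owner's exception flag is checked again at deletion time so that a late opt-out is still honored.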
When to use it:
Swabbie is ideal for creating custom logic, or using the out-of-the-box logic, to reduce the cost of the complete environment by continuously monitoring resource usage, notifying resource owners of the status of their resources, and deleting unused resources automatically. Though Swabbie works with multiple public cloud providers, it provides greater functionality with AWS cloud.
Pros:
- Open-source tool with no license cost
- Helps keep costs low
Cons:
- Requires Spinnaker
- Does not support any type of infra/application attacks
- No reporting capabilities
c) Conformity Monkey:
- Developed by: Netflix
- Platform(s): AWS, GCP, and other clouds
Conformity Monkey is an open-source tool that helps keep cloud resources compliant with a pre-defined rule set of best practices. It also extends its functionalities to other cloud providers and their respective services.
Conformity Monkey can flag many deviations from best practices, such as:
- Instances older than a certain age threshold.
- Clustered instances that are not contained in required security groups.
- Auto-scaling groups and their associated instances that do not conform to specified rules.
Similar to Swabbie, Conformity Monkey works on the rule of “Mark and Notify.” So, for autoscaling groups, Conformity Monkey loops over the different autoscaling groups and applies conformity rules. If any instance within an autoscaling group does not conform to those rules, a notification is sent to the autoscaling group owner (defined initially for each autoscaling group). Group owners then need to rectify the issues and act accordingly or mark it as an exception.
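The loop described above (iterate over autoscaling groups, apply each rule to each instance, notify the group owner of violations) can be sketched roughly as follows. The data classes, rule names, and thresholds here are hypothetical and chosen only for illustration; they are not Conformity Monkey's real rule API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set


@dataclass
class Instance:
    instance_id: str
    age_days: int
    security_groups: Set[str]


@dataclass
class AutoScalingGroup:
    name: str
    owner: str  # owner defined up front for each autoscaling group
    instances: List[Instance]


# A rule returns True when the instance conforms.
Rule = Callable[[Instance], bool]


def max_age_rule(max_days: int) -> Rule:
    """Instances older than `max_days` do not conform."""
    return lambda inst: inst.age_days <= max_days


def required_security_groups_rule(required: Set[str]) -> Rule:
    """Instances missing any required security group do not conform."""
    return lambda inst: required <= inst.security_groups


def check_conformity(groups: List[AutoScalingGroup], rules: Dict[str, Rule]) -> Dict[str, List[str]]:
    """Return a mapping of owner -> violation messages (the 'notify' step)."""
    notifications: Dict[str, List[str]] = {}
    for group in groups:
        for inst in group.instances:
            for rule_name, rule in rules.items():
                if not rule(inst):
                    message = f"{group.name}/{inst.instance_id}: {rule_name}"
                    notifications.setdefault(group.owner, []).append(message)
    return notifications


if __name__ == "__main__":
    groups = [
        AutoScalingGroup("asg-web", "alice", [
            Instance("i-1", 10, {"sg-base", "sg-web"}),
            Instance("i-2", 400, {"sg-web"}),
        ]),
        AutoScalingGroup("asg-db", "bob", [Instance("i-3", 5, {"sg-base"})]),
    ]
    rules = {
        "max-age": max_age_rule(365),
        "required-sg": required_security_groups_rule({"sg-base"}),
    }
    for owner, messages in check_conformity(groups, rules).items():
        print(owner, messages)
```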
When to use it:
If a client needs an environment to follow specific compliance rules, Conformity Monkey is ideal for validating the environment against that rule set and those best practices, so that the environment can be modified accordingly. Like other Simian Army tools, Conformity Monkey provides greater functionality when using AWS cloud and more limited functionality for other clouds (such as Azure and GCP).
Pros:
- Ensures consistency across environments and prevents security vulnerabilities
- Open-source with no license cost
- Ensures best practices are always followed during deployment
Cons:
- Does not support any type of infra/application attacks
- Limited automation functionality
- No reporting capabilities
d) Security Monkey:
- Developed by: Netflix
- Platform(s): AWS, GCP, and other clouds
Security Monkey is an open-source chaos engineering tool that monitors cloud accounts for violations, policy changes, security vulnerabilities, etc., and notifies about any such issues. It also tracks the state of the cloud resources and shows what exactly has been changed since the last check. It can search multiple cloud accounts at a time and provide results for each.
Security Monkey can be extended with custom auditors, custom watchers, and custom alerts. It also provides a UI through which users can make configuration changes and view security outcomes.
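The "what has changed since the last check" behavior amounts to comparing two snapshots of resource state. A minimal sketch of that comparison is shown below; the snapshot layout (a dictionary keyed by resource name) and the security-group examples are assumptions for illustration, not Security Monkey's internal data model.

```python
from typing import Any, Dict, List


def diff_snapshots(previous: Dict[str, Any], current: Dict[str, Any]) -> Dict[str, List[str]]:
    """Compare two resource snapshots and list added, removed, and changed keys."""
    prev_keys, curr_keys = set(previous), set(current)
    return {
        "added": sorted(curr_keys - prev_keys),
        "removed": sorted(prev_keys - curr_keys),
        "changed": sorted(k for k in prev_keys & curr_keys if previous[k] != current[k]),
    }


if __name__ == "__main__":
    last_check = {"sg-web": {"ports": [443]}, "sg-db": {"ports": [5432]}}
    this_check = {"sg-web": {"ports": [443, 22]}, "sg-cache": {"ports": [6379]}}
    print(diff_snapshots(last_check, this_check))
```

A custom auditor could then inspect only the `changed` and `added` entries, for example to flag a newly opened port.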
When to use it:
As the name suggests, Security Monkey is ideal for monitoring and notifying of any security vulnerabilities at different levels of environments. Security Monkey works best with AWS and GCP clouds (limited to some services), so it helps teams mitigate vulnerabilities within AWS and GCP cloud accounts, making them security resilient.
Pros:
- Supports most of the common cloud providers
- Open-source tool with no license cost
- Helps mitigate different levels of security issues
- Comes with UI to configure various functionalities
Cons:
- Minimal automation support
- Does not support any type of infra/application attacks
- No reporting capabilities
2. Gremlin:
- Developed by: Gremlin, Inc. (co-founded by Kolton Andrus and Matthew Fornaciari)
- Platform(s): Cloud, VMs (Virtual Machines), Bare Metal Servers, Kubernetes, container, etc.
Gremlin is a SaaS (Software-as-a-Service) based chaos engineering tool, sometimes referred to as CaaS (Chaos-as-a-Service). It provides three sets of attacks, covering system resources, state, and network, to test the resiliency of systems. It is the user’s responsibility to determine which attack best suits the environment and how multiple tests can be combined to test the resiliency of the complete environment. Gremlin offers a limited set of attacks on free-tier accounts, but accessing its full range of capabilities carries a licensing cost.
As a prerequisite, Gremlin requires installing a proprietary agent on the servers, containers, or pods of the environment. Its resource, state, and network attacks can be combined to test specific resilience scenarios.
Gremlin supports servers (cloud or bare metal), containers, and Kubernetes clusters, and offers three modes of interaction: UI, API, and command line.
When to use it:
Gremlin combines many features of the Simian Army tools into a common platform for testing system resilience. It provides strong automation support through its CLI, UI, and API, so all attacks can be automated within CI/CD itself. Gremlin is a good fit for SaaS implementations where teams must evaluate resiliency on a variety of parameters, and it integrates well with multiple public clouds and Kubernetes clusters.
Pros:
- Comes with many built-in attacks related to network, resource, cluster, container, etc.
- UI provides all configuration levels
- Detailed documentation
- Automation support
Cons:
- Includes a license cost to access full functionality
- Non-customizable
- Limited reporting capabilities
3. Chaos Blade
- Developed by: Alibaba
- Platform(s): Docker, Kubernetes, Bare Metal, cloud platforms
Chaos Blade is an open-source chaos engineering tool designed to improve the fault tolerance of systems and ensure business continuity. It provides support for the following levels of attack:
a) Resource Level (Cloud, VM, Docker): CPU, Memory, Network, Disk, etc.
b) C++ Application Level: Specifying arbitrary methods, code injection delays, tampering with variable and return values.
c) Java Application Level: Specifying class methods to inject various complex experimental scenarios.
Chaos Blade comes with different sets of operators to create chaos attacks at different levels of the environment (e.g., cloud, containers, OS, Java, etc.). Installing these operators is a prerequisite for creating any chaos within the system.
Chaos Blade covers a vast number of attacks and platforms, but it lacks reporting and the ability to schedule those attacks. It does provide documentation, but only in standard Chinese.
When to use it:
With Chaos Blade, resilience can be tested at the infra and code levels. As a result, Chaos Blade is the ideal tool for teams looking to test resilience at the code level and for those looking to check code maturity through application fault injection, and thereby check the complete system’s resiliency.
Pros:
- Support for application/code-level attacks
- Quick and easy setup
- Includes a wide variety of attacks for Kubernetes
- Automation support
Cons:
- No UI Support
- Not customizable
- Only provides documentation in Chinese
4. Chaos Mesh
- Developed by: PingCAP
- Platform(s): Kubernetes
Chaos Mesh is an open-source, Kubernetes-native chaos engineering tool designed to test resiliency with different levels of attacks. It also provides a UI to perform those attacks and to check the blast radius through some of its configuration settings.
Chaos Mesh provides support for attacks such as network latency, system time manipulation, kernel panics, Disk I/O, and others.
When to use it:
Chaos Mesh is ideal for teams looking for fine-grained attacks with respect to Kubernetes components and for those looking to test resilience at each of those component levels. Chaos Mesh also allows teams to tune the blast radius based on Kubernetes selectors and labels.
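Tuning the blast radius with selectors and labels comes down to choosing which pods an attack may touch. The sketch below models Kubernetes-style equality-based label matching in plain Python; it does not use the Chaos Mesh or Kubernetes APIs, and the `max_targets` cap and pod names are hypothetical additions for the example.

```python
from typing import Dict, List, Optional

Labels = Dict[str, str]


def matches(labels: Labels, selector: Labels) -> bool:
    """A pod matches when every key/value pair in the selector is present in its labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def blast_radius(pods: Dict[str, Labels], selector: Labels,
                 max_targets: Optional[int] = None) -> List[str]:
    """Return the pod names an attack would affect, optionally capped at `max_targets`."""
    targets = sorted(name for name, labels in pods.items() if matches(labels, selector))
    return targets if max_targets is None else targets[:max_targets]


if __name__ == "__main__":
    pods = {
        "web-1": {"app": "web", "tier": "frontend"},
        "web-2": {"app": "web", "tier": "frontend"},
        "db-1": {"app": "db", "tier": "backend"},
    }
    print(blast_radius(pods, {"app": "web"}))
    print(blast_radius(pods, {"app": "web"}, max_targets=1))
```

Note that an empty selector matches every pod, which is why narrowing the selector (or capping the target count) matters before running an attack.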
Pros:
- UI with many configurations
- Support for a large number of attacks
- Automation support
Cons:
- No ability to schedule attacks
- No support for access controls within the UI
The downside of Chaos Mesh is that it provides very limited/negligible support for testing resiliency of VMs or Bare Metal servers. It also does not provide any mechanism to define the attack duration from within the UI itself.
Tailored Solutions: While most solutions offer a fixed set of attacks with a fixed set of environment support, Nagarro’s in-house chaos framework comes with a large number of attacks and the flexibility to add/modify attacks with respect to environment and applications.
5. LitmusChaos
- Developed by: MayaData
- Platform(s): Kubernetes
LitmusChaos is an open-source chaos engineering platform. It is a Cloud Native Computing Foundation (CNCF) hosted project like Chaos Mesh. It provides several experiments for testing containers, Pods, and nodes. It also provides a centralized public repository of experiments called ChaosHub that anyone can contribute experiments to.
When to use it:
Litmus is a wide-ranging tool that offers several useful attacks and monitoring features. However, it requires a multi-step process that includes setting permissions and annotating deployments to run an experiment. While there are workflows to support the process, especially when used through the Litmus Portal, it is complex in nature. There are also a few features that don’t appear in the documentation and are only available through the project’s GitHub repository.
Pros:
- An extensible platform that integrates with other tools for custom experiments
- Web UI has a dashboard and provides resilience scores as per successful workflows
- ChaosHub hosts several experiments
- Provides automated system health checks with Litmus Probes
Cons:
- Difficult, lengthy process to run experiments
- Permissions are assigned for each chaos experiment, making it difficult to track and manage access
6. Custom solutions
The chaos engineering tools discussed above each come with a specific set of attacks and environment support. But there may be a need to add or modify attacks, based on the application or infrastructure. There also may be a need to combine a set of attacks, to fulfill resilience testing purposes, and further drill down into the overall resilience approach.
For these situations, a custom chaos engineering solution is needed: one that allows not only modifying individual attacks but also combining them.
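One way to picture "modify individual attacks and combine them" is to treat each attack as a small function over the environment state and build scenarios by composition. The sketch below is a generic illustration of that design idea, not the implementation of any tool discussed here; the `Attack` type, the metric names, and the two sample attacks are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Simplified environment state: metric name -> value.
Environment = Dict[str, float]


@dataclass(frozen=True)
class Attack:
    name: str
    apply: Callable[[Environment], Environment]


def cpu_stress(extra_percent: float) -> Attack:
    """Raise CPU usage by `extra_percent`, capped at 100."""
    def apply(env: Environment) -> Environment:
        updated = dict(env)  # copy so the baseline is left untouched
        updated["cpu_percent"] = min(100.0, env["cpu_percent"] + extra_percent)
        return updated
    return Attack(f"cpu-stress(+{extra_percent}%)", apply)


def network_latency(extra_ms: float) -> Attack:
    """Add `extra_ms` of latency."""
    def apply(env: Environment) -> Environment:
        updated = dict(env)
        updated["latency_ms"] = env["latency_ms"] + extra_ms
        return updated
    return Attack(f"network-latency(+{extra_ms}ms)", apply)


def combine(name: str, attacks: List[Attack]) -> Attack:
    """Build a scenario that applies the given attacks in order."""
    def apply(env: Environment) -> Environment:
        for attack in attacks:
            env = attack.apply(env)
        return env
    return Attack(name, apply)


if __name__ == "__main__":
    baseline = {"cpu_percent": 30.0, "latency_ms": 20.0}
    scenario = combine("degraded-node", [cpu_stress(50.0), network_latency(200.0)])
    print(scenario.name, scenario.apply(baseline))
```

Because a combined scenario is itself an `Attack`, scenarios can be nested, and a single attack can be swapped or re-parameterized without touching the others.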
Nagarro’s in-house chaos solution is a framework that provides users with several levels of chaos attacks in a variety of environments. It was developed to include not only generic attacks but also to show architectural issues in a descriptive manner, thereby providing the blast radius for the environment.
- Developed by: Nagarro
- Platform(s): Any
Nagarro’s solution is completely customizable to the infra and application setup. The framework tests resilience with the following chaos attacks/scenarios:
- Instance/VM
- Facility failure (DR)
- Application
- Cluster
- Scaling
- Security
- Decoupling
- High Availability
This solution also provides users with best-in-class reporting. This reporting allows users to get a complete insight into chaos parameters, with respect to environments.
This framework also provides users with descriptive reports, which show current environment issues with blast radius and a detailed summary of all the experiments performed. It also dynamically filters environment components and displays them on the dashboard and in the report itself.
When to use it:
This solution is ideal for all kinds of environments: it provides different sets of attacks for different environments and also allows users to add, modify, and combine attacks to fit specific use cases, making the system resilient at each level. Teams looking for a tool that covers a wide range of attacks with support for Kubernetes, clouds (private/public), and on-prem environments would find this the ideal solution.
Pros:
- Support for a large number of attacks
- Environment-ready attacks for most environments (public cloud, Kubernetes, Containers, On-prem VM, application, etc.)
- Preeminent reporting, providing detailed feedback and improvements
- Completely customizable
- Automation support
- Complete architectural dashboard view
Cons:
- Dependency on Ansible releases
Conclusion
Each chaos engineering tool comes with its own pros and cons, based on the environment, attacks, pricing, and so on. Before selecting a chaos engineering tool, one should focus on what kind of resilience testing is needed within a system. The following approach can be followed to select a specific tool:
High-level comparison of chaos engineering tools based on current functionalities:
| Tools | Environment Support | Attack Types | Can be customized? | Documentation available? | UI/CLI | Automation Support | Enterprise Support | Reporting | Access Control |
|---|---|---|---|---|---|---|---|---|---|
| Simian Army | | | | | | | | | |
| Chaos Monkey | Cloud VMs | Terminate Instances | No | Yes | Only CLI | Limited | No | No | No |
| Janitor Monkey | Public Cloud | Resource Cleanup | No | Yes | Only CLI | Limited | No | No | No |
| Conformity Monkey | Public & Private Cloud | Conformity Validations | No | Yes | Only CLI | Limited | No | No | No |
| Security Monkey | Public & Private Cloud | Security Validations | No | Yes | Both | Limited | No | No | No |
| Gremlin | VMs, Containers, Kubernetes | Resource, State & Network | No | Yes | Both | Yes | Yes | Yes (Basic) | Yes |
| Chaos Blade | Docker, Cloud VM, Bare-metal | Resources, Application | No | Yes (Only Chinese) | Only CLI | Limited | No | No | No |
| Chaos Mesh | Kubernetes | Resources, Network (Limited) | No | Yes | Both | Limited | No | No | Yes |
| LitmusChaos | Kubernetes | Container/Pod attacks | No | Yes | Both | No | No | Yes | Yes |
| Nagarro's Chaos Framework | Any | Resources, Network, Facility, Scalability, Application | Yes | Yes | Both | Yes | Yes | Yes | Yes |
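
As a rough illustration of turning the comparison above into a shortlist, the sketch below encodes three columns of the table (environment support, customization, reporting) for the attack-oriented tools and filters them by requirement. The `Tool` class, the lowercase environment keys, and the `shortlist` function are assumptions made for this example; Gremlin's "Yes (Basic)" reporting is treated as a plain yes.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Tool:
    name: str
    environments: FrozenSet[str]  # "any" means no environment restriction
    customizable: bool
    reporting: bool


# Values transcribed from the comparison table; environment keys are simplified.
TOOLS = [
    Tool("Chaos Monkey", frozenset({"cloud-vm"}), False, False),
    Tool("Gremlin", frozenset({"vm", "container", "kubernetes"}), False, True),
    Tool("Chaos Blade", frozenset({"docker", "cloud-vm", "bare-metal"}), False, False),
    Tool("Chaos Mesh", frozenset({"kubernetes"}), False, False),
    Tool("LitmusChaos", frozenset({"kubernetes"}), False, True),
    Tool("Nagarro's Chaos Framework", frozenset({"any"}), True, True),
]


def shortlist(environment: str, need_reporting: bool = False,
              need_customization: bool = False) -> List[str]:
    """Return the names of tools that satisfy the stated requirements."""
    picks = []
    for tool in TOOLS:
        env_ok = "any" in tool.environments or environment in tool.environments
        reporting_ok = tool.reporting or not need_reporting
        custom_ok = tool.customizable or not need_customization
        if env_ok and reporting_ok and custom_ok:
            picks.append(tool.name)
    return picks


if __name__ == "__main__":
    print(shortlist("kubernetes", need_reporting=True))
```

A filter like this only narrows the field; factors such as pricing and enterprise support still need to be weighed by hand.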
