How to use Chaos Engineering in Microsoft Azure

By Margaret N. Yarbrough Last updated Apr 15, 2022

Complex systems require resilience, and tools like chaos engineering must be used to ensure that resilience. Learn more about Azure Chaos Studio.

Image: Jay Yuno / Getty

Cloud-native applications fit more into the client-server or three-tier category than into the old monoliths. These are now a collection of services designed to combine code and platform tools to manage and control bugs and scale around the world.

This is great for users. Users get applications that are fast, responsive, and accessible from anywhere on any device. But that makes it difficult. Developer An operations team with a complex service web that is difficult to test on a large scale. You can also design for faults and incorporate redundancy into your system, but this complicates the architecture and adds new servers and additional service instances.

look: Quick glossary: â€‹â€‹DevOps (TechRepublic Premium)

Fail and test complex systems

As the complexity increases, more tests are required. This can be a problem when testing what happens when a service fails under load. How does a transaction fail if the shopping cart backend has to switch databases during a purchase? What does the restaurant delivery tracker do when the main messaging platform goes down?

We need a test model that allows us to check the running system before element failures are triggered and system behavior is tracked. The idea is to introduce a small bug into a running system and monitor how the system responds to a number of target conditions. It’s a technology known as Chaos Engineering Unaccounted for Failure Modes Disclosure, DevOps The team wasn’t ready yet.

Intent Chaos Engineering The technique isn’t about studying how the system is failing, but it can be a useful side effect. Rather, it should show how resilient they are. Netflix always had to deliver a solid customer experience to ensure that users could watch movies and shows no matter what was going on in the background.

Unsurprisingly, these techniques are being adopted by other platforms, particularly hyperscale cloud providers such as: Microsoft Azure .. If your application is running on Azure, you need to make sure that in the case of a Microsoft server, your application is still running Fail. Microsoft’s Chaos engineering team regularly investigates the impact of errors on the platform to ensure that the services on which the application depends are properly handling the errors.

Build your own mess

Definitely readable developer content

But can you use the same technique in your own application to make sure it is as resilient as the service your code is using? There is no reason not to. Microsoft may have its own team of Site Reliability Engineers to keep Azure running, but when the code runs on a large scale it has its own knowledge of both the software and the services it uses. SRE is required.

When you work on a large scale, you need to implement some form of chaos engineering to ensure the recoverability of your application. Microsoft Provides Guidance Many of the ideas for using these techniques as part of your Azure documentation came from Netflix’s experience. Chaos says it’s a process.

Thatâ€™s not surprising. Chaos can be thought of as random, but if you are using chaos to test resilience, plan to treat it like security. Microsoft’s model is discussed from the point of view of attackers and defenders. The attacker is one side of the equation who inserts a bug into the system with the aim of destroying it. Defenders, on the other hand, assess the impact of the attack, analyze the outcome, and plan mitigation measures.

The test should be treated like a science experiment. You should start with a hypothesis like “The application will continue to work even if you lose a single backend database instance”. Next, define the injected fault. This will shut down the database of the running application. After all, you will get the results that you expect. The application continues to run. The chaos engineering platform must manage all three steps, start and stop tests, and provide a way to access test results.

look: Security chaos engineering helps to find weak points in cyber defense before an attacker does (TechRepublic)

One of the important aspects of the Chaos Test is to remember that the test has a blast radius. It should be noted that they are purposely destructive and may not work. This means that you can unplug your test at any time and return to normal operation as quickly as possible. Chaos Injection requires a rollback method. Whenever possible, you should automate the whole process with one button.

Azure DevOps third-party tools show that they are interested in using these techniques when testing their application. Proofdock-Tools It combines the chaos of chaos engineering with the latest development concepts and works with observable tools in order to provide a so-called “continuous validation” and to do everything in a familiar portal.

Introduction to Azure Chaos Studio

Microsoft is currently testing a number of Chaos engineering tools for Azure applications with select customers based on its own in-house tools. Azure CTO Mark Russinovich demonstrates at Microsoft’s Spring Virtual Ignite, a combination of the Azure test administration portal and a JSON-based test script language.

Testing in Azure Chaos Studio has two components: agents running on virtual servers or embedded in code, and direct access to Azure-specific services. These are controlled by the JSON experiment description. For example, test a failover of the Cosmos DB back end of your application by simulating a failure in one of the regions of your application. Alternatively, you can use the agent in your experiment to shut down a service host on a server running a node.js application or .NET code to test the recoverability of your own application.

The experiment consists of a series of steps, each step having an action. Microsoft has developed a domain-specific declarative language for working with the application infrastructure. It has some similarities with the resource description language Bisp. You can create experiments in Visual Studio Code, save them to Azure, and list them in the Chaos Studio portal. In the portal, choose an experiment that you want to run with other elements of the Azure developer tools, using application monitoring built into your code or Azure-specific service tools and monitoring the operation of your application. Beginning.

If you’re using another Continuous Integration / Continuous Development tool like Azure DevOps or GitHub Actions, Azure Chaos Studio provides a REST API that you can use as part of a series of integration tests when building a new version of your code. Increase. It makes sense to run Chaos Studio early in the application lifecycle because you can incorporate recovery power testing into your approval process.

As cloud-native development evolves, the way applications are built is increasingly becoming the way large cloud platforms and services code. Techniques previously only needed within Netflix and Azure are now needed by everyone, and with the advent of Chaos Studio on Azure, what used to be a custom tool is now a platform anyone can use. It helps a lot to change. Fulfill the promise of a resilient system.

How to use Chaos Engineering in Microsoft Azure

Fail and test complex systems

Build your own mess

Definitely readable developer content

Introduction to Azure Chaos Studio

Developer Essentials Newsletter

See also

Related posts: