Enable self-healing systems with runtime controls
This guide shows the LaunchDarkly patterns that let your system detect degradation and respond automatically to mitigate failures. These responses include rolling back code, swapping AI models, and rerouting traffic, all without a human in the loop.
Self-healing is when a system detects a problem and works to correct it without human intervention. By the end of this guide, you will understand a remediation pattern for problems with both traditional code releases and AI agent deployments.
To complete this guide, you must have the following:
Self-healing relies on two capabilities working together:
LaunchDarkly provides two paths depending on what you control:
Work through the path that fits your use case, or both.
This section describes how to facilitate self-healing in your app code.
A key organizing principle is to wrap any code change you want runtime control over in a feature flag. This is a requirement for automated remediation because you can only roll back changes you control with a flag.
Here is an example of a payment-processor change guarded by a feature flag:
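The following is a minimal Python sketch of the pattern, not LaunchDarkly SDK code. `InMemoryFlagClient`, the flag key `use-new-payment-processor`, and the two `charge_with_*` functions are hypothetical names invented for illustration. The stand-in client mimics the general shape of a flag lookup that takes a key, a context, and a fallback value.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryFlagClient:
    """Hypothetical stand-in for a feature-flag client (not the LaunchDarkly SDK)."""
    flags: dict = field(default_factory=dict)
    available: bool = True

    def variation(self, key: str, context: dict, fallback):
        # Serve the fallback when the flag service is unreachable or the flag is unknown.
        if not self.available:
            return fallback
        return self.flags.get(key, fallback)


def charge_with_legacy_processor(amount_cents: int) -> str:
    return f"legacy:{amount_cents}"


def charge_with_new_processor(amount_cents: int) -> str:
    return f"new:{amount_cents}"


def charge(client: InMemoryFlagClient, context: dict, amount_cents: int) -> str:
    # A fallback of False keeps users on the known-good legacy path.
    use_new = client.variation("use-new-payment-processor", context, False)
    if use_new:
        return charge_with_new_processor(amount_cents)
    return charge_with_legacy_processor(amount_cents)
```

Because both code paths stay deployed, turning the flag off, or losing the connection to the flag service, routes every call back to the legacy processor without a redeploy.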
Every behavior you want to auto-remediate must be behind a flag. Part of creating a flag involves defining what you want to happen if you have to revert a change or if LaunchDarkly is unavailable. This is called your flag’s fallback value. If your code runs without these conditions defined, you have nothing to roll back to if something goes wrong.
To learn more, read Fallback value.
In LaunchDarkly, connect your flag to the metrics that matter for this change, such as error rate, p99 latency, conversion rate, or a custom business metric. For each metric, define the threshold that indicates a problem.
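To make the idea of a threshold concrete, here is a small Python sketch of what a threshold check amounts to. It is an illustration only and does not describe how LaunchDarkly computes or evaluates metrics. The `Thresholds` type, the `breached` function, and the example limits are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    max_error_rate: float      # fraction of failed requests, for example 0.02 for 2%
    max_p99_latency_ms: float  # ceiling for 99th-percentile latency


def p99(latencies_ms: list[float]) -> float:
    """Nearest-rank 99th percentile of the samples."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    rank = -(-99 * len(ordered) // 100)  # ceil(0.99 * n) using integer math
    return ordered[rank - 1]


def breached(errors: int, requests: int, latencies_ms: list[float],
             limits: Thresholds) -> list[str]:
    """Return the names of the thresholds that were crossed."""
    problems = []
    if requests and errors / requests > limits.max_error_rate:
        problems.append("error_rate")
    if latencies_ms and p99(latencies_ms) > limits.max_p99_latency_ms:
        problems.append("p99_latency")
    return problems
```

An empty result means the change looks healthy on these two metrics. A non-empty result is the signal that would trigger remediation.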
Configure the metrics and thresholds in the LaunchDarkly UI, on the Guarded releases page for your flag. Set the following:
Guarded releases is available to customers on Business and Enterprise plans. To learn more, read Guarded releases.
Release the change gradually with percentage rollouts, rather than a full release to 100% of your user base. Start at a small percentage of traffic, such as 5% to 10%, and let LaunchDarkly observe the metrics you connected in Step 2 before you expand the release to a larger audience.
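The sketch below illustrates how a percentage rollout can assign users consistently, so the same user stays in or out of the rollout as the percentage grows. It is a generic hashing illustration and is not LaunchDarkly's bucketing algorithm. The function names and the flag key in the usage note are hypothetical.

```python
import hashlib


def bucket(flag_key: str, user_key: str) -> float:
    """Map a user to a stable position in [0, 100) for this flag."""
    digest = hashlib.sha256(f"{flag_key}:{user_key}".encode("utf-8")).hexdigest()
    return (int(digest[:8], 16) % 10_000) / 100.0


def in_rollout(flag_key: str, user_key: str, percent: float) -> bool:
    """True if the user falls inside the first `percent` of buckets."""
    return bucket(flag_key, user_key) < percent
```

For example, `in_rollout("new-checkout", "user-123", 5)` always returns the same answer for that user, and any user included at 5% is still included when the rollout expands to 50%.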
If the thresholds you defined are crossed at any rollout stage, LaunchDarkly will halt the rollout and take the remediation action you configured in Step 2.
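The halt-and-remediate behavior can be sketched as a simple loop. This Python example is a conceptual model under stated assumptions, not LaunchDarkly's implementation: the stage percentages, `run_guarded_rollout`, and both callbacks are hypothetical, and the remediation action is assumed to be a full rollback to 0%.

```python
from typing import Callable, Sequence

STAGES = (5, 25, 50, 100)  # hypothetical rollout percentages


def run_guarded_rollout(
    set_percentage: Callable[[int], None],
    thresholds_crossed: Callable[[int], bool],
    stages: Sequence[int] = STAGES,
) -> str:
    """Advance stage by stage and revert to 0% as soon as a threshold is crossed."""
    for percent in stages:
        set_percentage(percent)
        if thresholds_crossed(percent):
            set_percentage(0)  # remediation: send everyone back to the old code path
            return f"rolled back at {percent}%"
    return "fully released"
```

A rollout that only degrades under heavier load, for example one where `thresholds_crossed` returns `True` from 50% onward, passes the 5% and 25% stages and is rolled back at 50%.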
Rolling out a change incrementally behind a feature flag, so that the change either succeeds or is remediated automatically, is how you use LaunchDarkly to make each code release safer.
This section describes how to facilitate self-healing when using AI agents.
Move your agent’s prompts, model selection, and routing configuration into LaunchDarkly AgentControl configs instead of hardcoding them. This gives you runtime control over what the agent does without requiring a redeploy.
Here is an example of retrieving the active AgentControl config in Python:
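The following Python sketch shows the general shape of reading agent settings at runtime instead of hardcoding them. It does not use a LaunchDarkly SDK: `InMemoryConfigStore`, `AgentConfig`, the config key `support-agent`, and the model names are all hypothetical stand-ins, so treat the calls as assumptions and check the SDK reference for the real API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AgentConfig:
    model: str
    prompt: str


# Hardcoded fallback used only when no runtime config is available.
FALLBACK = AgentConfig(model="baseline-model", prompt="You are a helpful support agent.")


class InMemoryConfigStore:
    """Hypothetical stand-in for a runtime config service (not a LaunchDarkly SDK)."""

    def __init__(self, configs: Optional[dict] = None):
        self._configs = dict(configs or {})

    def get(self, key: str, fallback: AgentConfig) -> AgentConfig:
        return self._configs.get(key, fallback)

    def update(self, key: str, config: AgentConfig) -> None:
        self._configs[key] = config


def build_request(store: InMemoryConfigStore, question: str) -> dict:
    # Read the config on every call so a runtime change applies to the next request.
    config = store.get("support-agent", FALLBACK)
    return {"model": config.model, "system": config.prompt, "user": question}
```

Because `build_request` looks up the config on each call, updating the stored config changes the model and prompt for the next request without a redeploy.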
If you want to simulate the outcomes of different inputs on different models, create a playground. You can upload a dataset or adjust different thresholds and prompt options to configure results.
To learn more, read Playgrounds.
Judges are automated evaluators that score your agent’s behavior on the dimensions that matter. You can use pre-defined judges, or bring your own custom judges to gauge dimensions like quality, cost, latency, correctness, or safety. Connect these evaluations to your AgentControl config as metrics.
To learn more, read Judges.
AgentControl uses these evaluation scores to trigger automated remediation, just as CodeControl uses metric thresholds.
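Conceptually, a judge is a function that turns an agent response into a score. The Python sketch below shows two toy custom judges and a helper that collects their scores as named metrics. The `Judge` type, both judge functions, the blocked terms, and the scoring rules are assumptions for illustration and do not reflect how LaunchDarkly judges are implemented.

```python
from typing import Callable

Judge = Callable[[str, str], float]  # (question, answer) -> score in [0.0, 1.0]


def conciseness_judge(question: str, answer: str) -> float:
    """Toy judge: full marks up to 50 words, then a linear penalty."""
    words = len(answer.split())
    if words <= 50:
        return 1.0
    return max(0.0, 1.0 - (words - 50) / 100)


def safety_judge(question: str, answer: str) -> float:
    """Toy judge: fail any answer that contains a blocked term."""
    blocked = {"password", "ssn"}
    return 0.0 if any(term in answer.lower() for term in blocked) else 1.0


def evaluate(question: str, answer: str, judges: dict[str, Judge]) -> dict[str, float]:
    """Run every judge and return one named score per dimension."""
    return {name: judge(question, answer) for name, judge in judges.items()}
```

Each named score plays the same role for an agent that error rate or latency plays for a code release: it is a number that a threshold can be set against.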
Define what LaunchDarkly should do when evaluation scores degrade. For example, you could:
Configure these responses in your AgentControl config’s automation settings in the LaunchDarkly UI.
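As an illustration of one such response, swapping to a different model when scores degrade, here is a minimal Python sketch. It models the decision logic only and is not how LaunchDarkly evaluates or applies automation. `ScoreGuard`, the window size, the minimum average, and the model names are hypothetical.

```python
from collections import deque


class ScoreGuard:
    """Track recent judge scores and switch to a fallback model when they degrade."""

    def __init__(self, primary: str, fallback: str, min_average: float, window: int):
        self.fallback = fallback
        self.min_average = min_average
        self.scores = deque(maxlen=window)  # keeps only the most recent scores
        self.active_model = primary

    def record(self, score: float) -> str:
        """Add a score and return the model that should serve the next request."""
        self.scores.append(score)
        window_full = len(self.scores) == self.scores.maxlen
        if window_full and sum(self.scores) / len(self.scores) < self.min_average:
            self.active_model = self.fallback  # remediation: swap models
        return self.active_model
```

In this sketch the swap is sticky: once the guard falls back, it stays on the fallback model until someone resets it. Waiting for a full window before acting avoids reacting to a single low score.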
Before you rely on automated remediation in production, test the full loop. Here’s how:
If remediation doesn’t fire, check that your metrics source is connected and reporting, and that your flag evaluation uses the correct context.
To continue, explore the following topics: