Enable self-healing systems with runtime controls

This guide shows the LaunchDarkly patterns that let your system detect degradation and respond automatically to mitigate failures. These responses include rolling back code, swapping AI models, and rerouting traffic, all without a human in the loop.

Self-healing is when a system detects a problem and works to correct it without human intervention. By the end of this guide, you will understand a remediation pattern for problems with both traditional code releases and AI agent deployments.

Prerequisites

To complete this guide, you must have the following:

A LaunchDarkly account.
LaunchDarkly installed and initialized in your application. To learn more, read SDK.
A basic understanding of feature flags and flag targeting in LaunchDarkly. To learn more, read Targeting.

How self-healing works in LaunchDarkly

Self-healing relies on two capabilities working together:

Runtime control: your code and agents run behind feature flags or AgentControl configs, so you can change their behavior instantly without a redeploy.
Automated remediation: metrics-based thresholds or monitoring integrations trigger flag changes automatically when degradation is detected.

LaunchDarkly provides two paths depending on what you control:

Path	What you control	Related LaunchDarkly features
CodeControl	Traditional application code, services, infrastructure behavior	Feature flags, Guarded releases, flag automation
AgentControl	AI agents, LLM prompts, model selection, routing logic	AgentControl configs, model evaluation, automated config updates

Work through the path that fits your use case, or both.

CodeControl: Self-healing for application code

This section describes how to facilitate self-healing in your app code.

Step 1: Wrap every significant change in a feature flag

A key organizing principle is to wrap any code change you want runtime control over in a feature flag. This is a requirement for automated remediation because you can only roll back changes you control with a flag.

Here is an example of a payment-processor change wrapped in a feature flag:

Python

# Evaluate the flag for the current context before the new code runs.
# The third argument (False) is the fallback value, so the legacy path is the default.
if ld_client.variation("new-payment-processor", context, False):
    result = new_payment_processor.charge(order)
else:
    result = legacy_payment_processor.charge(order)

Every behavior you want to auto-remediate must be behind a flag. Part of creating a flag involves defining what you want to happen if you have to revert a change or if LaunchDarkly is unavailable. This is called your flag’s fallback value. If your code runs without these conditions defined, you have nothing to roll back to if something goes wrong.

To learn more, read Fallback value.
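To make the role of the fallback value concrete, here is a minimal sketch that does not use the LaunchDarkly SDK. `StubFlagClient` is a hypothetical in-memory stand-in whose `variation` method mirrors the three-argument call in the example above. It returns the caller-supplied fallback whenever the flag service is marked unavailable or the flag is missing, which is a simplification of real SDK behavior.

```python
# Hypothetical stand-in for a flag client; NOT the LaunchDarkly SDK.
class StubFlagClient:
    def __init__(self, flags, available=True):
        self.flags = flags          # flag key -> current value
        self.available = available  # simulates connectivity to the flag service

    def variation(self, key, context, fallback):
        # Serve the fallback when the service is unreachable or the flag is unknown.
        if not self.available or key not in self.flags:
            return fallback
        return self.flags[key]


def choose_processor(client, context):
    # False is the fallback value: the legacy path is the safe default.
    if client.variation("new-payment-processor", context, False):
        return "new"
    return "legacy"


context = {"key": "user-123"}  # simplified context, for illustration only
healthy = StubFlagClient({"new-payment-processor": True})
outage = StubFlagClient({"new-payment-processor": True}, available=False)

print(choose_processor(healthy, context))  # new
print(choose_processor(outage, context))   # legacy
```

Choosing `False` as the fallback means that an outage degrades to the known-good legacy path rather than to the unproven one.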

Step 2: Define your guardrails

In LaunchDarkly, connect your flag to the metrics that matter for this change. Define the thresholds that indicate a problem, such as error rate, latency p99, conversion drop, or a custom business metric.

Configure this in the LaunchDarkly UI, on the Guarded releases page for your flag. Set the following:

The metric to monitor, such as error rate or latency.
The threshold that triggers a remediation action.
The action to take when the threshold is crossed. For example, you might want to roll the change back to 0%, disable the flag, or notify an on-call engineering team.
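The three settings above can be thought of as a small data structure plus a comparison. The following is a conceptual sketch only: `Guardrail` and `triggered_actions` are illustrative names, not LaunchDarkly settings or APIs, and the threshold values are invented for the example.

```python
from dataclasses import dataclass


# Conceptual model only; these names are illustrative, not LaunchDarkly APIs.
@dataclass(frozen=True)
class Guardrail:
    metric: str       # e.g. "error_rate" or "latency_p99_ms"
    threshold: float  # value above which the release counts as degraded
    action: str       # e.g. "rollback", "disable_flag", "notify_on_call"


def triggered_actions(guardrails, observed):
    """Return the action of every guardrail whose metric exceeds its threshold."""
    actions = []
    for g in guardrails:
        value = observed.get(g.metric)
        # A metric that is not reporting cannot trigger anything, so skip it.
        if value is not None and value > g.threshold:
            actions.append(g.action)
    return actions


guardrails = [
    Guardrail("error_rate", 0.02, "rollback"),
    Guardrail("latency_p99_ms", 800.0, "notify_on_call"),
]

print(triggered_actions(guardrails, {"error_rate": 0.05, "latency_p99_ms": 420.0}))
# ['rollback']
```

This sketch assumes that higher values are worse. A metric such as conversion, where a drop signals the problem, would need the comparison reversed.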

Guarded releases plan availability

Guarded releases is available to customers on Business and Enterprise plans. To learn more, read Guarded releases.

Step 3: Roll out gradually and observe

Release the change gradually with percentage rollouts, rather than a full release to 100% of your user base. Start at a small percentage of traffic, such as 5% to 10%, and let LaunchDarkly observe the metrics you connected in Step 2 before you expand the release to a larger audience.

If the thresholds you defined are crossed at any rollout stage, LaunchDarkly will halt the rollout and take the remediation action you configured in Step 2.
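The halt-and-remediate behavior can be illustrated with a short simulation. This is a simplified model under stated assumptions, not LaunchDarkly's implementation: the stage percentages, the error-rate threshold, and the `run_rollout` function are all hypothetical, and the remediation action is assumed to be a rollback to 0%.

```python
# Simplified simulation of a guarded rollout; not LaunchDarkly's implementation.
STAGES = [5, 10, 25, 50, 100]  # rollout percentages, chosen for illustration
ERROR_RATE_THRESHOLD = 0.02    # hypothetical guardrail


def run_rollout(observe_error_rate):
    """Advance through STAGES, rolling back to 0% if the guardrail is crossed.

    observe_error_rate is a caller-supplied function: percentage -> error rate.
    Returns (final_percentage, history of (percentage, error_rate) pairs).
    """
    history = []
    for pct in STAGES:
        rate = observe_error_rate(pct)
        history.append((pct, rate))
        if rate > ERROR_RATE_THRESHOLD:
            return 0, history  # halt the rollout and remediate
    return 100, history


# A change that only degrades once it receives heavier traffic.
def degrades_at_scale(pct):
    return 0.01 if pct < 25 else 0.06


final_pct, history = run_rollout(degrades_at_scale)
print(final_pct)  # 0
print(history)    # [(5, 0.01), (10, 0.01), (25, 0.06)]
```

In this run the problem only appears at 25%, so a quarter of traffic at most is exposed before the rollback, and the 50% and 100% stages never start.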

Rolling out a change incrementally behind a feature flag, so that it either succeeds or is remediated automatically, is how LaunchDarkly helps you release code safely.

AgentControl: Self-healing for AI agents

This section describes how to facilitate self-healing when using AI agents.

Step 1: Define your agent’s behavior in AgentControl configs

Move your agent’s prompts, model selection, and routing configuration into LaunchDarkly AgentControl configs instead of hardcoding them. This gives you runtime control over what the agent does without requiring a redeploy.

Here is an example of retrieving the active AgentControl config in Python:

Python

# Fetch the active AgentControl config for this agent
ai_config = aiclient.agent_config("support-agent-config", context, default_config)

response = llm_client.complete(
    model=ai_config["model"],
    prompt=ai_config["system_prompt"],
    user_input=user_message
)
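The example above passes `default_config` without defining it. The sketch below shows one shape it might take, assuming the config can be read like a dictionary with the `model` and `system_prompt` keys used above. The model names, the prompt text, and the `resolve_config` helper are hypothetical placeholders, not values or functions from LaunchDarkly.

```python
# Hypothetical default, used when no config can be retrieved.
# The keys match the ones read in the example above; the values are placeholders.
default_config = {
    "model": "example-fallback-model",
    "system_prompt": "You are a support assistant. Answer briefly and accurately.",
}

REQUIRED_KEYS = ("model", "system_prompt")


def resolve_config(fetched, default):
    """Return fetched if it has every required key, otherwise the default."""
    if fetched is None:
        return default
    if any(not fetched.get(key) for key in REQUIRED_KEYS):
        return default
    return fetched


partial = {"model": "example-primary-model"}  # missing system_prompt
print(resolve_config(partial, default_config)["model"])  # example-fallback-model
```

Falling back to the whole default, rather than filling in single keys, avoids pairing a prompt with a model it was never tested against.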

If you want to simulate the outcomes of different inputs on different models, create a playground. You can upload a dataset or adjust different thresholds and prompt options to configure results.

To learn more, read Playgrounds.

Step 2: Use judges to evaluate live performance

Judges are automated evaluators that score your agent’s behavior on the dimensions that matter. You can use pre-defined judges, or bring your own custom judges to gauge dimensions like quality, cost, latency, correctness, or safety. Connect these evaluations to your AgentControl config as metrics.

To learn more, read Judges.

AgentControl uses these evaluation scores to trigger automated remediation, just like CodeControl uses metric thresholds.
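As an illustration of how an evaluation can become a metric, the sketch below defines a toy custom judge and a rolling average of its scores. Everything here is an assumption made for the example: the `brevity_judge` scoring rule, the `RollingScore` class, and the window size are hypothetical and do not represent LaunchDarkly's judge interface.

```python
from collections import deque


# Illustrative custom judge; the scoring rule is an assumption for this example.
def brevity_judge(response, max_words=50):
    """Score 1.0 up to max_words, decreasing linearly to 0.0 at twice that length."""
    words = len(response.split())
    if words <= max_words:
        return 1.0
    return max(0.0, 1.0 - (words - max_words) / max_words)


class RollingScore:
    """Mean of the most recent `window` scores, usable as a metric value."""

    def __init__(self, window=100):
        self.scores = deque(maxlen=window)

    def record(self, score):
        self.scores.append(score)

    def mean(self):
        # None signals "no data yet" instead of a misleading 0.0.
        return sum(self.scores) / len(self.scores) if self.scores else None


metric = RollingScore(window=3)
for reply in ["Short answer.", "word " * 75, "word " * 100]:
    metric.record(brevity_judge(reply))

print(metric.mean())  # 0.5
```

Averaging over a window smooths out a single poor response, so remediation reacts to a sustained decline rather than to one outlier.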

Step 3: Configure automated rerouting

Define what LaunchDarkly should do when evaluation scores degrade. For example, you could:

Swap to a fallback model.
Roll back to a different prompt configuration that you know is good.
Reroute the request to a different agent path.

Configure these responses in your AgentControl config’s automation settings in the LaunchDarkly UI.
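The decision behind a model or prompt swap can be sketched in a few lines. This is a conceptual model, not how LaunchDarkly implements automation: the `Rerouter` class, the two config dictionaries, and the 0.7 minimum score are hypothetical.

```python
# Conceptual sketch of score-based rerouting; names and values are hypothetical.
PRIMARY = {"model": "example-primary-model", "system_prompt": "prompt-v2"}
FALLBACK = {"model": "example-fallback-model", "system_prompt": "prompt-v1"}


class Rerouter:
    def __init__(self, min_score):
        self.min_score = min_score  # below this score, switch to the fallback
        self.active = PRIMARY

    def update(self, score):
        """Return the config to serve, given the latest evaluation score."""
        if score < self.min_score:
            self.active = FALLBACK
        return self.active

    def reset(self):
        # Returning to the primary config is a deliberate, separate step.
        self.active = PRIMARY


router = Rerouter(min_score=0.7)
served = [router.update(score)["model"] for score in (0.9, 0.65, 0.95)]
print(served)
# ['example-primary-model', 'example-fallback-model', 'example-fallback-model']
```

The sketch stays on the fallback after the score recovers, because the later scores describe the fallback, not the primary config. Switching back automatically on that evidence could cause traffic to flap between the two.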

Verify your remediation loop

Before you rely on automated remediation in production, test the full loop. Here’s how:

Trigger a threshold violation in a non-production environment by injecting errors, simulating latency, or degrading evaluation scores.
Confirm LaunchDarkly detects the threshold crossing.
Confirm the configured remediation action executes by checking that the flag rolls back, the model swaps, or traffic reroutes.
Confirm your application responds to the flag change without requiring a redeploy.

If remediation doesn’t fire, check that your metrics source is connected and reporting, and that your flag evaluation uses the correct context.
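The verification steps in this section can be rehearsed as a small automated test before you try them against a real environment. The sketch below uses in-memory stand-ins only and calls nothing in LaunchDarkly: the `flags` dictionary, the `remediate_if_needed` function, and the 2% threshold are hypothetical.

```python
# End-to-end rehearsal with in-memory stand-ins; nothing here calls LaunchDarkly.
flags = {"new-payment-processor": True}  # hypothetical flag store
THRESHOLD = 0.02                         # hypothetical error-rate guardrail


def handle_request():
    # The flag is read on every request, so a change applies without a redeploy.
    return "new" if flags.get("new-payment-processor", False) else "legacy"


def remediate_if_needed(errors, requests):
    """Turn the flag off when the observed error rate crosses THRESHOLD."""
    detected = requests > 0 and errors / requests > THRESHOLD
    if detected:
        flags["new-payment-processor"] = False
    return detected


before = handle_request()
detected = remediate_if_needed(errors=10, requests=100)  # injected failure: 10% errors
after = handle_request()

assert before == "new"                          # the change was live
assert detected                                 # the threshold crossing was detected
assert flags["new-payment-processor"] is False  # the remediation action executed
assert after == "legacy"                        # the app responded without a redeploy
```

Each assertion corresponds to one of the checks listed above, so a failing assertion points to the stage of the loop that needs attention.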

Next steps

To continue, explore the following topics:

Guarded releases to configure automated rollback thresholds.
AgentControl for full agent runtime control.
Metrics to feed signals into your remediation logic.
Targeting to control rollout scope before automating remediation.