Run continuous experiments in production

This guide shows how to use LaunchDarkly to experiment at the scale of AI, which can generate more variations than you can manually validate. With LaunchDarkly, you define variations, measure their real-world impact on production traffic, and promote the winning variations so they reach more of your users. This loop runs continuously, without redeploying.

Prerequisites

To complete this guide, you need the following:

  • A LaunchDarkly account.
  • LaunchDarkly installed and initialized in your application. To learn more, read SDK overview.
  • A feature or agent behavior you want to optimize.

Define success metrics before the experiment runs

Before you run an experiment, take the time to determine which metrics will indicate success. Changing metrics mid-experiment will invalidate the results.

How continuous experimentation works

The pattern is the same for code and for AgentControl configs: create variations, expose them to real users, measure what wins, promote it, and repeat.

This table shows where each product applies:

| Path | What you experiment on |
| --- | --- |
| CodeControl | Feature behavior, UI variations, application logic |
| AgentControl | Prompts, model selection, agent parameters |

The loop is the same in both cases; only the tooling differs slightly.

Step 1: Create variations and define success metrics

For code with CodeControl, create a multivariate feature flag with a variation for each option you want to test. Define the metrics that constitute a win, such as conversion rate, error rate, engagement, latency, or any business metric you can measure.

To learn more, read Creating new flags.

Here is an example of evaluating a multivariate flag:

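```python
# Assumes ld_client is an initialized LaunchDarkly client and
# context is the evaluation context for the current user.
# The multivariate flag returns the active variation for this context;
# "control" is the fallback value if the flag cannot be evaluated.
checkout_variant = ld_client.variation("checkout-flow-experiment", context, "control")

if checkout_variant == "one-page":
    render_one_page_checkout()
elif checkout_variant == "stepped":
    render_stepped_checkout()
else:
    render_current_checkout()  # control
```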
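Metrics need events from your application to measure. Here is a minimal sketch of recording a conversion event, assuming `client` is an initialized LaunchDarkly SDK client whose `track` method accepts an event key, a context, and an optional numeric `metric_value`. The function name `record_conversion` and the event key `checkout-completed` are hypothetical, chosen for illustration:

```python
def record_conversion(client, context, order_total):
    """Send a conversion event so a metric can attribute it to a variation.

    Assumptions: `client` is an initialized LaunchDarkly SDK client, and
    "checkout-completed" is a hypothetical event key that one of your
    metrics is configured to listen for.
    """
    client.track("checkout-completed", context, metric_value=order_total)
```

Call this with the same context you passed when evaluating the flag, so the event is attributed to the variation that context received.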

To experiment on AI features with AgentControl, create an AgentControl config with a variation for each prompt, model, or parameter combination you want to test. Set success metrics that reflect real agent performance, such as task completion rate, output quality scores, latency, or cost per call.

To learn more, read AgentControl.

Here is an example of retrieving an AgentControl config:

```python
# Assumes aiclient is an initialized client and default_config is the
# fallback configuration used if the config cannot be retrieved.
# The AgentControl config returns the active variation for this context.
agent_config = aiclient.agent_config("response-quality-experiment", context, default_config)

response = agent.run(
    model=agent_config["model"],
    system_prompt=agent_config["system_prompt"],
    user_input=user_input,
)
```
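Metrics such as latency and cost per call have to be measured around each agent call before you can report them. The following is a minimal sketch using only the Python standard library. The helper names `timed_call` and `cost_per_call` are hypothetical, and the per-1,000-token prices are inputs you supply, not values from LaunchDarkly or any model provider:

```python
import time


def timed_call(fn, *args, **kwargs):
    """Run fn and return (result, latency in milliseconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    latency_ms = (time.perf_counter() - start) * 1000.0
    return result, latency_ms


def cost_per_call(input_tokens, output_tokens, input_price_per_1k, output_price_per_1k):
    """Estimate the cost of one call from token counts and per-1,000-token prices."""
    input_cost = (input_tokens / 1000.0) * input_price_per_1k
    output_cost = (output_tokens / 1000.0) * output_price_per_1k
    return input_cost + output_cost
```

You could wrap the agent call in `timed_call` and send the resulting latency and cost as numeric metric events for the context that received the variation.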

Step 2: Expose variations to production traffic

Use LaunchDarkly’s Experimentation feature to split traffic across your variations. LaunchDarkly handles assigning traffic to variations, so each user or context consistently receives the same variation. The experiment tracks results for each variation against your defined metrics.
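Consistent assignment is commonly achieved by hashing a stable key rather than storing an assignment for each user. The sketch below illustrates that general idea only. It is not LaunchDarkly's actual bucketing algorithm, and the function name `assign_variation` and the hashing scheme are assumptions made for illustration:

```python
import hashlib


def assign_variation(experiment_key, context_key, variations, weights):
    """Deterministically map a context to a variation.

    `weights` are percentages that sum to 100, one per variation.
    The same experiment key and context key always produce the same result.
    """
    digest = hashlib.sha256(f"{experiment_key}.{context_key}".encode("utf-8")).hexdigest()
    # Interpret the first 8 hex digits as an integer and scale it to [0, 100).
    bucket = int(digest[:8], 16) / 0x100000000 * 100.0
    cumulative = 0.0
    for variation, weight in zip(variations, weights):
        cumulative += weight
        if bucket < cumulative:
            return variation
    return variations[-1]  # guard against floating-point rounding
```

Because the experiment key is part of the hash input, the same user can land in different buckets in different experiments, which avoids correlating assignments across experiments.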

Start with enough traffic to reach statistical significance in a reasonable timeframe. If you want to run many experiments at the same time, prioritize the ones tied to your highest-impact metrics.
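To judge whether a timeframe is reasonable, you can estimate how many contexts each variation needs before the experiment starts. The sketch below uses the standard two-sided, two-proportion sample size approximation from classical frequentist statistics. It is a planning heuristic under stated assumptions, not LaunchDarkly's statistical methodology, and the function name `sample_size_per_variation` is hypothetical:

```python
import math
from statistics import NormalDist


def sample_size_per_variation(baseline_rate, expected_rate, alpha=0.05, power=0.8):
    """Approximate contexts needed per variation to detect a change in a conversion rate.

    baseline_rate: current conversion rate, for example 0.10
    expected_rate: rate you hope the new variation achieves, for example 0.12
    alpha: acceptable false positive rate (two-sided)
    power: probability of detecting the change if it is real
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    pooled = (baseline_rate + expected_rate) / 2
    numerator = (
        z_alpha * math.sqrt(2 * pooled * (1 - pooled))
        + z_power * math.sqrt(
            baseline_rate * (1 - baseline_rate) + expected_rate * (1 - expected_rate)
        )
    ) ** 2
    return math.ceil(numerator / (expected_rate - baseline_rate) ** 2)
```

Dividing the result by the daily traffic you can send to each variation gives a rough duration. Smaller expected differences require substantially more traffic, which is one reason to prioritize experiments tied to high-impact metrics.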

Run experiments on real production traffic

Run experiments on real production traffic, not synthetic or internal traffic. Behavior in a staging environment rarely matches what real users do.

Step 3: Measure and promote the winner

As results accumulate, performance information appears for each variation. When a variation’s results reach statistical significance, do the following:

  1. Promote the winner by updating the flag or AgentControl config to serve the winning variation to 100% of traffic.
  2. Confirm the change takes effect immediately. No redeploy is required.
  3. Archive the losing variations to keep your flag and config inventory clean.

For AgentControl configs, promotion updates the active model or prompt configuration globally without a code change.
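LaunchDarkly performs the statistical analysis for you, but it can help to see what a significance check looks like. The following is a minimal sketch of a two-sided, two-proportion z-test on conversion counts. It is one common frequentist approach, not necessarily the method LaunchDarkly applies, and the function name `two_proportion_p_value` is hypothetical:

```python
import math
from statistics import NormalDist


def two_proportion_p_value(control_conversions, control_total,
                           treatment_conversions, treatment_total):
    """Return the two-sided p-value for a difference in conversion rates."""
    p_control = control_conversions / control_total
    p_treatment = treatment_conversions / treatment_total
    # Pooled rate under the null hypothesis that both variations convert equally.
    pooled = (control_conversions + treatment_conversions) / (control_total + treatment_total)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / control_total + 1 / treatment_total))
    if std_err == 0:
        return 1.0  # no variance observed, so no evidence of a difference
    z = (p_treatment - p_control) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))
```

A p-value below the threshold you chose before the experiment started, such as 0.05, suggests the difference is unlikely to be due to chance alone.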

Step 4: Keep the loop running

Promotion isn’t the end of the experiment. It’s the start of the next one. After you promote a winner:

  • Generate new variations against the new baseline established by your previous results.
  • Run the next experiment on the highest-impact question, which may have changed based on previous experiment results.
  • Let production data, not assumptions, drive each decision.

The goal is a system where every meaningful change starts as a variation in an experiment, generates signal, and is then promoted or discarded based on real data.

Next steps

To continue, explore the following topics:

  • Experimentation to configure traffic splits, metrics, and statistical analysis.
  • AgentControl to manage prompt and model variations for agent experiments.
  • Metrics to measure business and system impact for each variation.
  • Guarded releases to add safety guardrails to experiments in high-risk areas.