This guide shows how to use LaunchDarkly to experiment at the scale of AI, which can generate more variations than you can manually validate. With LaunchDarkly, you define variations, measure real-world impact against production traffic, and promote the winning variations to give them larger reach. This loop runs continuously, without redeploying.
To complete this guide, you need the following:
Before you run an experiment, take the time to determine which metrics will indicate success. Changing metrics mid-experiment will invalidate the results.
The pattern is the same for code and for AgentControl configs: create variations, expose them to real users, measure what wins, promote it, and repeat.
This table shows where each product applies:
The loop is the same in both cases. The tooling differs slightly.
For code with CodeControl, create a multivariate feature flag with a variation for each option you want to test. Define the metrics that constitute a win, such as conversion rate, error rate, engagement, latency, or any business metric you can measure.
To learn more, read Creating new flags.
Here is an example of evaluating a multivariate flag:
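As a minimal sketch of the pattern, the code below evaluates a multivariate string flag and branches on the result. It runs without an SDK key because it uses a hypothetical in-memory `StubFlagClient` in place of a real LaunchDarkly client. Its `variation(flag_key, context_key, default)` method only imitates the general shape of a server-side variation call, simplified to take a plain context key rather than a full context object. The flag key `checkout-button-copy` and its variation values are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class StubFlagClient:
    """Hypothetical in-memory stand-in for a feature flag client."""

    # Maps (flag key, context key) to the variation value served.
    assignments: dict = field(default_factory=dict)

    def variation(self, flag_key: str, context_key: str, default: str) -> str:
        # Fall back to the default when no variation is available,
        # for example when the flag is missing or the client is offline.
        return self.assignments.get((flag_key, context_key), default)


def checkout_button_label(client: StubFlagClient, context_key: str) -> str:
    # "checkout-button-copy" and its variation values are made-up examples.
    value = client.variation("checkout-button-copy", context_key, "control")
    labels = {
        "control": "Buy now",
        "urgency": "Buy now, only a few left",
        "benefit": "Get it delivered tomorrow",
    }
    # Unrecognized values fall back to the control experience.
    return labels.get(value, labels["control"])
```

Always passing a default and handling unrecognized values keeps the application on the control experience if a variation is renamed or removed while the experiment is running.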
To experiment on AI features with AgentControl, create an AgentControl config with a variation for each prompt, model, or parameter combination you want to test. Set success metrics that reflect real agent performance, such as task completion rate, output quality scores, latency, or cost per call.
To learn more, read AgentControl.
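To make the success metrics concrete, the sketch below aggregates task completion rate, mean latency, and cost per call for each variation from a list of call records. The `AgentCall` record and its field names are hypothetical. In practice these values would come from whatever events your application already records, and this is not how LaunchDarkly computes its metrics.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class AgentCall:
    """Hypothetical record of one agent invocation."""

    variation: str
    completed: bool
    latency_ms: float
    cost_usd: float


def summarize(calls: list[AgentCall]) -> dict[str, dict[str, float]]:
    """Group calls by variation and compute per-variation metrics."""
    by_variation: dict[str, list[AgentCall]] = {}
    for call in calls:
        by_variation.setdefault(call.variation, []).append(call)
    return {
        name: {
            "task_completion_rate": mean(1.0 if c.completed else 0.0 for c in group),
            "mean_latency_ms": mean(c.latency_ms for c in group),
            "cost_per_call_usd": mean(c.cost_usd for c in group),
        }
        for name, group in by_variation.items()
    }
```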
Here is an example of retrieving an AgentControl config:
Use LaunchDarkly’s Experimentation feature to split traffic across your variations. LaunchDarkly handles assigning traffic to variations, so each user or context consistently receives the same variation. The experiment tracks results for each variation against your defined metrics.
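Consistent assignment is commonly achieved by hashing a stable key into a bucket. The sketch below illustrates that general idea only. It is not LaunchDarkly's actual bucketing algorithm, and the function name, key format, and weight scheme are assumptions chosen for the example.

```python
import hashlib


def assign_variation(experiment_key: str, context_key: str, weights: dict[str, int]) -> str:
    """Map a context to a variation; the same inputs always give the same result."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    digest = hashlib.sha256(f"{experiment_key}:{context_key}".encode()).hexdigest()
    # Interpret the first 8 hex digits as an integer and reduce it to a bucket.
    bucket = int(digest[:8], 16) % total
    cumulative = 0
    for name, weight in weights.items():
        cumulative += weight
        if bucket < cumulative:
            return name
    # Only reachable if some weights are negative.
    raise ValueError("weights must be non-negative")
```

Because the experiment key is part of the hashed input, the same context can land in different buckets across different experiments while staying fixed within any one of them.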
Start with enough traffic to reach statistical significance in a reasonable timeframe. If you want to run many experiments at the same time, prioritize the ones tied to your highest-impact metrics.
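For a rough sense of how much traffic is enough, the sketch below uses the standard normal-approximation formula for comparing two conversion rates. It is a generic frequentist estimate, not the calculation LaunchDarkly performs, and the default significance level and power are conventional assumptions rather than values from this guide.

```python
import math
from statistics import NormalDist


def sample_size_per_variation(
    baseline: float, expected: float, alpha: float = 0.05, power: float = 0.8
) -> int:
    """Approximate contexts needed per variation to detect baseline -> expected."""
    if baseline == expected:
        raise ValueError("baseline and expected rates must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_beta = NormalDist().inv_cdf(power)
    variance = baseline * (1 - baseline) + expected * (1 - expected)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (expected - baseline) ** 2)
```

Dividing the result by the daily traffic each variation receives gives an approximate run time. Smaller expected differences require substantially more traffic, which is one reason to prioritize experiments tied to high-impact metrics.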
Run experiments on real production traffic, not synthetic or internal traffic. Behavior in a staging environment rarely matches what real users do.
As results accumulate, performance information appears for each variation. When a variation’s results reach statistical significance, do the following:
For AgentControl configs, promotion updates the active model or prompt configuration globally without a code change.
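To illustrate what a significance check involves, the sketch below runs a two-sided two-proportion z-test on conversion counts from a control and a candidate variation. This is a textbook test shown for intuition only. LaunchDarkly's Experimentation feature performs its own statistical analysis, which may use a different method, and the function name and inputs here are assumptions.

```python
import math
from statistics import NormalDist


def two_proportion_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # No variance in the data, so no evidence of a difference.
        return 1.0
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))
```

A p-value below the significance level chosen before the experiment started suggests the observed difference is unlikely to be noise. Checking repeatedly and stopping at the first low value inflates false positives, so the threshold and stopping rule should be set up front along with the metrics.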
Promotion isn’t the end of the experiment. It’s the start of the next one. After you promote a winner:
The goal is a system where every meaningful change starts as a variation in an experiment, generates signal, and is then promoted or discarded based on real data.
To continue, explore the following topics: