Decision making with Bayesian statistics
Overview
This topic explains how to make decisions about which variation to choose as the winner in a LaunchDarkly experiment that uses Bayesian statistics.
In cases where you have many metrics to consider, it can be difficult to apply a consistent decision-making strategy. To streamline decision-making, LaunchDarkly experiment results focus on the primary metric for feature change experiments, and on the metric for the final step in funnel optimization experiments, when determining the winning variation.
Leading variations
As an experiment runs, LaunchDarkly identifies a “leading variation”: the variation with the highest probability to be best according to the primary metric. The leading variation can be either the control or a treatment variation. LaunchDarkly highlights the leading variation when there is not yet enough experiment data to declare a winning variation.
Winning variations
When an experiment has gathered enough data, LaunchDarkly declares a “winning variation”: the leading variation whose probability to be best exceeds the Bayesian threshold you set when you created the experiment.
If a Bayesian experiment has collected enough data to determine a winning variation, and the winning variation is not the control, then the winning variation is highlighted in green in its results chart.
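The following is a minimal sketch of that decision rule. The variation names, probabilities, and 90% threshold are hypothetical example values, not output from the LaunchDarkly API; the sketch only illustrates how the leading variation becomes the winning variation once its probability to be best exceeds the threshold.

```python
def pick_winner(prob_to_be_best: dict, threshold: float):
    """Return (leading_variation, winning_variation), where the winner is None
    until the leader's probability to be best exceeds the threshold."""
    leader = max(prob_to_be_best, key=prob_to_be_best.get)
    winner = leader if prob_to_be_best[leader] > threshold else None
    return leader, winner

# Hypothetical probability-to-be-best values for a three-variation experiment.
probabilities = {"control": 0.05, "treatment-1": 0.15, "treatment-2": 0.80}

leader, winner = pick_winner(probabilities, threshold=0.90)
print(leader)  # treatment-2 is the leading variation...
print(winner)  # ...but not yet the winner, because 0.80 < 0.90
```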
Example: Search engine optimization
Consider an example where you need to choose between four search engines for your website, and you’re evaluating each one based on conversion rate. You’re measuring two types of probabilities for each search engine:
- Probability to beat control: the likelihood that the treatment search engine will achieve a higher conversion rate than the control search engine
- Probability to be best: the likelihood that the search engine will achieve a higher conversion rate than all other search engines
This table displays each search engine’s probability to beat control, probability to be best, and expected loss:
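For conversion-rate metrics, both probabilities can be estimated by sampling from each variation's posterior distribution and comparing the samples. The sketch below uses a Beta-Binomial model with made-up conversion counts; the counts and the model choice are illustrative assumptions for this example, not LaunchDarkly's internal calculation.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical conversions and visitors per search engine (control first).
data = {
    "control":         (200, 4000),
    "search-engine-2": (215, 4000),
    "search-engine-3": (220, 4000),
    "search-engine-4": (240, 4000),
}

# Posterior samples of each conversion rate, assuming a Beta(1, 1) prior.
samples = {
    name: rng.beta(1 + conv, 1 + visits - conv, size=100_000)
    for name, (conv, visits) in data.items()
}

names = list(samples)
stacked = np.column_stack([samples[name] for name in names])
best = np.argmax(stacked, axis=1)  # index of the best variation in each sample

for i, name in enumerate(names):
    # The control's probability to beat itself is trivially zero.
    p_beat_control = np.mean(samples[name] > samples["control"])
    p_best = np.mean(best == i)
    print(f"{name}: P(beat control) = {p_beat_control:.2f}, P(best) = {p_best:.2f}")
```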
Switching between variations in a feature flag is a small configuration change, but the switch itself often carries other costs in time, money, or resources.
For example, implementing a new search engine might require additional hardware, software integration, or employee training. Switching to variations with only marginal or uncertain improvements can lead to unnecessary disruptions or even negative outcomes if the observed effects don’t hold up over time. Therefore, you might only consider switching to search engines 2, 3, or 4 if there’s more than a 90% likelihood that the observed improvement over the control search engine is genuine and not due to random chance.
The 90% threshold reflects how confident you’d like to be before making a decision. If you require higher confidence, such as 95% or 99%, it will typically take more time to gather enough data to reach that level of certainty. To accommodate different levels of confidence, LaunchDarkly lets you customize your success threshold when creating an experiment.
In this scenario, the most logical choice is search engine 4: it has a high probability of outperforming the control, as estimated by the probability to beat control, and the highest chance of achieving the best conversion rate among all four options, as estimated by the probability to be best. This approach can be considered an “optimal strategy” for decision-making.
However, it’s important to consider both risk and performance. You should assess the potential downside, or expected loss, in cases where there’s a small chance that the winning variation fails to deliver a genuine improvement. In our example, this means evaluating whether it would be acceptable for the conversion rate to drop by 0.25% if the chosen search engine ultimately fails to outperform the control. To learn more, read Expected loss.
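To make the expected-loss check concrete, the snippet below estimates it from posterior samples: it averages how far the chosen variation's conversion rate falls short of the control in the scenarios where the control turns out to be better. The conversion counts and the Beta-Binomial model are the same kind of illustrative assumptions as above, not LaunchDarkly's exact calculation.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical conversion data for the control and the chosen variation.
conversions = {"control": (200, 4000), "search-engine-4": (240, 4000)}

samples = {
    name: rng.beta(1 + conv, 1 + visits - conv, size=100_000)
    for name, (conv, visits) in conversions.items()
}

# Expected loss of choosing search engine 4: the average conversion-rate
# shortfall in the posterior scenarios where the control is actually better.
shortfall = np.maximum(samples["control"] - samples["search-engine-4"], 0)
print(f"Expected loss: {shortfall.mean():.5f}")
```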
Decision making when the best option changes
You may find in some experiments that the leading variation changes from day to day. For example, on Monday variation 1 is in the lead, and on Tuesday variation 2 is in the lead. This typically happens when there is no real difference between the variations, so the results shift slightly from day to day depending on which end users encounter the experiment.
Additionally, variations may perform differently over time due to seasonal effects. For example, a “weekend effect” can occur when user behavior shifts significantly between weekdays and weekends. This may cause one variation to be the leader on certain days, only to be outperformed by another on different days. If you suspect a weekend effect or other seasonal trends in your experiment and are seeking a holistic view, make sure the experiment runs long enough to capture several complete weekly or seasonal cycles. This will help smooth out time-based fluctuations and provide a clearer, more accurate view of each variation’s performance.
To learn more, read Bayesian versus frequentist statistics.