Statistical methodology for Bayesian experiments

This guide includes advanced concepts

This section includes an explanation of advanced statistical concepts. We provide them for informational purposes, but you do not need to understand these concepts to use Experimentation.

Overview

This guide explains the statistical methodology LaunchDarkly uses to calculate Bayesian experiment variation means, and how you can use these formulas to validate your results.

For a high-level overview of Bayesian and frequentist statistics, read Bayesian versus frequentist statistics.

Core formulas

The core formulas include the posterior mean, the data mean, and the data weight. We describe these in detail below.

Posterior mean

In the Bayesian approach, the main quantity we report is the mean of the posterior distribution calculated by updating the prior distribution with data observed in your experiment.

At a high level, the posterior means for all experiment variations and for any metric type, including conversion metrics and numeric metrics, can be represented by a convenient formula:

\begin{aligned} PosteriorMean = Weight \cdot DataMean + \left(1 - Weight \right) \cdot PriorMean \end{aligned}

With the following definitions:

  • Data mean: The mean estimated from the data
  • Prior mean: The mean of the Bayesian prior distribution assumed for the experiment variation mean
  • Weight: A number between 0 and 1 which broadly reflects the amount of precision in our data mean.

Broadly, the posterior mean is a weighted sum of the mean of the prior distribution and the mean calculated from the data. As the experiment collects more data, the weight increases and the posterior mean is influenced relatively more by the observed data and relatively less by the prior distribution. The specific behavior differs slightly between the control variation and the treatment variations, but this general principle holds for both.

When you hover over the “Conversion rate” or “Posterior mean” heading in an experiment’s results table, you can view this formula with a description of each term.

When you hover over an actual conversion rate or posterior mean value, you can view actual numbers in the formulas instead of descriptions.
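
For illustration, here is a minimal Python sketch of this weighted sum. The data mean, prior mean, and weight values are hypothetical numbers chosen only for the example:

```python
def posterior_mean(data_mean: float, prior_mean: float, weight: float) -> float:
    """Posterior mean as a weighted blend of the data mean and the prior mean."""
    return weight * data_mean + (1 - weight) * prior_mean

# Hypothetical numbers: the variation converts at 12% in the data, the prior mean
# is 10%, and the data contributes 80% of the total precision.
print(posterior_mean(data_mean=0.12, prior_mean=0.10, weight=0.8))  # 0.116
```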

Data mean

The formula for the data mean differs between conversion metrics and numeric metrics:

  • Conversion metrics, including custom conversion binary, custom conversion count, page viewed, and clicked or tapped metrics, use the total number of conversions divided by the total number of exposures: DataMean = SampleMean = Conversions / Exposures
  • Numeric metrics use the total value divided by the total number of exposures: DataMean = SampleMean = TotalValue / Exposures

CUPED may affect the exact computation of these results. For more information, read Covariate adjustment and CUPED methodology.

Data weight

The precision weight is given by:

\begin{aligned} Weight = \frac{DataMeanPrecision} {DataMeanPrecision + PriorPrecision} \end{aligned}

This represents the proportion of the total precision due to the data mean. However, the precision is defined differently depending on the statistical model used.

There are two statistical models for estimating the posterior mean of experiment metrics:

  • Normal-normal model: This model has a normal prior and a normal likelihood, and is used for numeric metrics.
  • Beta-binomial model: This model has a beta prior distribution and a binomial likelihood, and is used for binary metrics when CUPED is not applied.

For the normal-normal model, precision is defined as the inverse of the variance, so that the precision weight is:

\begin{aligned} Weight = \frac{1 / DataMeanVariance} {1 / DataMeanVariance + 1 / PriorVariance} \end{aligned}

For the beta-binomial model, precision is defined as the number of units for the data sample and the number of pseudo-units for the beta prior distribution. You can consider the \alpha_{prior} and \beta_{prior} parameters of the beta prior distribution as, respectively, the number of converted pseudo-units and the number of non-converted pseudo-units, so that the number of pseudo-units for the prior distribution is \alpha_{prior} + \beta_{prior}. If we denote by n the number of units in the data sample, then the precision weight is given by:

\begin{aligned} Weight = \frac{n}{n + \alpha_{prior} + \beta_{prior}} \end{aligned}
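
As an illustration, the following Python sketch computes the precision weight under both definitions. The variances, sample size, and prior parameters are hypothetical values chosen only for the example:

```python
def weight_normal(data_mean_variance: float, prior_variance: float) -> float:
    """Precision weight for the normal-normal model (precision = 1 / variance)."""
    data_precision = 1 / data_mean_variance
    prior_precision = 1 / prior_variance
    return data_precision / (data_precision + prior_precision)

def weight_beta_binomial(n: int, alpha_prior: float, beta_prior: float) -> float:
    """Precision weight for the beta-binomial model (precision = counts of units
    in the data and pseudo-units in the Beta prior)."""
    return n / (n + alpha_prior + beta_prior)

# Hypothetical values chosen only for illustration.
print(weight_normal(data_mean_variance=0.0004, prior_variance=0.0016))  # 0.8
print(weight_beta_binomial(n=1000, alpha_prior=1, beta_prior=1))        # ~0.998
```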

Details of our Bayesian approach

The Bayesian approach to analysis involves two steps:

  1. Combining a subjective prior belief about the parameters of interest, usually means, with the objective data collected during the experiment to create a posterior distribution for each variation. These posterior distributions represent our current knowledge about what values those parameters are likely to take.
  2. Using that posterior distribution to compute helpful statistical measures that aid in making a decision about what action to take. For example, ship the treatment, don’t ship the treatment, and so on.

The most complicated part of the setup is creating the posterior distributions, because it requires careful parameter tuning and different treatments for different types of metrics. After we compute these distributions, we summarize them on the results page using:

  • Credible intervals that convey the spread of the posterior distribution, which represents the range of likely values for the true mean of the variation
  • Posterior means that convey the center of the posterior distribution, which represents our current best estimate of the true mean

After the posterior distribution is created, it is relatively straightforward to compute the statistics we display on the results page to help you make a decision. To learn more about these results, read Results chart data.

Below we dive into detail on how we accomplish these two steps.

Calculating posterior distributions

At LaunchDarkly, we use different statistical models for binary data and numeric data. In both cases, we use conjugate distributions, meaning that the family of the prior distribution is the same as the family of the posterior distribution:

  • For binary metrics, we start with a Beta distribution for the prior and update that into another Beta distribution for the posterior
  • For numeric metrics, we start with a Normal distribution for the prior and update that into another Normal distribution for the posterior

We give some technical details on the exact specification of the priors below, as well as some closed-form expressions for the posterior distributions once data is incorporated.

Binary data

Binary metrics are also called “occurrence” metrics in LaunchDarkly. They record either a 0 or a 1 for each context in the experiment. For more information, read Custom conversion binary metrics.

The natural approach for binary data is to use a Binomial likelihood function with a Beta prior, which results in another Beta distribution for the posterior.

Suppose that \bar{y}_v is the proportion of the N_v units in variation v that are converted. Then a total of N_v \bar{y}_v units converted, and N_v (1 - \bar{y}_v) units did not convert.

Numeric data

Although numeric data can take a variety of forms and be modeled by many different kinds of probability distributions, we can use a simplified approach that leverages the central limit theorem.

Because the quantity of interest is usually some unknown population mean which is estimated by the sample mean, we can have reasonably high confidence that the normal distribution will be a good fit for the likelihood of the sample mean as we collect more and more data:

\begin{aligned} f_{\mathrm{like}}(\bar{y}_v | \mu_v) = \mathsf{Normal}(\mu_v, \sigma^2 / N_v) \end{aligned}

To further simplify the model, we treat the variance parameter as known and simply use the natural plug-in estimate, the sample variance computed from the data. As sample sizes increase, this plug-in estimate is guaranteed to converge to the true variance.

To complete the model, we need to specify a prior distribution for \mu_v. For the control variation, we use an improper non-informative prior f_{\mathrm{prior}}(\mu_0) \propto 1. For the other variations, we use priors that shrink the results towards the control variation’s mean. We generate this prior from the empirical distribution of relative differences between variations in all experiments on our platform using metrics of the same type (numeric or conversion) and aggregation function (average or sum).

The equation for this prior is:

\begin{aligned} f_{\mathrm{prior}}(\mu_v) &= \mathsf{Normal}(a_v, w_v^2), \\ a_v &= \bar{y}_0, \\ w_v^2 &= \bar{y}_0^2 \hat{\gamma}^2 + \hat{\sigma}_0^2 / N_0 \end{aligned}

where \hat{\gamma}^2 is the variance of the distribution of observed relative differences ((\bar{y}_v - \bar{y}_0) / \bar{y}_0) across all experiments with numeric metrics on the platform. The first term, \bar{y}_0^2 \hat{\gamma}^2, scales the expected relative difference by the observed control mean. The second term, \hat{\sigma}_0^2 / N_0, accounts for the uncertainty in our estimate of the control mean. The value of \hat{\gamma}^2 is between 0.13 and 0.19, depending on the type of the metric.

Combining the likelihood and prior provides the posterior distribution of \mu_v, which represents our beliefs about \mu_v after observing the data from the experiment.

Given the normal likelihood and prior, the posterior distribution is also a normal distribution with the following parameters:

\begin{aligned} f_{\mathrm{post}}(\mu_v) &= \mathsf{Normal}(\alpha_v, \omega_v^2) , \\ \alpha_v &= \omega_v^2 \left(\frac{N_v}{\hat{\sigma}_v^2} \bar{y}_v + \frac{1}{w_v^2} a_v \right) , \\ \omega_v^2 &= \left(\frac{1}{w_v^2} + \frac{N_v}{\hat{\sigma}^2_v} \right)^{-1} \end{aligned}

The experiment results page displays the posterior distributions of each variation’s mean (f_{\mathrm{post}}(\mu_v)) in the probability charts.

We use the expected value of the posterior distribution as a point estimate for \mu_v,

\hat{\mu}_v = \mathbb{E}[f_{\mathrm{post}}(\mu_v)] = \alpha_v
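
The following Python sketch applies these formulas for a single treatment variation. All inputs, including the \hat{\gamma}^2 value of 0.15 picked from the range quoted above, are hypothetical examples rather than values from a real experiment:

```python
import math

def numeric_posterior(y_bar_v, sigma2_v, n_v, y_bar_0, sigma2_0, n_0, gamma2_hat=0.15):
    """Posterior mean and variance of a treatment variation's mean under the
    normal-normal model above. gamma2_hat is an illustrative value only."""
    # Prior centered at the control mean, with variance from the empirical
    # distribution of relative differences plus the control mean's uncertainty.
    a_v = y_bar_0
    w2_v = y_bar_0**2 * gamma2_hat + sigma2_0 / n_0

    # Conjugate normal update: precisions add, and the posterior mean is a
    # precision-weighted combination of the data mean and the prior mean.
    omega2_v = 1 / (1 / w2_v + n_v / sigma2_v)
    alpha_v = omega2_v * (n_v / sigma2_v * y_bar_v + a_v / w2_v)
    return alpha_v, omega2_v

# Hypothetical per-unit values for a treatment variation and the control.
alpha_v, omega2_v = numeric_posterior(
    y_bar_v=10.4, sigma2_v=25.0, n_v=2000,
    y_bar_0=10.0, sigma2_0=25.0, n_0=2000,
)
print(f"posterior mean = {alpha_v:.3f}, posterior sd = {math.sqrt(omega2_v):.3f}")
```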

Conversion metrics

Conversion metrics use binary data. We use a Binomial likelihood function with a Beta prior, which results in another Beta distribution for the posterior.

Suppose that \bar{y}_v is the proportion of the N_v units in variation v that are converted. Then a total of N_v \bar{y}_v units converted, and N_v (1 - \bar{y}_v) units did not convert.

To model the total number of conversions (N_v \bar{y}_v), we use a binomial distribution with proportion parameter \mu_v and size N_v as the likelihood function:

\begin{aligned} f_{\mathrm{like}}(N_v \bar{y}_v) &= \mathsf{Binomial}(N_v, \mu_v) \end{aligned}

We use a Beta distribution as the prior for \mu_v,

\begin{aligned} f_{\mathrm{prior}}(\mu_v) &= \mathsf{Beta}(a_v, b_v) \end{aligned}

The values of the prior hyperparameters a_v and b_v differ between the control variation (v = 0) and the treatment variations (v \neq 0). For the control variation, we use a uniform Beta distribution with a_0 = 1 and b_0 = 1. For the treatment variations, we use a prior similar to the one used for numeric metrics: a Beta distribution with hyperparameters a_v and b_v chosen such that its expected value and variance are:

\begin{aligned} \mathbb{E}[f_{\mathrm{prior}}(\mu_v)] &= \bar{y}_0, \\ \mathrm{Var}(f_{\mathrm{prior}}(\mu_v)) &= \bar{y}_0^2 \hat{\gamma}^2 + \frac{\bar{y}_0 (1 - \bar{y}_0)}{N_0} \end{aligned}

The value of \hat{\gamma}^2 is the variance of the empirical distribution of relative differences of experiments using a binary metric, and is currently set to \hat{\gamma}^2 \approx 0.04.
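
As a sketch of how such a prior can be constructed, the following Python function moment-matches a Beta distribution to the target expected value and variance above, using the standard Beta moment formulas. The control conversion rate and sample size are hypothetical:

```python
def beta_prior_from_moments(y_bar_0: float, n_0: int, gamma2_hat: float = 0.04):
    """Moment-match a Beta(a_v, b_v) prior to the target mean and variance above.
    gamma2_hat = 0.04 mirrors the value quoted for binary metrics."""
    mean = y_bar_0
    var = y_bar_0**2 * gamma2_hat + y_bar_0 * (1 - y_bar_0) / n_0
    # For Beta(a, b): mean = a / (a + b) and var = mean * (1 - mean) / (a + b + 1),
    # so the total number of pseudo-units a_v + b_v follows directly.
    total = mean * (1 - mean) / var - 1
    return mean * total, (1 - mean) * total

# Hypothetical control conversion rate and sample size.
a_v, b_v = beta_prior_from_moments(y_bar_0=0.10, n_0=5000)
print(a_v, b_v)  # roughly 21.4 converted and 192.9 non-converted pseudo-units
```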

The posterior distribution of \mu_v is also a Beta distribution:

\begin{aligned} f_{\mathrm{post}}(\mu_v | \bar{y}_v, N_v) &= \mathsf{Beta}(a_v + N_v \bar{y}_v, b_v + N_v (1 - \bar{y}_v)) \end{aligned}

The expected value of this distribution is our preferred point estimate of \mu_v:

\hat{\mu}_v = \mathbb{E}[f_{\mathrm{post}}(\mu_v)] = \frac{a_v + N_v \bar{y}_v}{a_v + b_v + N_v}
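
For illustration, this Python sketch performs the conjugate update and computes the posterior mean for a control variation with a uniform Beta(1, 1) prior; the conversion counts are hypothetical:

```python
def beta_posterior(a_v: float, b_v: float, conversions: int, exposures: int):
    """Conjugate update: a Beta(a_v, b_v) prior plus binomial data gives a Beta posterior."""
    a_post = a_v + conversions                # a_v + N_v * y_bar_v
    b_post = b_v + (exposures - conversions)  # b_v + N_v * (1 - y_bar_v)
    posterior_mean = a_post / (a_post + b_post)
    return a_post, b_post, posterior_mean

# Control variation with a uniform Beta(1, 1) prior and hypothetical counts.
print(beta_posterior(a_v=1, b_v=1, conversions=130, exposures=1000))
# (131, 871, 0.1307...)
```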

The experiment results table displays the value of \hat{\mu}_v in the Posterior mean column.

As with numeric metrics, we use the highest density interval for the (1-\alpha) \times 100\% credible interval of f_{\mathrm{post}}(\mu_v).

The experiment results table displays the (1-\alpha) \times 100\% credible interval of \hat{\mu}_v in the Conversion rate column.

To calculate the relative difference in means between each variation and the control variation (\%\Delta_{v,0} for all v \neq 0), we use the same method as for numeric metrics after transforming the posterior distributions of the means to normal distributions by matching the expected values and variances.

The experiment results table displays the (1-\alpha) \times 100\% credible interval of the relative difference in means between each variation and the control variation (\%\Delta_{v,0} for all v \neq 0) in the column Relative difference from Control.

Computing relative differences, probabilities, and expected loss

Next, we discuss how we compute relative differences, probability to be best, and expected loss.

Relative differences and their credible intervals

We estimate the relative difference in means between two variations. We define the relative difference in the means of variations v and w as a parameter \%\Delta_{v,w} = (\mu_v - \mu_w) / \mu_w. The relative difference in the means \%\Delta_{v,w} also has a posterior distribution. To derive the posterior distribution of \%\Delta_{v,w}, we apply the delta method to \mu_v and \mu_w,

f_{\mathrm{post}}\left(\%\Delta_{v,w}\right) \approx \mathsf{Normal}\left(\alpha_v / \alpha_w - 1, \ \frac{\alpha_v^2}{\alpha_w^2} \left( \frac{\omega^2_v}{\alpha_v^2} + \frac{\omega_w^2}{\alpha_w^2} \right) \right)

Díaz-Francés (2013) shows that the approximation we use for the ratio of means holds under reasonable assumptions; you can read more at “On the existence of a normal approximation to the distribution of the ratio of two independent normal random variables”.

As with the mean of the metric for a single variation, we use the (1-\alpha) \times 100\% highest density interval for the (1-\alpha) \times 100\% credible interval. The experiment results table displays the (1-\alpha) \times 100\% credible interval of the relative difference in means between each variation and the control variation (\%\Delta_{v,0} for all v \neq 0) in the column Relative difference from Control.
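
The following Python sketch computes this normal approximation and a central credible interval from hypothetical posterior parameters. Because a normal distribution is symmetric, its highest density interval coincides with the equal-tailed interval computed here; the 90% level is only an example setting:

```python
from statistics import NormalDist

def relative_difference_ci(alpha_v, omega2_v, alpha_w, omega2_w, cred=0.90):
    """Delta-method normal approximation to the posterior of the relative
    difference (mu_v - mu_w) / mu_w, with a central credible interval."""
    mean = alpha_v / alpha_w - 1
    var = (alpha_v**2 / alpha_w**2) * (omega2_v / alpha_v**2 + omega2_w / alpha_w**2)
    z = NormalDist().inv_cdf(0.5 + cred / 2)
    half_width = z * var**0.5
    return mean, (mean - half_width, mean + half_width)

# Hypothetical posterior parameters for a treatment (v) and the control (w).
mean, (lo, hi) = relative_difference_ci(
    alpha_v=0.116, omega2_v=1e-4, alpha_w=0.105, omega2_w=1e-4
)
print(f"relative difference = {mean:.3f}, 90% credible interval = ({lo:.3f}, {hi:.3f})")
```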

Probability to be best

For both numeric and conversion metrics, LaunchDarkly calculates the probability to be best for each variation.

The probability to be best is the probability that the mean value per unit of a variation is the largest of all the variations if the success direction is positive. If the success direction is negative, then the probability to be best is the probability that the mean value per unit of a variation is the smallest of all the variations. The success direction is positive when the metric’s success criterion is “Higher is better,” and negative when it is “Lower is better.” LaunchDarkly calculates the probability to be best for each variation by taking samples from the posterior distributions of the \mu_v's. The proportion of samples in which a variation is the largest, or smallest if the success direction is negative, is the probability to be best for that variation.

LaunchDarkly defines the probability to beat baseline only for treatment variations. Probability to beat baseline is the probability that the mean value per unit for a variation is larger than that of the control variation, if the success direction is positive. LaunchDarkly calculates it in a similar fashion to the probability to be best.

In the case where there are only two variations (v and w) and the success direction of the metric is positive, the probability to be best for variation v is the probability that the difference in means \Delta_{v,w} = \mu_v - \mu_w is greater than zero.
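
As an illustration of the sampling approach, this Python sketch draws from hypothetical Beta posteriors for three variations of a conversion metric (higher is better) and estimates both the probability to be best and the probability to beat baseline; the variation names and posterior parameters are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 100_000

# Hypothetical Beta posteriors for a conversion metric where higher is better.
posteriors = {
    "control":     rng.beta(131, 871, n_samples),
    "treatment_1": rng.beta(151, 851, n_samples),
    "treatment_2": rng.beta(141, 861, n_samples),
}

samples = np.column_stack(list(posteriors.values()))
winners = samples.argmax(axis=1)  # index of the largest mean in each draw

# Probability to be best: share of draws in which each variation has the largest mean.
for i, name in enumerate(posteriors):
    print(f"{name}: P(best) = {(winners == i).mean():.3f}")

# Probability to beat baseline, defined only for the treatment variations.
for name in ("treatment_1", "treatment_2"):
    p_beat = (posteriors[name] > posteriors["control"]).mean()
    print(f"{name}: P(beat control) = {p_beat:.3f}")
```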

Expected loss

Ideally, shipping a winning variation would carry no risk. In reality, the probability for a treatment variation to beat the control variation is rarely 100%. This means there’s always some chance that a “winning” variation might not be an improvement over the control variation. To manage this risk, we measure it using a quantity called “expected loss.”

Expected loss represents the average potential downside of shipping a variation, quantifying how much one could expect to lose if it underperforms relative to the control variation. LaunchDarkly calculates this by integrating probability-weighted losses across all scenarios in which a given variation performs worse than control, with loss defined as the absolute difference between them.

A lower expected loss indicates lower risk, making it an important factor in choosing which variation to launch. The treatment variation with the highest probability to be best among those with a significant probability to beat control is generally considered the winner, but evaluating its expected loss clarifies the associated risk of implementing it.

For example, if you’re measuring conversion rate and have a winning variation with a 96% probability to beat control and an expected loss of 0.5%, this means there’s a strong likelihood of 96% that the winning variation will outperform the control variation. However, the 0.5% expected loss indicates that, on average, you’d expect a small 0.5% decrease in conversion rate if the winning variation were to underperform.

Expected loss does not display for percentile metrics.
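
For illustration, this Python sketch estimates expected loss by Monte Carlo sampling, averaging the shortfall versus control over the draws in which the treatment underperforms (and zero loss otherwise), which mirrors the definition above. The posterior parameters are hypothetical, and LaunchDarkly's exact computation may differ in implementation details:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 100_000

# Hypothetical Beta posteriors for a conversion metric where higher is better.
control = rng.beta(131, 871, n_samples)
treatment = rng.beta(151, 851, n_samples)

# Expected loss: probability-weighted shortfall versus control, counting only
# the draws in which the treatment underperforms (zero loss otherwise).
expected_loss = np.maximum(control - treatment, 0).mean()
prob_beat_control = (treatment > control).mean()

print(f"P(beat control) = {prob_beat_control:.3f}, expected loss = {expected_loss:.5f}")
```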

Conclusion

This guide explained the statistical methods LaunchDarkly applies to Bayesian experiments. To learn about frequentist statistical methods in LaunchDarkly, read Experimentation statistical methodology for frequentist experiments.