Statistical methodology for Bayesian experiments

This guide includes advanced concepts

This section includes an explanation of advanced statistical concepts. We provide them for informational purposes, but you do not need to understand these concepts to use Experimentation.

Overview

This guide explains the statistical methodology LaunchDarkly uses to calculate Bayesian experiment variation means, and how you can use these formulas to validate your results.

For a high-level overview of Bayesian and frequentist statistics, read Bayesian versus frequentist statistics.

Core formulas

The core formulas include the posterior mean, the data mean, and the data weight. We describe these in detail below.

Posterior mean

In the Bayesian approach, the main quantity we report is the mean of the posterior distribution calculated by updating the prior distribution with data observed in your experiment.

At a high level, the posterior means for all experiment variations and for any metric type, including conversion metrics and numeric metrics, can be represented by a convenient formula:

\begin{aligned} PosteriorMean = Weight \cdot DataMean + \left(1 - Weight \right) \cdot PriorMean \end{aligned}

With the following definitions:

  • Data mean: The mean estimated from the data
  • Prior mean: The mean of the Bayesian prior distribution assumed for the experiment variation mean
  • Weight: A number between 0 and 1 which broadly reflects the amount of precision in our data mean.

Broadly, the posterior mean is a weighted sum of the mean of the prior distribution and the mean calculated from the data. As the experiment collects more data, the weight increases and the posterior mean is influenced relatively more by the observed data and relatively less by the prior distribution. The specific behavior differs slightly between the control variation and the treatment variations, but this general principle holds for both.

When you hover over the “Conversion rate” or “Posterior mean” heading in an experiment’s results table, you can view this formula with a description of each term.

When you hover over an actual conversion rate or posterior mean value, you can view actual numbers in the formulas instead of descriptions.
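
For illustration, here is a minimal Python sketch of this weighted sum. The data mean, prior mean, and weight values are hypothetical numbers chosen only for the example:

```python
def posterior_mean(data_mean: float, prior_mean: float, weight: float) -> float:
    """Posterior mean as a weighted blend of the data mean and the prior mean."""
    return weight * data_mean + (1 - weight) * prior_mean

# Hypothetical numbers: the variation converts at 12% in the data, the prior mean
# is 10%, and the data contributes 80% of the total precision.
print(posterior_mean(data_mean=0.12, prior_mean=0.10, weight=0.8))  # 0.116
```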

Data mean

The formula for the data mean differs between conversion metrics and numeric metrics:

  • Conversion metrics, including custom conversion binary, custom conversion count, page viewed, and clicked or tapped metrics, use the total number of conversions divided by the total number of exposures: DataMean = SampleMean = Conversions / Exposures
  • Numeric metrics use the total value divided by the total number of exposures: DataMean = SampleMean = TotalValue / Exposures

CUPED may affect the exact computation of these results. For more information, read Covariate adjustment and CUPED methodology.

Data weight

The precision weight is given by:

\begin{aligned} Weight = \frac{DataMeanPrecision} {DataMeanPrecision + PriorPrecision} \end{aligned}

This represents the proportion of the total precision due to the data mean. However, the precision is defined differently depending on the statistical model used.

There are two statistical models for estimating the posterior mean of experiment metrics:

  • Normal-normal model: This model has a normal prior and a normal likelihood, and is used for numeric metrics.
  • Beta-binomial model: This model has a beta prior distribution and a binomial likelihood, and is used for binary metrics when CUPED is not applied.

For the normal-normal model, precision is defined as the inverse of the variance, so that the precision weight is:

\begin{aligned} Weight = \frac{1 / DataMeanVariance} {1 / DataMeanVariance + 1 / PriorVariance} \end{aligned}

For the beta-binomial model, precision is defined as the number of units for the data sample and the number of pseudo-units for the beta prior distribution. You can consider the \alpha_{prior} and \beta_{prior} parameters of the beta prior distribution as, respectively, the number of converted pseudo-units and the number of non-converted pseudo-units, so that the number of pseudo-units for the prior distribution is \alpha_{prior} + \beta_{prior}. If we denote by n the number of units in the data sample, then the precision weight is given by:

\begin{aligned} Weight = \frac{n}{n + \alpha_{prior} + \beta_{prior}} \end{aligned}
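
As an illustration, the following Python sketch computes the precision weight under both definitions. The variances, sample size, and prior parameters are hypothetical values chosen only for the example:

```python
def weight_normal(data_mean_variance: float, prior_variance: float) -> float:
    """Precision weight for the normal-normal model (precision = 1 / variance)."""
    data_precision = 1 / data_mean_variance
    prior_precision = 1 / prior_variance
    return data_precision / (data_precision + prior_precision)

def weight_beta_binomial(n: int, alpha_prior: float, beta_prior: float) -> float:
    """Precision weight for the beta-binomial model (precision = counts of units
    in the data and pseudo-units in the Beta prior)."""
    return n / (n + alpha_prior + beta_prior)

# Hypothetical values chosen only for illustration.
print(weight_normal(data_mean_variance=0.0004, prior_variance=0.0016))  # 0.8
print(weight_beta_binomial(n=1000, alpha_prior=1, beta_prior=1))        # ~0.998
```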

Details of our Bayesian approach

The Bayesian approach to analysis involves two steps:

  1. Combining a subjective prior belief about the parameters of interest, usually means, with the objective data collected during the experiment to create a posterior distribution for each variation. These posterior distributions represent our current knowledge about what values those parameters are likely to take.
  2. Using that posterior distribution to compute helpful statistical measures that aid in making a decision about what action to take. For example, ship the treatment, don’t ship the treatment, and so on.

The most complicated part of the setup is creating the posterior distributions, because it requires careful parameter tuning and different treatments for different types of metrics. After we compute these distributions, we summarize them on the results page using:

  • Credible intervals that convey the spread of the posterior distribution, which represents the range of likely values for the true mean of the variation
  • Posterior means that convey the center of the posterior distribution, which represents our current best estimate of the true mean

After the posterior distribution is created, it is relatively straightforward to compute the statistics we display on the results page to help you make a decision. To learn more about these results, read Results chart data.

Below we dive into detail on how we accomplish these two steps.

Calculating posterior distributions

At LaunchDarkly, we use different statistical models for binary data and numeric data. In both cases, we use conjugate distributions, meaning that the family of the prior distribution is the same as the family of the posterior distribution:

  • For binary metrics, we start with a Beta distribution for the prior and update that into another Beta distribution for the posterior
  • For numeric metrics, we start with a Normal distribution for the prior and update that into another Normal distribution for the posterior

We give some technical details on the exact specification of the priors below, as well as some closed-form expressions for the posterior distributions once data is incorporated.

Binary data

Binary metrics are also called “occurrence” metrics in LaunchDarkly. They record either a 0 or a 1 for each context in the experiment. For more information, read Custom conversion binary metrics.

The natural approach for binary data is to use a Binomial likelihood function with a Beta prior, which results in another Beta distribution for the posterior.

Suppose that \bar{y}_v is the proportion of the N_v units in variation v that are converted. Then a total of N_v \bar{y}_v units converted, and N_v (1 - \bar{y}_v) units did not convert.

Numeric data

Although numeric data can take a variety of forms and be modeled by many different kinds of probability distributions, we can use a simplified approach that leverages the central limit theorem.

Because the quantity of interest is usually some unknown population mean which is estimated by the sample mean, we can have reasonably high confidence that the normal distribution will be a good fit for the likelihood of the sample mean as we collect more and more data:

\begin{aligned} f_{\mathrm{like}}(\bar{y}_v | \mu_v) = \mathsf{Normal}(\mu_v, \sigma^2 / N_v) \end{aligned}

To further simplify the model, we treat the variance parameter as known and simply use the natural plug-in estimate, the sample variance computed from the data. As sample sizes increase, this plug-in estimate is guaranteed to converge to the true variance.

To complete the model, we need to specify a prior distribution for \mu_v. For the control variation, we use an improper non-informative prior f_{\mathrm{prior}}(\mu_0) \propto 1. For the other variations, we use priors that shrink the results towards the control variation’s mean. We generate this prior from the empirical distribution of relative differences between variations in all experiments on our platform using metrics of the same type (numeric or conversion) and aggregation function (average or sum).

The equation for this prior is:

\begin{aligned} f_{\mathrm{prior}}(\mu_v) &= \mathsf{Normal}(a_v, w_v^2), \\ a_v &= \bar{y}_0, \\ w_v^2 &= \bar{y}_0^2 \hat{\gamma}^2 + \hat{\sigma}_0^2 / N_0 \end{aligned}

where \hat{\gamma}^2 is the variance of the distribution of observed relative differences ((\bar{y}_v - \bar{y}_0) / \bar{y}_0) across all experiments with numeric metrics on the platform. The first term, \bar{y}_0^2 \hat{\gamma}^2, scales the expected relative difference by the observed control mean. The second term, \hat{\sigma}_0^2 / N_0, accounts for the uncertainty in our estimate of the control mean. The value of \hat{\gamma}^2 is between 0.13 and 0.19, depending on the type of the metric.

Combining the likelihood and prior provides the posterior distribution of \mu_v, which represents our beliefs about \mu_v after observing the data from the experiment.

Given the normal likelihood and prior, the posterior distribution is also a normal distribution with the following parameters:

\begin{aligned} f_{\mathrm{post}}(\mu_v) &= \mathsf{Normal}(\alpha_v, \omega_v^2) , \\ \alpha_v &= \omega_v^2 \left(\frac{N_v}{\hat{\sigma}_v^2} \bar{y}_v + \frac{1}{w_v^2} a_v \right) , \\ \omega_v^2 &= \left(\frac{1}{w_v^2} + \frac{N_v}{\hat{\sigma}^2_v} \right)^{-1} \end{aligned}

The experiment results page displays the posterior distributions of each variation’s mean (f_{\mathrm{post}}(\mu_v)) in the probability charts.

We use the expected value of the posterior distribution as a point estimate for \mu_v,

\hat{\mu}_v = \mathbb{E}[f_{\mathrm{post}}(\mu_v)] = \alpha_v
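
The following Python sketch applies these formulas for a single treatment variation. All inputs, including the \hat{\gamma}^2 value of 0.15 picked from the range quoted above, are hypothetical examples rather than values from a real experiment:

```python
import math

def numeric_posterior(y_bar_v, sigma2_v, n_v, y_bar_0, sigma2_0, n_0, gamma2_hat=0.15):
    """Posterior mean and variance of a treatment variation's mean under the
    normal-normal model above. gamma2_hat is an illustrative value only."""
    # Prior centered at the control mean, with variance from the empirical
    # distribution of relative differences plus the control mean's uncertainty.
    a_v = y_bar_0
    w2_v = y_bar_0**2 * gamma2_hat + sigma2_0 / n_0

    # Conjugate normal update: precisions add, and the posterior mean is a
    # precision-weighted combination of the data mean and the prior mean.
    omega2_v = 1 / (1 / w2_v + n_v / sigma2_v)
    alpha_v = omega2_v * (n_v / sigma2_v * y_bar_v + a_v / w2_v)
    return alpha_v, omega2_v

# Hypothetical per-unit values for a treatment variation and the control.
alpha_v, omega2_v = numeric_posterior(
    y_bar_v=10.4, sigma2_v=25.0, n_v=2000,
    y_bar_0=10.0, sigma2_0=25.0, n_0=2000,
)
print(f"posterior mean = {alpha_v:.3f}, posterior sd = {math.sqrt(omega2_v):.3f}")
```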

Conversion metrics

Conversion metrics use binary data. We use a Binomial likelihood function with a Beta prior, which results in another Beta distribution for the posterior.

Suppose that \bar{y}_v is the proportion of the N_v units in variation v that are converted. Then a total of N_v \bar{y}_v units converted, and N_v (1 - \bar{y}_v) units did not convert.

To model the total number of conversions (N_v \bar{y}_v), we use a binomial distribution with proportion parameter \mu_v and size N_v as the likelihood function:

\begin{aligned} f_{\mathrm{like}}(N_v \bar{y}_v) &= \mathsf{Binomial}(N_v, \mu_v) \end{aligned}

We use a Beta distribution as the prior for \mu_v,

\begin{aligned} f_{\mathrm{prior}}(\mu_v) &= \mathsf{Beta}(a_v, b_v) \end{aligned}

The values of the prior hyperparameters a_v and b_v differ between the control variation (v = 0) and the treatment variations (v \neq 0). For the control variation, we use a uniform Beta distribution with a_0 = 1 and b_0 = 1. For the treatment variations, we use a prior similar to the one used for numeric metrics: a Beta distribution with hyperparameters a_v and b_v chosen such that its expected value and variance are:

\begin{aligned} \mathbb{E}[f_{\mathrm{prior}}(\mu_v)] &= \bar{y}_0, \\ \mathrm{Var}(f_{\mathrm{prior}}(\mu_v)) &= \bar{y}_0^2 \hat{\gamma}^2 + \frac{\bar{y}_0 (1 - \bar{y}_0)}{N_0} \end{aligned}

The value of \hat{\gamma}^2 is the variance of the empirical distribution of relative differences of experiments using a binary metric, and is currently set to \hat{\gamma}^2 \approx 0.04.
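
As a sketch of how such a prior can be constructed, the following Python function moment-matches a Beta distribution to the target expected value and variance above, using the standard Beta moment formulas. The control conversion rate and sample size are hypothetical:

```python
def beta_prior_from_moments(y_bar_0: float, n_0: int, gamma2_hat: float = 0.04):
    """Moment-match a Beta(a_v, b_v) prior to the target mean and variance above.
    gamma2_hat = 0.04 mirrors the value quoted for binary metrics."""
    mean = y_bar_0
    var = y_bar_0**2 * gamma2_hat + y_bar_0 * (1 - y_bar_0) / n_0
    # For Beta(a, b): mean = a / (a + b) and var = mean * (1 - mean) / (a + b + 1),
    # so the total number of pseudo-units a_v + b_v follows directly.
    total = mean * (1 - mean) / var - 1
    return mean * total, (1 - mean) * total

# Hypothetical control conversion rate and sample size.
a_v, b_v = beta_prior_from_moments(y_bar_0=0.10, n_0=5000)
print(a_v, b_v)  # roughly 21.4 converted and 192.9 non-converted pseudo-units
```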

The posterior distribution of \mu_v is also a Beta distribution:

\begin{aligned} f_{\mathrm{post}}(\mu_v | \bar{y}_v, N_v) &= \mathsf{Beta}(a_v + N_v \bar{y}_v, b_v + N_v (1 - \bar{y}_v)) \end{aligned}

The expected value of this distribution is our preferred point estimate of \mu_v:

\hat{\mu}_v = \mathbb{E}[f_{\mathrm{post}}(\mu_v)] = \frac{a_v + N_v \bar{y}_v}{a_v + b_v + N_v}
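
For illustration, this Python sketch performs the conjugate update and computes the posterior mean for a control variation with a uniform Beta(1, 1) prior; the conversion counts are hypothetical:

```python
def beta_posterior(a_v: float, b_v: float, conversions: int, exposures: int):
    """Conjugate update: a Beta(a_v, b_v) prior plus binomial data gives a Beta posterior."""
    a_post = a_v + conversions                # a_v + N_v * y_bar_v
    b_post = b_v + (exposures - conversions)  # b_v + N_v * (1 - y_bar_v)
    posterior_mean = a_post / (a_post + b_post)
    return a_post, b_post, posterior_mean

# Control variation with a uniform Beta(1, 1) prior and hypothetical counts.
print(beta_posterior(a_v=1, b_v=1, conversions=130, exposures=1000))
# (131, 871, 0.1307...)
```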

The experiment results table displays the value of \hat{\mu}_v in the Posterior mean column.

As with numeric metrics, we use the highest density interval for the (1-\alpha) \times 100\% credible interval of f_{\mathrm{post}}(\mu_v).

The experiment results table displays the (1-\alpha) \times 100\% credible interval of \hat{\mu}_v in the Conversion rate column.

To calculate the relative difference in means between each variation and the control variation (\%\Delta_{v,0} for all v \neq 0), we use the same method as for numeric metrics after transforming the posterior distributions of the means to normal distributions by matching the expected values and variances.

The experiment results table displays the (1-\alpha) \times 100\% credible interval of the relative difference in means between each variation and the control variation (\%\Delta_{v,0} for all v \neq 0) in the column Relative difference from Control.

Computing relative differences, probabilities, and expected loss

Next, we discuss how we compute relative differences, probability to be best, and expected loss.

Relative differences and their credible intervals

We estimate the relative difference in means between two variations. We define the relative difference in the means of variations v and w as a parameter \%\Delta_{v,w} = (\mu_v - \mu_w) / \mu_w. The relative difference in the means \%\Delta_{v,w} also has a posterior distribution. To derive the posterior distribution of \%\Delta_{v,w}, we apply the delta method to \mu_v and \mu_w,

f_{\mathrm{post}}\left(\%\Delta_{v,w}\right) \approx \mathsf{Normal}\left(\alpha_v / \alpha_w - 1, \ \frac{\alpha_v^2}{\alpha_w^2} \left( \frac{\omega^2_v}{\alpha_v^2} + \frac{\omega_w^2}{\alpha_w^2} \right) \right)

Díaz-Francés (2013) shows that the approximation we use for the ratio of means holds under reasonable assumptions; you can read more at “On the existence of a normal approximation to the distribution of the ratio of two independent normal random variables”.

As with the mean of the metric for a single variation, we use the (1-\alpha) \times 100\% highest density interval for the (1-\alpha) \times 100\% credible interval. The experiment results table displays the (1-\alpha) \times 100\% credible interval of the relative difference in means between each variation and the control variation (\%\Delta_{v,0} for all v \neq 0) in the column Relative difference from Control.
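
The following Python sketch computes this normal approximation and a central credible interval from hypothetical posterior parameters. Because a normal distribution is symmetric, its highest density interval coincides with the equal-tailed interval computed here; the 90% level is only an example setting:

```python
from statistics import NormalDist

def relative_difference_ci(alpha_v, omega2_v, alpha_w, omega2_w, cred=0.90):
    """Delta-method normal approximation to the posterior of the relative
    difference (mu_v - mu_w) / mu_w, with a central credible interval."""
    mean = alpha_v / alpha_w - 1
    var = (alpha_v**2 / alpha_w**2) * (omega2_v / alpha_v**2 + omega2_w / alpha_w**2)
    z = NormalDist().inv_cdf(0.5 + cred / 2)
    half_width = z * var**0.5
    return mean, (mean - half_width, mean + half_width)

# Hypothetical posterior parameters for a treatment (v) and the control (w).
mean, (lo, hi) = relative_difference_ci(
    alpha_v=0.116, omega2_v=1e-4, alpha_w=0.105, omega2_w=1e-4
)
print(f"relative difference = {mean:.3f}, 90% credible interval = ({lo:.3f}, {hi:.3f})")
```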

Probability to be best

For both numeric and conversion metrics, LaunchDarkly calculates the probability to be best for each variation.

The probability to be best is the probability that the mean value per unit of a variation is the largest of all the variations if the success direction is positive. If the success direction is negative, then the probability to be best is the probability that the mean value per unit of a variation is the smallest of all the variations. The success direction is positive when the metric’s success criterion is “Higher is better,” and negative when it is “Lower is better.” LaunchDarkly calculates the probability to be best for each variation by taking samples from the posterior distributions of the \mu_v's. The proportion of samples in which a variation is the largest, or smallest if the success direction is negative, is the probability to be best for that variation.

LaunchDarkly defines the probability to beat baseline only for treatment variations. Probability to beat baseline is the probability that the mean value per unit for a variation is larger than that of the control variation, if the success direction is positive. LaunchDarkly calculates it in a similar fashion to the probability to be best.

In the case where there are only two variations (v and w) and the success direction of the metric is positive, the probability to be best for variation v is the probability that the difference in means \Delta_{v,w} = \mu_v - \mu_w is greater than zero.
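
As an illustration of the sampling approach, this Python sketch draws from hypothetical Beta posteriors for three variations of a conversion metric (higher is better) and estimates both the probability to be best and the probability to beat baseline; the variation names and posterior parameters are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 100_000

# Hypothetical Beta posteriors for a conversion metric where higher is better.
posteriors = {
    "control":     rng.beta(131, 871, n_samples),
    "treatment_1": rng.beta(151, 851, n_samples),
    "treatment_2": rng.beta(141, 861, n_samples),
}

samples = np.column_stack(list(posteriors.values()))
winners = samples.argmax(axis=1)  # index of the largest mean in each draw

# Probability to be best: share of draws in which each variation has the largest mean.
for i, name in enumerate(posteriors):
    print(f"{name}: P(best) = {(winners == i).mean():.3f}")

# Probability to beat baseline, defined only for the treatment variations.
for name in ("treatment_1", "treatment_2"):
    p_beat = (posteriors[name] > posteriors["control"]).mean()
    print(f"{name}: P(beat control) = {p_beat:.3f}")
```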

Expected loss

Ideally, shipping a winning variation would carry no risk. In reality, the probability for a treatment variation to beat the control variation is rarely 100%. This means there’s always some chance that a “winning” variation might not be an improvement over the control variation. To manage this risk, we measure it using a quantity called “expected loss.”

Expected loss represents the average potential downside of shipping a variation, quantifying how much one could expect to lose if it underperforms relative to the control variation. LaunchDarkly calculates this by integrating probability-weighted losses across all scenarios in which a given variation performs worse than control, with loss defined as the absolute difference between them.

A lower expected loss indicates lower risk, making it an important factor in choosing which variation to launch. The treatment variation with the highest probability to be best among those with a significant probability to beat control is generally considered the winner, but evaluating its expected loss clarifies the associated risk of implementing it.

For example, if you’re measuring conversion rate and have a winning variation with a 96% probability to beat control and an expected loss of 0.5%, this means there’s a strong likelihood of 96% that the winning variation will outperform the control variation. However, the 0.5% expected loss indicates that, on average, you’d expect a small 0.5% decrease in conversion rate if the winning variation were to underperform.

Expected loss does not display for percentile metrics.
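
For illustration, this Python sketch estimates expected loss by Monte Carlo sampling, averaging the shortfall versus control over the draws in which the treatment underperforms (and zero loss otherwise), which mirrors the definition above. The posterior parameters are hypothetical, and LaunchDarkly's exact computation may differ in implementation details:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 100_000

# Hypothetical Beta posteriors for a conversion metric where higher is better.
control = rng.beta(131, 871, n_samples)
treatment = rng.beta(151, 851, n_samples)

# Expected loss: probability-weighted shortfall versus control, counting only
# the draws in which the treatment underperforms (zero loss otherwise).
expected_loss = np.maximum(control - treatment, 0).mean()
prob_beat_control = (treatment > control).mean()

print(f"P(beat control) = {prob_beat_control:.3f}, expected loss = {expected_loss:.5f}")
```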

Conclusion

This guide explained the statistical methods LaunchDarkly applies to Bayesian experiments. To learn about frequentist statistical methods in LaunchDarkly, read Experimentation statistical methodology for frequentist experiments.