Experimentation statistical methodology for Bayesian experiments

This guide includes advanced concepts

This section includes an explanation of advanced statistical concepts. We provide them for informational purposes, but you do not need to understand these concepts to use Experimentation.

Overview

This guide explains the statistical methods LaunchDarkly applies to Bayesian experiments in its Experimentation platform.

For a high-level overview of frequentist and Bayesian statistics, read Bayesian versus frequentist statistics.

Concepts

An experiment comprises two or more variations, one or more metrics, a randomization unit, and the units assigned to those variations in the experiment. This section defines the mathematical notation which we will use in the remainder of the document.

The Experimentation-related terms and their notations for the purpose of this document include:

  • Variations: An experiment has $V$ variations indexed $v = 0, 1, \ldots, V-1$. We will refer to the variation $v = 0$ as the control variation.
  • Randomization units and units: The type of the experiment unit is called the randomization unit. Examples of randomization units include user, user-time, organization, and request. A unit is a specific instance of a randomization unit that you assign to a variation in the experiment. In LaunchDarkly, the randomization unit is a context kind, and a unit is a context key. At the time of an analysis, there are $N_v$ units observed for variation $v$, which are indexed $i = 1, \ldots, N_v$.
  • Metrics: An experiment can have one primary metric and several secondary metrics as described in the Metrics topic. The methods described below apply to both primary and secondary metrics. Let $\mathbf{y}_v$ be a vector of length $N_v$ representing the metric values for variation $v$, and $y_{v,i}$ be the observed value of the metric for unit $i$ assigned to variation $v$ in the experiment.

In LaunchDarkly experiments, a metric’s unit of analysis must be the same as the unit of randomization. This means that if your experiment has “user” as the unit of randomization then any metric must also be a user-level metric. Because units in an experiment can be associated with multiple events, all events for a user are aggregated into unit-level metrics as described in the section Average and sum metrics.

Objective of an experiment

Our methods are designed around the belief that the primary objective of an experiment is to make a decision between variations. The experiment results inform that decision by providing estimates of the causal effects on the metrics of interest for each variation.

LaunchDarkly’s Experimentation platform offers Bayesian inference as an option for the reasons described in the guide Bayesian versus frequentist statistics. In Bayesian statistics, the decision process is separated into inference and decision steps. Our first step is inference, where we combine our prior beliefs with the available data to estimate the unknown parameters we will use to make our decision. We will represent our beliefs about these parameters in the form of a probability distribution, called the posterior distribution.

The second step is to make a decision. Because Bayesian estimates are probability distributions, the experimenter can interpret these estimates as probabilities and incorporate them into their decision process.

In LaunchDarkly experiments, the experimenter wants to learn the average value per unit of the metric conditional on the variation in order to make their decision. While we observe the average value in the experiment samples exposed to a variation, we do not know what the average value of that metric would be if a variation were applied to the entire target population. Let $\mu_v$ refer to the unknown mean value per unit of the metric of interest for variation $v$. Our statistical methods will estimate a posterior distribution for $\mu_v$ for each variation $v$.

We summarize the posterior distribution of $\mu_v$ with the following statistics:

  • 90% credible interval: a lower and upper value that have a 90% probability of containing the true value of $\mu_v$
  • Posterior mean: a point estimate of $\mu_v$

Because the primary purpose of an experiment is for you to decide which variation to launch, we estimate comparisons between variations:

  • Probability to beat control: For each treatment variation $v$, the probability that $\mu_v$ is larger or smaller than $\mu_0$, depending on the metric’s success criterion.
  • Probability to be best: For each variation $v$, the probability that $\mu_v$ is larger or smaller than the $\mu_w$ of all other variations $w$, depending on the metric’s success criterion.
  • Relative difference from control: For each treatment variation $v$, we calculate a point estimate and a 90% credible interval for the relative difference from control, expressed as $(\mu_v - \mu_0) / \mu_0$.

How LaunchDarkly calculates the posterior distribution of $\mu_v$ depends on whether the metric is a numeric metric or a conversion metric. We discuss the estimation procedure for each metric type separately in the following sections.

Numeric metrics

Numeric metrics have numeric values associated with their events so they can take any numeric value. Examples of numeric metrics include page load time, efficacy of various search algorithms, and number of items in a shopping cart at checkout. Numeric metrics contrast with conversion metrics which only track whether or not an event occurred. You can read more about creating these metrics in Numeric metrics.

In our statistical methods, numeric metrics are treated as unbounded continuous random variables. With numeric metrics, the shape of the data-generating distribution for the unit-level metric values $y_{v,i}$ is unknown. However, because we are interested in estimating the population mean $\mu_v$, we can simplify our analysis by appealing to the Central Limit Theorem. Under some regularity conditions, as $N_v \to \infty$, the sample mean $\bar{y}_v = (\sum_{i = 1}^{N_v} y_{v,i}) / N_v$ is approximately normally distributed with location $\mu_v$ and scale $\sigma_v / \sqrt{N_v}$.

For numeric metrics, we use the following likelihood function for the sample mean of the observed data:

$$\begin{aligned} f_{\mathrm{like}}(\bar{y}_v | \mu_v) = \mathsf{Normal}(\mu_v, \sigma_v^2 / N_v) \end{aligned}$$

For convenience, and because $\sigma_v$ is not the primary goal of our inference, we treat $\sigma_v$ as known and equal to an estimate of the standard deviation, $\hat{\sigma}_v$, calculated from the sample. Because we use an estimated value for $\sigma_v$ rather than estimating it in the model, our method is an empirical Bayesian method. This is the case with most of our statistical methods, as we are willing to trade some methodological purity for practicality.

To complete the model, we need to specify a prior distribution for $\mu_v$. For the control variation, we use an improper non-informative prior $f_{\mathrm{prior}}(\mu_0) \propto 1$. For the other variations, we use priors that shrink the results towards the control variation’s mean. We generate this prior from the empirical distribution of relative differences between variations in all experiments on our platform using metrics of the same type (numeric or conversion) and aggregation function (average or sum).

The equation for this prior is:

$$\begin{aligned} f_{\mathrm{prior}}(\mu_v) &= \mathsf{Normal}(a_v, w_v^2), \\ a_v &= \bar{y}_0, \\ w_v^2 &= \bar{y}_0^2 \hat{\gamma}^2 + \hat{\sigma}_0^2 / N_0, \end{aligned}$$

where $\hat{\gamma}^2$ is the variance of the distribution of observed relative differences, $(\bar{y}_v - \bar{y}_0) / \bar{y}_0$, across all experiments with numeric metrics on the platform. The first term, $\bar{y}_0^2 \hat{\gamma}^2$, scales the variance of relative differences by the square of the observed control mean, which puts it on the scale of the metric. The second term, $\hat{\sigma}_0^2 / N_0$, accounts for the uncertainty in our estimate of the control mean. The value of $\hat{\gamma}^2$ is between 0.13 and 0.19, conditional on the type of the metric.

Combining the likelihood and prior provides the posterior distribution of $\mu_v$, which represents our beliefs about $\mu_v$ after observing the data from the experiment.

Given the normal likelihood and prior, the posterior distribution is also a normal distribution with the following parameters:

$$\begin{aligned} f_{\mathrm{post}}(\mu_v) &= \mathsf{Normal}(\alpha_v, \omega_v^2) , \\ \alpha_v &= \omega_v^2 \left(\frac{N_v}{\hat{\sigma}_v^2} \bar{y}_v + \frac{1}{w_v^2} a_v \right) , \\ \omega_v^2 &= \left(\frac{1}{w_v^2} + \frac{N_v}{\hat{\sigma}^2_v} \right)^{-1} \end{aligned}$$
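The prior and posterior formulas above can be sketched in a few lines of Python. This is a minimal illustration rather than LaunchDarkly's implementation: the function names, the `NormalPosterior` container, and the summary statistics are all hypothetical, and `gamma_sq=0.15` is just one value inside the 0.13 to 0.19 range quoted above.

```python
from dataclasses import dataclass


@dataclass
class NormalPosterior:
    mean: float      # alpha_v
    variance: float  # omega_v^2


def treatment_prior(control_mean, control_var, control_n, gamma_sq):
    """Prior Normal(a_v, w_v^2) for a treatment, centered on the control sample mean."""
    a_v = control_mean
    w_sq = control_mean ** 2 * gamma_sq + control_var / control_n
    return a_v, w_sq


def normal_posterior(sample_mean, sample_var, n, prior_mean, prior_var):
    """Conjugate normal update, treating the sample variance as known."""
    data_precision = n / sample_var      # N_v / sigma_hat_v^2
    prior_precision = 1.0 / prior_var    # 1 / w_v^2
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (data_precision * sample_mean + prior_precision * prior_mean)
    return NormalPosterior(post_mean, post_var)


# Hypothetical summary statistics, for illustration only.
a_v, w_sq = treatment_prior(control_mean=10.0, control_var=25.0, control_n=1000, gamma_sq=0.15)
post = normal_posterior(sample_mean=10.5, sample_var=25.0, n=1000, prior_mean=a_v, prior_var=w_sq)
print(post)
```

The posterior mean lands between the prior mean and the treatment's sample mean, and the posterior variance is smaller than the sampling variance alone. For the control variation, the flat prior corresponds to a prior precision of zero, so the posterior reduces to a normal distribution centered on the control sample mean with variance $\hat{\sigma}_0^2 / N_0$.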

The experiment results page displays the posterior distribution of each variation’s mean, $f_{\mathrm{post}}(\mu_v)$, in the probability charts.

We use the expected value of the posterior distribution as a point estimate for $\mu_v$:

$$\hat{\mu}_v = \mathbb{E}[f_{\mathrm{post}}(\mu_v)] = \alpha_v$$

The experiment results table displays the value of $\hat{\mu}_v$ in the Posterior mean column. We use a 90% credible interval for $\mu_v$ to provide a range of plausible values. There are multiple valid methods to calculate credible intervals; we use the highest density interval (HDI), which is the shortest interval that contains 90% of the probability mass of the posterior distribution.
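For a normal posterior, the density is symmetric and has a single peak, so the highest density interval coincides with the central (equal-tailed) interval. The sketch below relies on that property and uses Python's standard `statistics.NormalDist`. The function name and input values are hypothetical.

```python
from statistics import NormalDist


def normal_hdi(mean, variance, mass=0.90):
    """HDI of a normal posterior; by symmetry it equals the central interval."""
    tail = (1.0 - mass) / 2.0
    dist = NormalDist(mu=mean, sigma=variance ** 0.5)
    return dist.inv_cdf(tail), dist.inv_cdf(1.0 - tail)


# Hypothetical posterior parameters, for illustration only.
low, high = normal_hdi(mean=10.5, variance=0.025)
print(f"90% HDI: [{low:.3f}, {high:.3f}]")
```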

We also estimate the relative difference in means between two variations. We define the relative difference in the means of variations $v$ and $w$ as the parameter $\%\Delta_{v,w} = (\mu_v - \mu_w) / \mu_w$. The relative difference $\%\Delta_{v,w}$ also has a posterior distribution. To derive it, we apply the delta method to $\mu_v$ and $\mu_w$:

$$f_{\mathrm{post}}\left(\%\Delta_{v,w}\right) \approx \mathsf{Normal}\left(\alpha_v / \alpha_w - 1, \frac{\alpha_v^2}{\alpha_w^2} \left( \frac{\omega^2_v}{\alpha_v^2} + \frac{\omega_w^2}{\alpha_w^2} \right) \right)$$

Díaz-Francés and Rubio (2013) show that the approximation we use for the ratio of means holds under reasonable assumptions; you can read more at “On the existence of a normal approximation to the distribution of the ratio of two independent normal random variables”.
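As an illustration of the delta-method approximation, the following sketch computes the approximate posterior mean and variance of the relative difference from two normal posteriors. The function and parameter names are hypothetical; `alpha_w` and `omega_sq_w` describe the baseline variation, which is the control when comparing against control. The approximation assumes the baseline mean is well away from zero.

```python
def relative_difference_posterior(alpha_v, omega_sq_v, alpha_w, omega_sq_w):
    """Delta-method normal approximation for (mu_v - mu_w) / mu_w."""
    mean = alpha_v / alpha_w - 1.0
    variance = (alpha_v ** 2 / alpha_w ** 2) * (
        omega_sq_v / alpha_v ** 2 + omega_sq_w / alpha_w ** 2
    )
    return mean, variance


# Hypothetical posterior parameters, for illustration only.
mean, variance = relative_difference_posterior(
    alpha_v=10.5, omega_sq_v=0.025, alpha_w=10.0, omega_sq_w=0.025
)
print(f"relative difference: mean={mean:.4f}, sd={variance ** 0.5:.4f}")
```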

As with the mean of the metric for a single variation, we use the 90% highest density interval for the 90% credible interval. The experiment results table displays the 90% credible interval of the relative difference in means between each variation and the control variation ($\%\Delta_{v,0}$ for all $v \neq 0$) in the column Relative difference from Control.

Conversion metrics

Conversion metrics in LaunchDarkly indicate whether or not an event occurred. You can read more about creating conversion metrics at Create metrics.

We use different models for conversion metrics depending on whether the metric events are aggregated by unit using the average or the sum. If conversion metric events are aggregated by unit using the sum function, then the metric is interpreted as the average number of conversions per unit. We use the methods described in the previous section to estimate the mean of the metric for each variation.

If conversion metric events are aggregated by unit using the average function, the metric is interpreted as the conversion rate, meaning the proportion of units that experienced an event. Using the per-unit average of metric events ignores the number of times a unit converted and results in a binary variable taking values of 0 or 1. Because these conversion metrics are binary, we can use a binomial distribution to model the total number of conversions, with the conversion rate inferred as the proportion parameter of the binomial distribution.

Suppose that $\bar{y}_v$ is the proportion of the $N_v$ units in variation $v$ that converted. Then a total of $N_{v} \bar{y}_v$ units converted, and $N_{v} (1 - \bar{y}_v)$ units did not convert.

To model the total number of conversions ($N_{v} \bar{y}_v$), we use a binomial distribution with proportion parameter $\mu_v$ and size $N_v$ as the likelihood function:

$$\begin{aligned} f_{\mathrm{like}}(N_v \bar{y}_v) &= \mathsf{Binomial}(N_v, \mu_v) \end{aligned}$$

We denote the proportion parameter as $\mu_v$ to be consistent with the notation used in the section Numeric metrics.

We use a Beta distribution as the prior for $\mu_v$:

$$\begin{aligned} f_{\mathrm{prior}}(\mu_v) &= \mathsf{Beta}(a_v, b_v) . \end{aligned}$$

The values of the prior hyperparameters $a_v$ and $b_v$ differ between the control ($v = 0$) and treatment ($v \neq 0$) variations. For the control variation, we use the uniform distribution with $a_0 = 1$ and $b_0 = 1$. For the treatment variations, we use a prior similar to the one used for numeric metrics: a Beta distribution with hyperparameters $a_v$ and $b_v$ chosen such that its expected value and variance are:

$$\begin{aligned} \mathbb{E}[f_{\mathrm{prior}}(\mu_v)] &= \bar{y}_0, \\ \mathrm{Var}(f_{\mathrm{prior}}(\mu_v)) &= \bar{y}_0^2 \hat{\gamma}^2 + \frac{\bar{y}_0 (1 - \bar{y}_0)}{N_0} . \end{aligned}$$

The value of $\hat{\gamma}^2$ is the variance of the empirical distribution of relative differences across experiments using a binary metric, and is currently set to $\hat{\gamma}^2 \approx 0.04$.

The posterior distribution of $\mu_v$ is also a Beta distribution:

$$\begin{aligned} f_{\mathrm{post}}(\mu_v | \bar{y}_v, N_v) &= \mathsf{Beta}(a_v + N_v \bar{y}_v, b_v + N_v (1 - \bar{y}_v)) \end{aligned}$$

The expected value of this distribution is our preferred point estimate of $\mu_v$:

$$\hat{\mu}_v = \mathbb{E}[f_{\mathrm{post}}(\mu_v)] = \frac{a_v + N_v \bar{y}_v}{a_v + b_v + N_v}$$
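The sketch below shows one way to obtain treatment hyperparameters that satisfy the stated mean and variance, followed by the conjugate Beta update. Solving for $a_v$ and $b_v$ by moment matching is an assumption here, as are the function names and the input values; the guide only states the target mean and variance.

```python
def beta_from_moments(mean, variance):
    """Return (a, b) of the Beta distribution with the given mean and variance."""
    if not 0.0 < variance < mean * (1.0 - mean):
        raise ValueError("variance must be in (0, mean * (1 - mean))")
    total = mean * (1.0 - mean) / variance - 1.0  # a + b
    return mean * total, (1.0 - mean) * total


def beta_posterior(a, b, n, rate):
    """Conjugate update after observing n units, a proportion `rate` of which converted."""
    conversions = n * rate
    return a + conversions, b + (n - conversions)


# Hypothetical control and treatment summaries, for illustration only.
control_rate, control_n, gamma_sq = 0.10, 1000, 0.04
prior_var = control_rate ** 2 * gamma_sq + control_rate * (1 - control_rate) / control_n
a_v, b_v = beta_from_moments(control_rate, prior_var)
a_post, b_post = beta_posterior(a_v, b_v, n=1000, rate=0.12)
print("posterior mean:", a_post / (a_post + b_post))
```

With these inputs the posterior mean falls between the control rate (0.10) and the treatment's observed rate (0.12), which is the shrinkage toward control described earlier.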

The experiment results table displays the value of $\hat{\mu}_v$ in the Posterior mean column.

As with numeric metrics, we use the highest density interval for the 90% credible interval of $f_{\mathrm{post}}(\mu_v)$.
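Unlike the normal case, a Beta posterior is generally skewed, so its highest density interval differs from the equal-tailed interval. The guide does not say how the interval is computed; the sketch below uses one common approach, which is to draw samples and take the narrowest window containing 90% of them. The function name, seed, and hyperparameters are hypothetical.

```python
import math
import random


def beta_hdi_from_samples(a, b, mass=0.90, draws=20_000, seed=0):
    """Approximate the HDI of Beta(a, b) as the narrowest window holding `mass` of the draws."""
    rng = random.Random(seed)
    samples = sorted(rng.betavariate(a, b) for _ in range(draws))
    window = math.ceil(mass * draws)  # number of draws the interval must contain
    best_low, best_high = samples[0], samples[window - 1]
    for start in range(1, draws - window + 1):
        low, high = samples[start], samples[start + window - 1]
        if high - low < best_high - best_low:
            best_low, best_high = low, high
    return best_low, best_high


# Hypothetical posterior hyperparameters, for illustration only.
low, high = beta_hdi_from_samples(138.0, 1044.0)
print(f"approximate 90% HDI: [{low:.4f}, {high:.4f}]")
```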

The experiment results table displays the 90% credible interval of $\mu_v$ in the Conversion rate column.

To calculate the relative difference in means between each variation and the control variation ($\%\Delta_{v,0}$ for all $v \neq 0$), we use the same method as for numeric metrics after transforming the posterior distributions of the means to normal distributions by matching the expected values and variances.
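Matching the expected value and variance means replacing each Beta posterior with the normal distribution that has the same first two moments. A minimal sketch, with a hypothetical function name and inputs, using the standard formulas for the mean and variance of a Beta distribution:

```python
def beta_to_normal(a, b):
    """Mean and variance of Beta(a, b), used as the parameters of a matching normal."""
    total = a + b
    mean = a / total
    variance = a * b / (total ** 2 * (total + 1.0))
    return mean, variance


# Hypothetical posterior hyperparameters, for illustration only.
mean, variance = beta_to_normal(138.0, 1044.0)
print(f"Normal(mean={mean:.4f}, variance={variance:.6f})")
```

The resulting mean and variance play the roles of the normal posterior's location and squared scale in the relative difference calculation for numeric metrics.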

The experiment results table displays the 90% credible interval of the relative difference in means between each variation and the control variation ($\%\Delta_{v,0}$ for all $v \neq 0$) in the column Relative difference from Control.

Probability to be best

For both numeric and conversion metrics, we calculate the probability to be best for each variation.

The probability to be best is the probability that the mean value per unit of a variation is the largest of all the variations if the success direction is positive. If the success direction is negative, then the probability to be best is the probability that the mean value per unit of a variation is the smallest of all the variations. The success direction is positive when the metric’s success criterion is “Higher is better,” and negative when it is “Lower is better.” LaunchDarkly calculates the probability to be best for each variation by taking samples from the posterior distributions of the $\mu_v$. The proportion of samples in which a variation is the largest, or smallest if the success direction is negative, is the probability to be best for that variation.

In the case where there are only two variations ($v$ and $w$) and the success direction of the metric is positive, the probability to be best for variation $v$ is the probability that the difference in means $\Delta_{v,w} = \mu_v - \mu_w$ is greater than zero.
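The sampling procedure can be sketched as follows. This is an illustration under stated assumptions: the posteriors are normal, draws for different variations are taken independently, and the function name, number of draws, and seed are hypothetical. For conversion-rate metrics, the draws would come from the Beta posteriors instead, for example with `random.betavariate`.

```python
import random


def probability_to_be_best(posteriors, higher_is_better=True, draws=20_000, seed=0):
    """posteriors: list of (mean, variance) normal posteriors, one per variation."""
    rng = random.Random(seed)
    wins = [0] * len(posteriors)
    pick = max if higher_is_better else min
    for _ in range(draws):
        # One joint draw: a sampled mean for every variation.
        sample = [rng.gauss(mean, variance ** 0.5) for mean, variance in posteriors]
        wins[sample.index(pick(sample))] += 1
    return [count / draws for count in wins]


# Hypothetical posteriors for a control and two treatments, for illustration only.
posteriors = [(10.0, 0.025), (10.5, 0.025), (10.2, 0.025)]
print(probability_to_be_best(posteriors))
print(probability_to_be_best(posteriors, higher_is_better=False))
```

Because exactly one variation wins each draw, the probabilities across variations sum to one.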

Probability to beat control

The probability to beat control represents the probability that a treatment variation’s mean outperforms the control variation’s mean. When the success direction is positive, it’s the probability that the mean value per unit of a treatment variation exceeds that of the control variation. Conversely, if the success direction is negative, it reflects the probability that the mean value per unit of a treatment variation is smaller than that of the control variation.

LaunchDarkly calculates this probability for each non-control variation $v$ by sampling from the posterior distributions of $\mu_v$ and $\mu_0$. The probability to beat control for that variation is the proportion of samples in which the treatment variation outperforms the control, meaning its sampled mean is larger if the success direction is positive, or smaller if the success direction is negative.
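A minimal sketch of this calculation for two normal posteriors, assuming independent draws, appears below. The function names, number of draws, and seed are hypothetical. The closed-form function is included only as a sanity check: under the same independence assumption, the difference of two normal variables is itself normal, so the sampled estimate should be close to it.

```python
import random
from statistics import NormalDist


def probability_to_beat_control(treatment, control, higher_is_better=True,
                                draws=20_000, seed=0):
    """treatment, control: (mean, variance) of normal posteriors."""
    rng = random.Random(seed)
    (mean_v, var_v), (mean_0, var_0) = treatment, control
    wins = 0
    for _ in range(draws):
        mu_v = rng.gauss(mean_v, var_v ** 0.5)
        mu_0 = rng.gauss(mean_0, var_0 ** 0.5)
        beats = mu_v > mu_0 if higher_is_better else mu_v < mu_0
        wins += beats
    return wins / draws


def closed_form_beat_control(treatment, control, higher_is_better=True):
    """Exact P(mu_v > mu_0) for independent normal posteriors (or P(mu_v < mu_0))."""
    (mean_v, var_v), (mean_0, var_0) = treatment, control
    p_larger = NormalDist().cdf((mean_v - mean_0) / (var_v + var_0) ** 0.5)
    return p_larger if higher_is_better else 1.0 - p_larger


# Hypothetical posteriors, for illustration only.
treatment, control = (10.5, 0.025), (10.0, 0.025)
print(probability_to_beat_control(treatment, control))
print(closed_form_beat_control(treatment, control))
```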

Sample ratio mismatch

A sample ratio mismatch (SRM) is when the observed proportions of units receiving variations differ from the proportions chosen in the experiment design. An SRM often indicates an error in the experiment implementation and that the experiment results are not valid.

To detect SRMs we use the sequential method described in these sources:

LaunchDarkly alerts you that a sample ratio mismatch has occurred when the posterior odds favoring a mismatch are greater than 99%.

For more about sample ratio mismatches in the product, read Understanding sample ratios.

Average and sum metrics

Because a unit in an experiment can have multiple metric events, but experiment metrics must have one value per unit, we aggregate all experiment metric events associated with a unit. Suppose unit $i$ has $E_i$ events associated with it during the experiment period, unit $i$ is assigned to variation $v$, and $y_{v,i,e}$ is the value of the $e$th metric event for unit $i$ assigned to variation $v$. LaunchDarkly calculates the metric value $y_{v,i}$ for unit $i$ assigned to variation $v$ as follows:

  • Average: $y_{v,i} = \frac{1}{E_i} \sum_{e=1}^{E_i} y_{v,i,e}$ if $E_i \geq 1$, else 0,
  • Sum: $y_{v,i} = \sum_{e=1}^{E_i} y_{v,i,e}$ if $E_i \geq 1$, else 0.

For both aggregation methods, LaunchDarkly treats units for which we do not receive metric events as having a value of zero.

For example, consider a metric named transaction_value that is defined as the value in dollars of transactions made by a user. Suppose a particular user had transaction_value events during the experiment period with values of 10, 20, and 30.

Here is how the two unit aggregation methods would work:

  • When the average aggregation method is applied, the metric value is calculated as the mean of these events, resulting in (10+20+30)/3=20.
  • When the sum aggregation method is applied, the metric value is the total of these events, resulting in 10+20+30=60.
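The two aggregation rules, including the zero value for units with no events, can be expressed as a short function. The function name and the `method` strings are hypothetical and do not correspond to a LaunchDarkly API.

```python
def aggregate_unit(event_values, method):
    """Collapse one unit's metric events into a single per-unit value."""
    if not event_values:
        return 0.0  # units with no metric events are treated as zero
    total = float(sum(event_values))
    if method == "sum":
        return total
    if method == "average":
        return total / len(event_values)
    raise ValueError(f"unknown aggregation method: {method!r}")


events = [10, 20, 30]  # the transaction_value events from the example above
print(aggregate_unit(events, "average"))
print(aggregate_unit(events, "sum"))
print(aggregate_unit([], "sum"))
```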

To learn more, read Unit aggregation method.

Conclusion

This guide explained the statistical methods LaunchDarkly applies to Bayesian experiments. To learn about frequentist statistical methods in LaunchDarkly, read Experimentation statistical methodology for frequentist experiments.
