Experimentation statistical methodology for Bayesian experiments | LaunchDarkly

This guide includes advanced concepts

This section includes an explanation of advanced statistical concepts. We provide them for informational purposes, but you do not need to understand these concepts to use Experimentation.

Overview

This guide explains the statistical methods LaunchDarkly applies to Bayesian experiments in its Experimentation platform.

For a high-level overview of frequentist and Bayesian statistics, read Bayesian versus frequentist statistics.

Concepts

An experiment comprises two or more variations, one or more metrics, a randomization unit, and the units assigned to those variations in the experiment. This section defines the mathematical notation which we will use in the remainder of the document.

The Experimentation-related terms and their notations for the purpose of this document include:

Variations: An experiment has $V$ variations indexed $v = 0, 1, \ldots, V-1$ . We will refer to the variation $v = 0$ as the control variation.
Randomization units and units: The type of the experiment unit is called the randomization unit. Examples of randomization units include user, user-time, organization, and request. A unit is a specific instance of a randomization unit that you assign to a variation in the experiment. In LaunchDarkly, the randomization unit is a context kind, and a unit is a context key. At the time of an analysis, there are $N_v$ units observed for variation $v$ , which are indexed $i = 1, \ldots, N_v$ .
Metrics: An experiment can have one primary metric and several secondary metrics as described in the Metrics topic. The methods described below apply to both primary and secondary metrics. Let $\mathbf{y}_v$ be a vector of length $N_v$ representing the metric values for variation $v$ , and $y_{v,i}$ be the observed value of the metric for unit $i$ assigned to variation $v$ in the experiment.
Threshold: You set the success threshold $(1-\alpha) \times 100 \%$ when creating an experiment. The threshold falls within the range of $0 \%$ to $100 \%$ . Some common choices of success threshold include $90\%$ , $95\%$ , and $99\%$ . To understand the specific step at which the threshold is set, read Creating feature change experiments or Creating funnel optimization experiments.

In LaunchDarkly experiments, a metric’s unit of analysis must be the same as the unit of randomization. This means that if your experiment has “user” as the unit of randomization then any metric must also be a user-level metric. Because units in an experiment can be associated with multiple events, all events for a user are aggregated into unit-level metrics as described in the section Average and sum metrics.

Objective of an experiment

Our methods are designed around the belief that the primary objective of an experiment is to make a decision between variations. The experiment results inform that decision by providing estimates of the causal effects on the metrics of interest for each variation.

LaunchDarkly’s Experimentation platform offers Bayesian inference as an option for the reasons described in the guide Bayesian versus frequentist statistics. In Bayesian statistics, the decision process is separated into inference and decision steps. Our first step is inference, where we combine our prior beliefs with the available data to estimate the unknown parameters we will use to make our decision. We will represent our beliefs about these parameters in the form of a probability distribution, called the posterior distribution.

The second step is to make a decision. Because Bayesian estimates are probability distributions, the experimenter can interpret these estimates as probabilities and incorporate them into their decision process.

In LaunchDarkly experiments, the experimenter wants to learn the average value per unit of the metric conditional on the variation in order to make their decision. While we observe the average value in the experiment samples exposed to a variation, we do not know what the average value of that metric would be if a variation were applied to the entire target population. Let $\mu_v$ refer to the unknown mean value per unit of the metric of interest for variation $v$ . Our statistical methods will estimate a posterior distribution for $\mu_v$ for each variation $v$ .

We summarize the posterior distribution of $\mu_v$ with the following statistics:

$\mathbf{(1-\alpha) \times 100 \%}$ credible interval is a pair of lower and upper values that has a $(1-\alpha) \times 100 \%$ probability of containing the true value of $\mu_v$
Posterior mean is a point estimate of $\mu_v$

Because the primary purpose of an experiment is for you to decide which variation to launch, we estimate comparisons between variations:

Probability to beat control: For each treatment variation $v$ , the probability that $\mu_v$ is larger or smaller than $\mu_0$ , depending on the metric’s success criterion.
Probability to be best: For each variation, the probability that $\mu_v$ is larger or smaller than the $\mu_w$ of all other variations, depending on the metric’s success criterion.
Relative difference from control: For each treatment variation $v$ , LaunchDarkly calculates a point estimate and a $(1-\alpha) \times 100 \%$ credible interval for the relative difference from control, expressed as $(\mu_v - \mu_0) / \mu_0$ .

How LaunchDarkly calculates the posterior distribution of $\mu_v$ depends on whether the metric is a numeric metric or a conversion metric. We discuss the estimation procedure for each metric type separately in the following sections.

Numeric metrics

Numeric metrics have numeric values associated with their events so they can take any numeric value. Examples of numeric metrics include page load time, efficacy of various search algorithms, and number of items in a shopping cart at checkout. Numeric metrics contrast with conversion metrics which only track whether or not an event occurred. You can read more about creating these metrics in Numeric metrics.

In our statistical methods, numeric metrics are treated as unbounded continuous random variables. With numeric metrics, the shape of the data-generating distribution for the unit-level metric values $y_{v,i}$ is unknown. However, because we are interested in estimating the population mean $\mu_v$ , we can simplify our analysis by appealing to the Central Limit Theorem. Under some regularity conditions, as $N_v \to \infty$ , the sample mean $\bar{y}_v = (\sum_{i = 1}^{N_v} y_{v,i}) / N_v$ is approximately normally distributed with location $\mu_v$ and scale $\sigma_v / \sqrt{N_v}$ .

For numeric metrics, we use the following likelihood function for the sample mean of the observed data:

\begin{aligned} f_{\mathrm{like}}(\bar{y}_v | \mu_v) = \mathsf{Normal}(\mu_v, \sigma_v^2 / N_v) \end{aligned}

For convenience, and because $\sigma_v$ is not the primary goal of our inference, we treat $\sigma_v$ as known and equal to an estimate of the standard deviation, $\hat{\sigma}_v$ , calculated from the sample. Because we plug in an estimated value for $\sigma_v$ rather than estimating it in the model, our method is an empirical Bayesian method. This is the case with most of our statistical methods, as we are willing to trade off methodological purity for practicality.
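As an illustration of the quantities that enter the likelihood, the following minimal sketch computes $N_v$ , $\bar{y}_v$ , $\hat{\sigma}_v$ , and the scale $\hat{\sigma}_v / \sqrt{N_v}$ from a list of unit-level values. The function name `summarize` is hypothetical, and the use of the sample standard deviation with an $N_v - 1$ denominator is an assumption; this guide does not state which estimator LaunchDarkly uses.

```python
from statistics import fmean, stdev


def summarize(values):
    """Return (N_v, y-bar_v, sigma-hat_v, standard error) for one variation.

    Assumption: sigma-hat_v is the sample standard deviation (n - 1 denominator).
    """
    n = len(values)
    mean = fmean(values)
    sd = stdev(values)
    return n, mean, sd, sd / n ** 0.5


# Hypothetical unit-level metric values y_{v,i} for one variation.
n, mean, sd, std_err = summarize([1.0, 2.0, 3.0, 4.0])
print(n, mean, sd, std_err)
```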

To complete the model, we need to specify a prior distribution for $\mu_v$ . For the control variation, we use an improper non-informative prior $f_{\mathrm{prior}}(\mu_0) \propto 1$ . For the other variations, we use priors that shrink the results towards the control variation’s mean. We generate this prior from the empirical distribution of relative differences between variations in all experiments on our platform using metrics of the same type (numeric or conversion) and aggregation function (average or sum).

The equation for this prior is:

\begin{aligned} f_{\mathrm{prior}}(\mu_v) &= \mathsf{Normal}(a_v, w_v^2), \\ a_v &= \bar{y}_0, \\ w_v^2 &= \bar{y}_0^2 \hat{\gamma}^2 + \hat{\sigma}_0^2 / N_0 \end{aligned}

where $\hat{\gamma}^2$ is the variance of the distribution of observed relative differences ( $(\bar{y}_v - \bar{y}_0) / \bar{y}_0$ ) across all experiments with numeric metrics on the platform. The first term, $\bar{y}_0^2 \hat{\gamma}^2$ , converts the variance of relative differences into the units of the metric by scaling it by the squared observed control mean. The second term, $\hat{\sigma}_0^2 / N_0$ , accounts for the uncertainty in our estimate of the control mean. The value of $\hat{\gamma}^2$ is between 0.13 and 0.19, conditional on the type of the metric.

Combining the likelihood and prior provides the posterior distribution of $\mu_v$ , which represents our beliefs about $\mu_v$ after observing the data from the experiment.

Given the normal likelihood and prior, the posterior distribution is also a normal distribution with the following parameters:

\begin{aligned} f_{\mathrm{post}}(\mu_v) &= \mathsf{Normal}(\alpha_v, \omega_v^2) , \\ \alpha_v &= \omega_v^2 \left(\frac{N_v}{\hat{\sigma}_v^2} \bar{y}_v + \frac{1}{w_v^2} a_v \right) , \\ \omega_v^2 &= \left(\frac{1}{w_v^2} + \frac{N_v}{\hat{\sigma}^2_v} \right)^{-1} \end{aligned}
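The prior and posterior formulas above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not LaunchDarkly's implementation: the `Summary` container, the function names, and the value `gamma_sq=0.15` (picked from inside the 0.13 to 0.19 range mentioned above) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Summary:
    """Sample summary for one variation (hypothetical container)."""
    n: int        # N_v, number of units
    mean: float   # sample mean, y-bar_v
    sd: float     # estimated standard deviation, sigma-hat_v


def treatment_prior(control: Summary, gamma_sq: float):
    """Return (a_v, w_v^2), the normal prior for a treatment variation."""
    a = control.mean
    w_sq = control.mean ** 2 * gamma_sq + control.sd ** 2 / control.n
    return a, w_sq


def normal_posterior(data: Summary, a: float, w_sq: float):
    """Return (alpha_v, omega_v^2) from the conjugate normal update."""
    data_precision = data.n / data.sd ** 2   # N_v / sigma-hat_v^2
    prior_precision = 1.0 / w_sq             # 1 / w_v^2
    omega_sq = 1.0 / (prior_precision + data_precision)
    alpha = omega_sq * (data_precision * data.mean + prior_precision * a)
    return alpha, omega_sq


control = Summary(n=1000, mean=10.0, sd=4.0)
treatment = Summary(n=1000, mean=10.5, sd=4.0)
a, w_sq = treatment_prior(control, gamma_sq=0.15)  # illustrative gamma-hat^2
alpha, omega_sq = normal_posterior(treatment, a, w_sq)
print(alpha, omega_sq)
```

The posterior mean is a precision-weighted average of the sample mean and the prior mean, so it always falls between $\bar{y}_v$ and $\bar{y}_0$ . For the control variation, the flat prior contributes no precision, and the posterior reduces to $\mathsf{Normal}(\bar{y}_0, \hat{\sigma}_0^2 / N_0)$ .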

The experiment results page displays the posterior distribution of each variation’s mean ( $f_{\mathrm{post}}(\mu_v)$ ) in the probability charts.

We use the expected value of the posterior distribution as a point estimate for $\mu_v$ ,

\hat{\mu}_v = \mathbb{E}[f_{\mathrm{post}}(\mu_v)] = \alpha_v

The experiment results table displays the value of $\hat{\mu}_v$ in the Posterior mean column. We use a $(1-\alpha) \times 100 \%$ credible interval of the posterior distribution to provide a range of plausible values for $\mu_v$ . There are multiple valid methods to calculate credible intervals. We use the highest density interval (HDI), which is the shortest interval that contains $(1-\alpha) \times 100 \%$ of the probability mass of the posterior distribution.
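Because a normal distribution is symmetric and unimodal, its highest density interval coincides with the equal-tailed interval, so it can be read directly from the quantile function. The following is a minimal sketch using only the Python standard library; the function name `normal_hdi` and the default threshold are hypothetical.

```python
from statistics import NormalDist


def normal_hdi(alpha_v, omega_sq, threshold=0.90):
    """HDI of Normal(alpha_v, omega_sq) for a success threshold in (0, 1).

    For a symmetric unimodal density the shortest interval is the
    equal-tailed one, so two quantiles are enough.
    """
    posterior = NormalDist(mu=alpha_v, sigma=omega_sq ** 0.5)
    tail = (1.0 - threshold) / 2.0
    return posterior.inv_cdf(tail), posterior.inv_cdf(1.0 - tail)


low, high = normal_hdi(alpha_v=10.5, omega_sq=0.016, threshold=0.95)
print(low, high)
```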

We also estimate the relative difference in means between two variations. We define the relative difference in the means of variations $v$ and $w$ as a parameter $\%\Delta_{v,w} = (\mu_v - \mu_w) / \mu_w$ . The relative difference in the means $\%\Delta_{v,w}$ also has a posterior distribution. To derive the posterior distribution of $\%\Delta_{v,w}$ , we apply the delta method to the posterior distributions of $\mu_v$ and $\mu_w$ ,

f_{\mathrm{post}}\left(\%\Delta_{v,w}\right) \approx \mathsf{Normal}\left(\alpha_v / \alpha_w - 1, \frac{\alpha_v^2}{\alpha_w^2} \left( \frac{\omega^2_v}{\alpha_v^2} + \frac{\omega_w^2}{\alpha_w^2} \right) \right)

Díaz-Francés and Rubio (2013) show that the approximation we use for the ratio of means holds under reasonable assumptions; you can read more at “On the existence of a normal approximation to the distribution of the ratio of two independent normal random variables”.
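The delta-method approximation can be sketched as follows for a pair of variations $v$ and $w$ , where $w = 0$ gives the comparison with control. The function name is hypothetical, and the sketch assumes the two posterior distributions are independent and that the posterior mean of the reference variation is not close to zero.

```python
def relative_difference_posterior(alpha_v, omega_sq_v, alpha_w, omega_sq_w):
    """Approximate normal posterior (mean, variance) of (mu_v - mu_w) / mu_w.

    Inputs are the posterior means and variances of mu_v and mu_w.
    """
    ratio = alpha_v / alpha_w
    mean = ratio - 1.0
    variance = ratio ** 2 * (omega_sq_v / alpha_v ** 2 + omega_sq_w / alpha_w ** 2)
    return mean, variance


mean, variance = relative_difference_posterior(11.0, 0.04, 10.0, 0.04)
print(mean, variance)
```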

As with the mean of the metric for a single variation, we use the $(1-\alpha) \times 100 \%$ highest density interval as the $(1-\alpha) \times 100 \%$ credible interval. The experiment results table displays the $(1-\alpha) \times 100 \%$ credible interval of the relative difference in means between each variation and the control variation ( $\%\Delta_{v,0}$ for all $v \neq 0$ ) in the column Relative difference from Control.

Conversion metrics

Conversion metrics in LaunchDarkly indicate whether or not an event occurred. You can read more about creating conversion metrics at Create metrics.

We use different models for conversion metrics depending on whether the metric events are aggregated by unit using the average or the sum. If conversion metric events are aggregated by unit using the sum function, then the metric is interpreted as the average number of conversions per unit. We use the methods described in the previous section to estimate the mean of the metric for each variation.

If conversion metric events are aggregated by unit using the average function, the metric is interpreted as the conversion rate, meaning the proportion of units that experienced an event. Using the per-unit average of metric events ignores the number of times a unit converted and results in a binary variable taking values of 0 or 1. Because these conversion metrics are binary, we can use a binomial distribution to model the total number of conversions, with the conversion rate inferred as the proportion parameter of the binomial distribution.

Suppose that $\bar{y}_v$ is the proportion of the $N_v$ units in variation $v$ that converted. Then a total of $N_{v} \bar{y}_v$ units converted, and $N_{v} (1 - \bar{y}_v)$ units did not convert.

To model the total number of conversions ( $N_{v} \bar{y}_v$ ), we use a binomial distribution with proportion parameter $\mu_v$ and size $N_v$ as the likelihood function:

\begin{aligned} f_{\mathrm{like}}(N_v \bar{y}_v | \mu_v) &= \mathsf{Binomial}(N_v, \mu_v) \end{aligned}

We denote the proportion parameter as $\mu_v$ to be consistent with the notation used in the section Numeric metrics.

We use a Beta distribution as the prior for $\mu_v$ ,

\begin{aligned} f_{\mathrm{prior}}(\mu_v) &= \mathsf{Beta}(a_v, b_v) \end{aligned}

The values of the prior hyperparameters $a_v$ and $b_v$ differ between the control ( $v = 0$ ) and treatment variations ( $v \neq 0$ ). For the control variation ( $v = 0$ ), we use the uniform distribution with $a_0 = 1$ and $b_0 = 1$ . For the treatment variations ( $v \neq 0$ ), we use a prior similar to the one used for numeric metrics. The prior for treatment variations is a Beta distribution with hyperparameters $a_v$ and $b_v$ chosen such that its expected value and variance are:

\begin{aligned} \mathbb{E}[f_{\mathrm{prior}}(\mu_v)] &= \bar{y}_0, \\ \mathrm{Var}(f_{\mathrm{prior}}(\mu_v)) &= \bar{y}_0^2 \hat{\gamma}^2 + \frac{\bar{y}_0 (1 - \bar{y}_0)}{N_0} \end{aligned}

The value of $\hat{\gamma}^2$ is the variance of the empirical distribution of relative differences of experiments using a binary metric, and is currently set to $\hat{\gamma}^2 \approx 0.04$ .
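One way to obtain hyperparameters with a given expected value and variance is to invert the Beta moment formulas: for $\mathsf{Beta}(a, b)$ with mean $m$ and variance $s^2$ , $a + b = m(1 - m)/s^2 - 1$ , $a = m(a + b)$ , and $b = (1 - m)(a + b)$ . The sketch below applies that inversion to the target moments above. It is an assumption that the hyperparameters are solved this way, and the function names are hypothetical.

```python
def beta_from_moments(mean, variance):
    """Return (a, b) of the Beta distribution with the given mean and variance."""
    if not 0.0 < mean < 1.0:
        raise ValueError("mean must lie strictly between 0 and 1")
    max_variance = mean * (1.0 - mean)
    if not 0.0 < variance < max_variance:
        raise ValueError("variance must lie in (0, mean * (1 - mean))")
    total = max_variance / variance - 1.0  # a + b
    return mean * total, (1.0 - mean) * total


def treatment_beta_prior(control_rate, n_control, gamma_sq=0.04):
    """Beta prior for a treatment variation, centered on the control rate."""
    variance = (control_rate ** 2 * gamma_sq
                + control_rate * (1.0 - control_rate) / n_control)
    return beta_from_moments(control_rate, variance)


a_v, b_v = treatment_beta_prior(control_rate=0.10, n_control=1000)
print(a_v, b_v)
```

The inversion only has a solution when the target variance is smaller than $\bar{y}_0 (1 - \bar{y}_0)$ , which is why the sketch raises an error otherwise.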

The posterior distribution of $\mu_v$ is also a Beta distribution:

\begin{aligned} f_{\mathrm{post}}(\mu_v | \bar{y}_v, N_v) &= \mathsf{Beta}(a_v + N_v \bar{y}_v, b_v + N_v (1 - \bar{y}_v)) \end{aligned}

The expected value of this distribution is our preferred point estimate of $\mu_v$ :

\hat{\mu}_v = \mathbb{E}[f_{\mathrm{post}}(\mu_v)] = \frac{a_v + N_v \bar{y}_v}{a_v + b_v + N_v}
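The conjugate update and point estimate above amount to adding the observed counts to the prior hyperparameters. A minimal sketch, with hypothetical function names and made-up counts:

```python
def beta_posterior(a, b, n_units, conversions):
    """Return posterior (a', b') after observing `conversions` out of `n_units`."""
    return a + conversions, b + (n_units - conversions)


def beta_mean(a, b):
    """Expected value of Beta(a, b), used as the point estimate of mu_v."""
    return a / (a + b)


# Control variation with the uniform Beta(1, 1) prior and 100 of 1000 units converted.
a_post, b_post = beta_posterior(1.0, 1.0, n_units=1000, conversions=100)
print(a_post, b_post, beta_mean(a_post, b_post))
```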

The experiment result table displays the value of $\hat{\mu}_v$ in the Posterior mean column.

As with numeric metrics, we use the highest density interval for the $(1-\alpha) \times 100 \%$ credible interval of $f_{\mathrm{post}}(\mu_v)$ .

The experiment results table displays the $(1-\alpha) \times 100 \%$ credible interval of $\mu_v$ in the Conversion rate column.
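Unlike the normal case, a Beta posterior is generally not symmetric, so its highest density interval differs from the equal-tailed interval. One common way to approximate an HDI is to draw samples and take the shortest window that contains the required fraction of them. The sketch below uses that sampling approach as an assumption; this guide does not state how LaunchDarkly computes the interval, and the function name is hypothetical.

```python
import math
import random


def hdi_from_samples(samples, threshold=0.90):
    """Shortest interval that contains a `threshold` fraction of the samples."""
    xs = sorted(samples)
    n = len(xs)
    k = min(n, max(1, math.ceil(threshold * n)))  # samples inside the window
    # Slide a window of k consecutive sorted samples and keep the narrowest one.
    start = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[start], xs[start + k - 1]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
draws = [rng.betavariate(101, 901) for _ in range(20000)]
low, high = hdi_from_samples(draws, threshold=0.90)
print(low, high)
```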

To calculate the relative difference in means between each variation and the control variation ( $\%\Delta_{v,0}$ for all $v \neq 0$ ), we use the same method as for numeric metrics after transforming the posterior distributions of the means to normal distributions by matching the expected values and variances.
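Matching expected values and variances means replacing $\mathsf{Beta}(a, b)$ with the normal distribution that has the same mean $a / (a + b)$ and the same variance $ab / ((a + b)^2 (a + b + 1))$ . A minimal sketch with a hypothetical function name:

```python
def beta_to_normal(a, b):
    """Return (mean, variance) of the normal distribution matching Beta(a, b)."""
    total = a + b
    mean = a / total
    variance = a * b / (total ** 2 * (total + 1.0))
    return mean, variance


# Posterior Beta(101, 901) from 100 conversions out of 1000 units with a uniform prior.
print(beta_to_normal(101.0, 901.0))
```

The resulting mean and variance can then play the roles of $\alpha_v$ and $\omega_v^2$ in the relative difference approximation described for numeric metrics.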

The experiment results table displays the $(1-\alpha) \times 100 \%$ credible interval of the relative difference in means between each variation and the control variation ( $\%\Delta_{v,0}$ for all $v \neq 0$ ) in the column Relative difference from Control.

Probability to be best

For both numeric and conversion metrics, LaunchDarkly calculates the probability to be best for each variation.

The probability to be best is the probability that the mean value per unit of a variation is the largest of all the variations if the success direction is positive. If the success direction is negative, then the probability to be best is the probability that the mean value per unit of a variation is the smallest of all the variations. The success direction is positive when the metric’s success criterion is “Higher is better,” and negative when it is “Lower is better.” LaunchDarkly calculates the probability to be best for each variation by taking samples from the posterior distributions of $\mu_v$ for all variations. The proportion of samples in which a variation is the largest, or smallest if the success direction is negative, is the probability to be best for that variation.

In the case where there are only two variations ( $v$ and $w$ ) and the success direction of the metric is positive, the probability to be best for variation $v$ is the probability that the difference in means $\Delta_{v,w} = \mu_v - \mu_w$ is greater than zero.
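The sampling procedure described above can be sketched as a small Monte Carlo loop. This is an illustration only: the function name, the number of draws, the fixed seed, and the three made-up normal posteriors are hypothetical, and the sketch draws from each posterior independently.

```python
import random


def probability_to_be_best(samplers, higher_is_better=True, draws=20000, seed=0):
    """Estimate the probability to be best for each variation.

    `samplers` holds one function per variation; each takes a random.Random
    instance and returns one draw from that variation's posterior of mu_v.
    """
    rng = random.Random(seed)
    pick = max if higher_is_better else min
    wins = [0] * len(samplers)
    for _ in range(draws):
        values = [draw(rng) for draw in samplers]
        wins[values.index(pick(values))] += 1
    return [count / draws for count in wins]


# Made-up normal posteriors (posterior mean, posterior standard deviation).
posteriors = [
    lambda rng: rng.gauss(10.0, 0.13),  # control
    lambda rng: rng.gauss(10.5, 0.13),
    lambda rng: rng.gauss(10.2, 0.13),
]
print(probability_to_be_best(posteriors, higher_is_better=True))
```

The estimated probabilities sum to one across variations, because exactly one variation wins in each draw.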

Probability to beat control

The probability to beat control represents the probability that a treatment variation’s mean outperforms the control variation’s mean. When the success direction is positive, it’s the probability that the mean value per unit of a treatment variation exceeds that of the control variation. Conversely, if the success direction is negative, it reflects the probability that the mean value per unit of a treatment variation is smaller than that of the control variation.

LaunchDarkly calculates this probability for each non-control variation $v$ by sampling from the posterior distributions of $\mu_v$ and $\mu_0$ . The proportion of samples in which the treatment variation outperforms the control, meaning its sampled mean is larger if the success direction is positive or smaller if the success direction is negative, is the probability to beat control for that variation.
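For a conversion metric, the same idea can be sketched by drawing paired samples from the Beta posteriors of the control and one treatment variation. The function name, the posterior hyperparameters, the number of draws, and the seed are hypothetical, and the sketch assumes the two posteriors are sampled independently.

```python
import random


def probability_to_beat_control(control_ab, treatment_ab,
                                higher_is_better=True, draws=20000, seed=0):
    """Estimate P(treatment outperforms control) from Beta posterior draws."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        mu_0 = rng.betavariate(*control_ab)
        mu_v = rng.betavariate(*treatment_ab)
        beats = mu_v > mu_0 if higher_is_better else mu_v < mu_0
        wins += beats
    return wins / draws


# Made-up posteriors: control Beta(101, 901), treatment Beta(131, 871).
print(probability_to_beat_control((101, 901), (131, 871)))
```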

Sample ratio mismatch

A sample ratio mismatch (SRM) is when the observed proportions of units receiving variations differ from the proportions chosen in the experiment design. An SRM often indicates an error in the experiment implementation and that the experiment results are not valid.

To detect SRMs we use the sequential method described in these sources:

LaunchDarkly alerts you that a sample ratio mismatch has occurred when the posterior odds favoring a mismatch are greater than 99%.

For more about sample ratio mismatches in the product, read Understanding sample ratios.

Average and sum metrics

Because a unit in an experiment can have multiple metric events, but experiment metrics must have one value per unit, we aggregate all metric events associated with a unit. Suppose unit $i$ has $E_i$ events associated with it during the experiment period, unit $i$ is assigned to variation $v$ , and $y_{v,i,e}$ is the value of the $e$ th metric event for unit $i$ assigned to variation $v$ . LaunchDarkly calculates the metric value $y_{v,i}$ for unit $i$ assigned to variation $v$ as follows:

Average: $y_{v,i} = \frac{1}{E_i} \sum_{e=1}^{E_i} y_{v,i,e}$ if $E_i \geq 1$ else 0,
Sum: $y_{v,i} = \sum_{e=1}^{E_i} y_{v,i,e}$ if $E_i \geq 1$ else 0.

For both aggregation methods, LaunchDarkly treats units for which we do not receive metric events as having a value of zero.

For example, consider a metric named transaction_value that is defined as the value in dollars of transactions made by a user. Suppose a particular user had transaction_value events during the experiment period with values of 10, 20, and 30.

Here is how the two unit aggregation methods would work:

When the average aggregation method is applied, the metric value is calculated as the mean of these events, resulting in (10+20+30)/3=20.
When the sum aggregation method is applied, the metric value is the total of these events, resulting in 10+20+30=60.
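The two aggregation rules, including the zero value for units with no events, can be sketched as follows. The function name and the string labels for the methods are hypothetical.

```python
def aggregate_unit(event_values, method):
    """Collapse one unit's metric events into a single value y_{v,i}.

    Units with no events are given a value of zero under both methods.
    """
    if not event_values:
        return 0.0
    total = float(sum(event_values))
    if method == "sum":
        return total
    if method == "average":
        return total / len(event_values)
    raise ValueError("method must be 'average' or 'sum'")


events = [10, 20, 30]  # transaction_value events for one user
print(aggregate_unit(events, "average"))  # 20.0
print(aggregate_unit(events, "sum"))      # 60.0
print(aggregate_unit([], "average"))      # 0.0
```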

To learn more, read Unit aggregation method.

Conclusion

This guide explained the statistical methods LaunchDarkly applies to Bayesian experiments. To learn about frequentist statistical methods in LaunchDarkly, read Experimentation statistical methodology for frequentist experiments.
