Statistical methodology for frequentist experiments | LaunchDarkly

This guide includes advanced concepts

This section includes an explanation of advanced statistical concepts. We provide them for informational purposes, but you do not need to understand these concepts to use Experimentation.

Overview

This guide explains the statistical methodology LaunchDarkly uses to calculate frequentist experiment variation means, and how these analytics formulas are useful for validating your results.

For a high-level overview of frequentist and Bayesian statistics, read Bayesian versus frequentist statistics.

Data mean formula

The formula for the data mean differs between conversion metrics and numeric metrics:

Conversion metrics, including custom conversion binary, custom conversion count, page viewed, and clicked or tapped metrics, use the total number of conversions divided by the total number of exposures: $DataMean = SampleMean = Conversions / Exposures$
Numeric metrics use the total value divided by the total number of exposures: $DataMean = SampleMean = TotalValue / Exposures$

When you hover on the “Conversion rate” and “Mean” headings in an experiment results table, the above formulas for the data mean display.
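As a minimal sketch of the data mean formulas, the following Python computes both kinds of data mean. The function name `data_mean` and the input numbers are hypothetical and chosen only for illustration.

```python
def data_mean(total, exposures):
    """Return total / exposures, the data mean for either metric type.

    For conversion metrics, `total` is the number of conversions.
    For numeric metrics, `total` is the summed metric value.
    """
    if exposures <= 0:
        raise ValueError("exposures must be positive")
    return total / exposures


# Hypothetical inputs, not real experiment data.
conversion_rate = data_mean(total=120, exposures=2000)    # Conversions / Exposures
numeric_mean = data_mean(total=5300.0, exposures=2000)    # TotalValue / Exposures
print(conversion_rate, numeric_mean)
```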

CUPED may affect the exact computation of these results. To learn more, read Covariate adjustment and CUPED methodology.

Fixed-horizon versus sequential analysis

At LaunchDarkly, we use two different modes of frequentist analysis:

Fixed-horizon analysis
Sequential analysis

Both modes produce the same set of summary statistics, and each statistic has the same interpretation at any given point in time. However, because the two modes are intended for different uses, we calculate the statistics with different methods, which results in slightly different numbers. These summary statistics are:

Mean (or conversion rate): the average number of conversions across all units in the metric, or the percentage of units with at least one conversion
Confidence interval: the range of values within which the true metric value is likely to fall if you were to repeat the experiment many times
Relative difference from control: how much a metric in the treatment variation differs from the control variation, expressed as a proportion of the control’s estimated value
p-value: the probability of observing a difference between a treatment variation and the control variation at least as large as the one measured, if there were no actual difference in performance between the two variations

The main difference between the fixed-horizon and sequential versions of these statistics is when you can act on them:

Fixed-horizon statistics: You can only act on fixed-horizon statistics at a specific point in time calculated prior to the start of the experiment.
Sequential statistics: You can act on sequential statistics at any point during the experiment.

Fixed-horizon analysis

LaunchDarkly uses the industry- and scientific-standard z-test for fixed-horizon analyses on metrics based on means. This is because data volumes typical of online experimentation imply that sample means in nearly all cases will be approximately normally distributed.

LaunchDarkly also computes confidence intervals in the usual way based on the same normal approximation used in the z-test.
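For illustration, here is a minimal sketch of a two-sided z-test and the matching confidence interval for the absolute difference between two sample means. It assumes independent variations and unpooled variances. The function name `z_test` and the input numbers are hypothetical, and this is not a description of LaunchDarkly's implementation.

```python
from math import sqrt
from statistics import NormalDist


def z_test(mean_c, var_c, n_c, mean_t, var_t, n_t, alpha=0.05):
    """Two-sided z-test on the absolute difference of two sample means."""
    diff = mean_t - mean_c
    # Standard error of the difference, assuming independent groups.
    se = sqrt(var_c / n_c + var_t / n_t)
    z = diff / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Critical value from the same normal approximation (about 1.96 for alpha=0.05).
    crit = NormalDist().inv_cdf(1 - alpha / 2)
    return diff, (diff - crit * se, diff + crit * se), p_value


# Hypothetical conversion rates of 10% and 12% with 5,000 exposures each.
# For a binary conversion metric, the per-unit variance is p * (1 - p).
diff, ci, p_value = z_test(0.10, 0.10 * 0.90, 5000, 0.12, 0.12 * 0.88, 5000)
print(diff, ci, p_value)
```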

Mathematical details for the calculations involved in this test are easily found on the internet, such as on Wikipedia, so we do not delve into specifics here.

Note that, by default, LaunchDarkly computes p-values and confidence intervals based on the relative difference between treatment and control.

The relative difference is $\frac{\bar{X}_T - \bar{X}_C}{\bar{X}_C}$, where $\bar{X}_C$ and $\bar{X}_T$ are the sample means of the control and treatment variations, respectively.

To compute these statistical quantities, LaunchDarkly uses the standard delta method approximation to calculate the variance of the relative difference. For details of the computation, read Applying the Delta Method in Metric Analytics: A Practical Guide with Novel Ideas.
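The sketch below shows one common form of the delta method approximation for the relative difference, assuming the control and treatment samples are independent. Under that assumption, the variance of $\bar{X}_T / \bar{X}_C$ is approximately $\frac{s_T^2}{n_T \bar{X}_C^2} + \frac{\bar{X}_T^2 s_C^2}{n_C \bar{X}_C^4}$. The function name `relative_difference_test` and the inputs are hypothetical, and the exact computation LaunchDarkly performs may differ.

```python
from math import sqrt
from statistics import NormalDist


def relative_difference_test(mean_c, var_c, n_c, mean_t, var_t, n_t, alpha=0.05):
    """z-test on (mean_t - mean_c) / mean_c with a delta-method variance."""
    rel_diff = (mean_t - mean_c) / mean_c
    # First-order (delta method) variance of mean_t / mean_c for independent groups.
    variance = var_t / (n_t * mean_c ** 2) + (mean_t ** 2 * var_c) / (n_c * mean_c ** 4)
    se = sqrt(variance)
    z = rel_diff / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    crit = NormalDist().inv_cdf(1 - alpha / 2)
    return rel_diff, (rel_diff - crit * se, rel_diff + crit * se), p_value


# Hypothetical conversion rates of 10% and 12% with 5,000 exposures each.
rel_diff, rel_ci, rel_p = relative_difference_test(
    0.10, 0.10 * 0.90, 5000, 0.12, 0.12 * 0.88, 5000
)
print(rel_diff, rel_ci, rel_p)
```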

Sequential analysis

For sequential analysis, LaunchDarkly uses the method of Howard et al. (2021) and Waudby-Smith et al. (2024), sometimes referred to as the generalization of always-valid inference, or GAVI.

This method constructs confidence intervals that are similar in structure to the fixed-horizon intervals from a z-test, but it leverages modern results on concentration inequalities. This provides strong statistical guarantees regarding false positive rate and statistical power at every point in time during the experiment, not just at one fixed point.

From those confidence intervals, we can derive p-values that bring the user experience of sequential testing to parity with the ordinary fixed-horizon test.

Below, we describe the two-sided case. The intervals for the one-sided test are similar, except that the entire allowable false positive rate is allocated to one side rather than split between both.

Confidence intervals

The general form of a symmetric two-sided confidence interval is:

$\textrm{estimate} \pm \textrm{BF} \cdot \sigma$

Here, $\sigma$ is the standard deviation of the estimate and $\textrm{BF}$ is a multiplicative bounding factor. In the usual (fixed-horizon) $z$-test, this factor is constant with respect to time and varies only with the significance level of the test.

In sequential analysis, rather than a fixed bounding factor like 1.96 (for a 95%, two-sided fixed-horizon confidence interval), the bounding factor varies with the sample size of the experiment. The formula for the bounding factor also depends on a tuning parameter, which we call $N^*$.

$N^*$ chiefly affects the statistical power of the technique at a given expected sample size for the experiment. The test performs best when $N^*$ is roughly the same order of magnitude as the expected sample size of the experiment. At LaunchDarkly, we set $N^*$ to a reasonable default that we expect to work for most use cases, although you can change it if you know in advance to expect significantly higher or lower traffic volume.
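This guide does not state the formula for the bounding factor. As an illustration only, the sketch below uses the two-sided normal-mixture boundary from Howard et al. (2021), which gives $\textrm{BF} = \sqrt{\frac{n + \rho}{n} \log\left(\frac{n + \rho}{\rho \alpha^2}\right)}$ at sample size $n$, and it assumes the mixture parameter $\rho$ is set equal to $N^*$. Both the choice of boundary and that mapping are assumptions, as are the function name `bounding_factor` and the value 10,000 used for $N^*$.

```python
from math import log, sqrt


def bounding_factor(n, n_star, alpha=0.05):
    """Multiplier on the standard error at sample size n (illustrative only)."""
    rho = n_star  # Assumption: the tuning parameter plays the role of rho.
    return sqrt((n + rho) / n * log((n + rho) / (rho * alpha ** 2)))


# Unlike the constant 1.96, the factor changes as the sample size grows.
for n in (1_000, 10_000, 100_000):
    print(n, round(bounding_factor(n, n_star=10_000), 2))
```

In this sketch, the factor exceeds 1.96 at each of these sample sizes, which reflects the wider intervals needed for guarantees that hold at any time.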

p-values

p-values are not a primary outcome of the GAVI technique described in the papers cited above. However, they can be derived from the confidence intervals, because a p-value and a confidence interval are two views of the same test.

We define the p-value as the smallest value of the significance level $\alpha$ for which the corresponding confidence interval $(L(\alpha), U(\alpha))$ does not include zero.

This construction guarantees that the sequential confidence interval at level $\alpha$ excludes zero (indicating significance) exactly when the p-value is below $\alpha$, so you can use the two measures interchangeably to tell when the comparison is significant.
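The definition above can be sketched numerically for any family of intervals that narrows as $\alpha$ grows. The example below searches for the smallest $\alpha$ whose interval excludes zero. To make the result checkable, it plugs in the fixed-horizon normal interval, whose p-value is known in closed form; a sequential bounding factor would plug in the same way. The function names are hypothetical.

```python
from statistics import NormalDist


def normal_bounding_factor(alpha):
    """Fixed-horizon two-sided bounding factor, for example about 1.96 at alpha=0.05."""
    return -NormalDist().inv_cdf(alpha / 2)


def p_value_from_intervals(estimate, se, bounding_factor, iterations=60):
    """Smallest alpha whose interval estimate +/- BF(alpha) * se excludes zero."""
    if estimate == 0:
        return 1.0  # Every interval contains zero.
    lo, hi = 0.0, 1.0  # Zero is included near alpha=0 and excluded at alpha=1.
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if abs(estimate) > bounding_factor(mid) * se:
            hi = mid  # Interval excludes zero, so try a smaller alpha.
        else:
            lo = mid  # Interval still includes zero, so alpha must be larger.
    return hi


p = p_value_from_intervals(estimate=2.0, se=1.0, bounding_factor=normal_bounding_factor)
closed_form = 2 * (1 - NormalDist().cdf(2.0))
print(p, closed_form)  # The two values agree up to numerical precision.
```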

Conclusion

This guide explained the statistical methods LaunchDarkly applies to frequentist experiments. To learn about Bayesian statistical methods in LaunchDarkly, read Statistical methodology for Bayesian experiments.