Choosing a statistical methodology

Overview

This guide explains how to choose a statistical methodology for your LaunchDarkly experiments.

LaunchDarkly offers two different approaches for analyzing your experiment: Bayesian and frequentist. Within the frequentist approach, you can choose a traditional fixed-horizon analysis or a sequential analysis. The full menu of analysis types looks like this:

  • Bayesian
  • Frequentist
    • Fixed-horizon (t-test)
    • Sequential

All three analysis modes help you make an informed decision by measuring the differences between your variations and quantifying uncertainty. However, they differ in the details of how they achieve this. We discuss the pros and cons of each mode below.

How to begin choosing a methodology

If you just need a quick guide to get you started, we recommend answering the following simple questions:

Question 1: Bayesian or frequentist?

To begin, ask yourself:

  • Are you already familiar with frequentist concepts like p-values or confidence intervals, or
  • Do you care deeply about controlling your rate of false positives?

If so, then use a frequentist approach. Otherwise, a Bayesian analysis may feel more intuitive.

Question 2, if frequentist: Fixed-horizon or sequential?

If you choose a frequentist approach:

  • Are you comfortable performing a sample size calculation and/or do you have a reasonable guess as to the effect size you may encounter in this experiment?

If so, then a fixed-horizon analysis will offer the most statistical power at the cost of having to wait until you’ve reached your sample size before making a decision. Otherwise, the sequential approach is hassle-free and still relatively efficient for the vast majority of experiments.

Bayesian analysis

The differences between Bayesian and frequentist approaches are too wide-ranging to fully cover here, but in a nutshell, Bayesian analyses:

  • Incorporate subjective prior belief in your analysis, so the numbers on the results page do not reflect your observed data alone
  • Produce summaries of your A/B comparisons in terms of straightforward probabilities, for example “probability to beat baseline”
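
As a rough illustration of what a "probability to beat baseline" number means, the sketch below estimates it for a binary conversion metric by sampling from Beta posteriors. The uniform Beta(1, 1) prior, the function name, and the sampling approach are assumptions made for this example, not a description of how LaunchDarkly computes its results.

```python
import random


def probability_to_beat_baseline(conv_a, n_a, conv_b, n_b, draws=20000, seed=0):
    """Estimate P(rate_B > rate_A) by sampling from Beta posteriors.

    Assumes a uniform Beta(1, 1) prior on each conversion rate. This is an
    illustrative choice, not necessarily the prior LaunchDarkly uses.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for a binary metric: Beta(1 + conversions, 1 + non-conversions)
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        if rate_b > rate_a:
            wins += 1
    return wins / draws


# Hypothetical data: baseline converts 100 of 1,000, treatment 120 of 1,000.
print(probability_to_beat_baseline(100, 1000, 120, 1000))
```

With this hypothetical data the estimate lands at roughly 0.9, which you can read directly as "about a 90% chance that the treatment beats the baseline."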

For more information on how we compute your Bayesian results, read Statistical methodology for Bayesian experiments.

When to use Bayesian

Bayesian analysis is useful when:

  • You want easy-to-communicate summaries of the relative performance of your test that are simple to understand for non-technical stakeholders, especially when sample sizes are small and the signal is likely to be relatively weak

Bayesian analyses do have a few downsides, though:

  • Bayesian analyses do not control for false positives. If you run many A/A tests, you might encounter a relatively high number of seemingly “significant” results, meaning results with a high “probability to be best.”
  • The influence of a prior distribution can sometimes result in unintuitive numbers. For example, a binary metric with 0 conversions out of 100 exposures may nonetheless show a posterior mean that is strictly positive. This would be technically correct from a Bayesian perspective, but potentially difficult to explain to a non-technical audience.
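
The prior's pull on a result with zero conversions can be worked through in a few lines. This sketch assumes a Beta(1, 1) prior purely for illustration. The prior LaunchDarkly actually uses may differ, and so would the exact number.

```python
def posterior_mean(conversions, exposures, prior_alpha=1.0, prior_beta=1.0):
    """Posterior mean of a conversion rate under a Beta(prior_alpha, prior_beta) prior.

    The default Beta(1, 1) prior is an assumption made for this example.
    """
    return (prior_alpha + conversions) / (prior_alpha + prior_beta + exposures)


def observed_rate(conversions, exposures):
    """Raw conversion rate, with no prior involved."""
    return conversions / exposures


print(posterior_mean(0, 100))  # 1/102, about 0.0098: strictly positive
print(observed_rate(0, 100))   # 0.0
```

Under this assumed prior, 0 conversions out of 100 exposures yields a posterior mean just under 1%, even though the observed rate is exactly 0%.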

In general, we recommend Bayesian as an option for users who may often run up against smaller sample sizes or who value the ability to report results in terms of simple probabilities.

Frequentist analysis

Frequentist analyses are distinguished from Bayesian analyses in many ways, but a few that you might notice immediately include the fact that frequentist analyses:

  • Use p-values. They quantify statistical significance by comparing what is observed with what you would expect to observe if the control and treatment were no different from each other.
  • Are more objective. Frequentist techniques do not permit the injection of subjective prior belief into the analysis.
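
To make the p-value idea concrete, here is a minimal sketch of a two-sided test on two conversion rates. It uses a pooled two-proportion z-test as a simple stand-in. The function name and the choice of test are assumptions for this example, and LaunchDarkly's own frequentist computation may differ in its details.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates.

    Uses a pooled two-proportion z-test, chosen here for simplicity.
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Under the null hypothesis both variations share one conversion rate.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 1.0  # no variation in the data, so no evidence of a difference
    z = (rate_b - rate_a) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical data: control converts 100 of 1,000, treatment 120 of 1,000.
print(two_proportion_p_value(100, 1000, 120, 1000))
```

For this hypothetical data the p-value comes out at roughly 0.15. That is the probability of seeing a difference at least this large if the two variations truly performed the same, not the probability that the treatment is better.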

For more information on how we compute your frequentist results, read Statistical methodology for frequentist experiments.

When to use frequentist

Frequentist analyses are advantageous when:

  • You prefer strong statistical guarantees. Frequentist tests guarantee that false positives are controlled at a specific level (the “significance level”), and this level is easily configurable by the user. Furthermore, the power of a test is well-defined in a frequentist setting, so you can get good estimates of the expected run time prior to starting an experiment.
  • You prefer objective results. Frequentist results do not use priors, so results will not be adjusted away from your observed data. For example, a binary metric with 0 conversions out of 100 exposures will always report a mean of 0.0%, not some other positive value.

Frequentist analysis does come with some downsides:

  • p-values and confidence intervals are less intuitive than their Bayesian counterparts and may be misinterpreted. A p-value is not the probability that the treatment is better than the control. This is a common beginner mistake that can easily lead to erroneous conclusions.
  • Low sample sizes are more challenging to deal with. With little data, results often fail to reach statistical significance, and the resulting high p-values require more nuance to interpret. Unlike in Bayesian analysis, where a weak signal like “60% chance to beat control” is easy to understand, a weak p-value of, say, 0.30 is more difficult to interpret intuitively.

In general, we recommend frequentist techniques when you expect to have a reasonable amount of data, or you prefer to have stronger statistical guarantees at the expense of slightly more complicated interpretation.

Fixed-horizon versus sequential

LaunchDarkly offers two different types of frequentist analysis:

  • Fixed-horizon (t-test)
  • Sequential

These two types have the same results experience: the numbers produced by each method can be interpreted in the exact same way. However, they are intended to be used differently:

  • In the fixed-horizon method, you must calculate a sample size ahead of time, and only take action on your test when you reach that sample size. You may look at the results ahead of time, but you should not make decisions until your pre-determined sample size is hit.
  • In the sequential method, you can take action on your results at any time.
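
If you have not done a sample size calculation before, the sketch below shows one common textbook approximation for a binary metric. The function name, the default significance level and power, and the normal-approximation formula are assumptions for this example. They are not taken from LaunchDarkly's product, so treat the output as a rough planning number.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variation(baseline_rate, expected_rate, alpha=0.05, power=0.80):
    """Approximate users needed per variation for a two-sided test on conversion rates.

    Uses the standard normal-approximation formula for comparing two proportions.
    """
    effect = expected_rate - baseline_rate
    if effect == 0:
        raise ValueError("expected_rate must differ from baseline_rate")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    variance = (baseline_rate * (1 - baseline_rate)
                + expected_rate * (1 - expected_rate))
    return ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


# Hypothetical plan: 10% baseline conversion, hoping to detect a lift to 12%.
print(sample_size_per_variation(0.10, 0.12))
```

With these hypothetical inputs the estimate is roughly 3,800 users per variation. Because the required sample size scales with the inverse square of the effect, halving the expected lift roughly quadruples the number of users you need, which is why a reasonable guess at the effect size matters so much for fixed-horizon tests.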

However, the sequential method does pay a price for the luxury of being able to check your results at any time. The results are generally more conservative and therefore could result in the test taking longer to achieve significance, all else being equal.
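
To see why a fixed-horizon test should not be acted on early, and why a method that allows continuous checking has to be more conservative, you can simulate A/A tests and compare two decision rules. The sketch below is illustrative only: the z-test, the look schedule, and all names and parameters are assumptions for this example, and it does not implement LaunchDarkly's sequential method.

```python
import random
from math import sqrt
from statistics import NormalDist


def z_test_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_aa_false_positives(runs=400, users_per_arm=2000, look_every=200,
                                rate=0.10, alpha=0.05, seed=0):
    """Return (false-positive rate when peeking, false-positive rate at the final look).

    Both arms share the same true conversion rate, so every "significant"
    result is a false positive. users_per_arm should be a multiple of
    look_every so that the last look is the planned final analysis.
    """
    rng = random.Random(seed)
    peeking_hits = 0
    final_hits = 0
    for _ in range(runs):
        conv_a = conv_b = 0
        significant_at_any_look = False
        p_value = 1.0
        for n in range(1, users_per_arm + 1):
            conv_a += rng.random() < rate
            conv_b += rng.random() < rate
            if n % look_every == 0:
                p_value = z_test_p_value(conv_a, n, conv_b, n)
                if p_value < alpha:
                    significant_at_any_look = True
        peeking_hits += significant_at_any_look
        final_hits += p_value < alpha  # decision made only at the last look
    return peeking_hits / runs, final_hits / runs


peeking_rate, final_rate = simulate_aa_false_positives()
print(f"stop at first significant look: {peeking_rate:.3f}")
print(f"decide only at the planned sample size: {final_rate:.3f}")
```

In this setup, deciding only at the planned sample size keeps the false-positive rate near the configured 5%, while acting on the first significant look pushes it well above that. A sequential method has to compensate for the repeated looks, and that compensation is what makes its results more conservative.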

On the flip side, if the measured effect on your test is much larger than you anticipated, then sequential testing affords you the option to stop the test early rather than waiting all the way until you reach your pre-computed sample size in a fixed-horizon analysis. This may save you time over repeated experiments in the long run.

When to use fixed-horizon versus sequential

In short, the fixed-horizon method has higher statistical power provided that you have a good estimate of the effect size you’re after and are comfortable performing a sample size calculation. However, because this can be burdensome, we recommend:

  • Use sequential if:
    • Time to significance is not absolutely critical, or
    • You don’t have a good sense of what types of effect sizes you might encounter, or
    • You’re not comfortable or don’t want to deal with performing a sample size calculation
  • Otherwise, use fixed-horizon

For this reason, at LaunchDarkly we default the frequentist analysis to use sequential testing.

Conclusion

This guide explained how to choose a statistical methodology for your LaunchDarkly experiments, including whether to use fixed-horizon or sequential testing for your frequentist analysis. To learn more about LaunchDarkly experiments, read Experimentation.
