Multi-armed bandit algorithms (also referred to as k-armed or n-armed bandits) are a class of algorithms that guarantee you minimize regret when running experiments. The “arm” in multi-armed refers to variations. So an A/B test would be two-armed, and an A/B/C test would be three-armed. And what do we mean by “minimize regret”? One way to think about it is to contrast multi-armed bandits with a more traditional A/B test. 

With an A/B test you would typically give variation "A" 50% of traffic, give variation "B" the other half of the traffic, and then let them run for a week. What if it turns out revenue from B was significantly better than A? Let's say revenue was 10% better. At the end of the week, you now know that you've exposed half of your traffic to a variation that could have been 10% better. That means you missed out on an additional 5% increase in revenue over the week since half of your traffic could have been performing 10% better than it was.

Multi-armed bandits avoid this pitfall by learning at each moment which variation is most likely the best and giving it more traffic. So let’s say we ran the same variations through a multi-armed bandit algorithm, and by day two it looks like there’s a 60% chance that B is better. In that case, the multi-armed bandit algorithm would give variation B more traffic. How much more depends on the specific algorithm that you use, but all multi-armed bandit algorithms are meant to take advantage of the information available at each step to optimally allocate traffic as the algorithm learns more about the value of the variations.

To recap, multi-armed bandits allow you to minimize exposure of a bad variation to your users in a way that optimizes for a specific metric or combination of metrics. This is frequently a good approach for marketing site changes, where you want to expose the most users to the variation that has the highest conversions.

On the other hand, A/B testing allows you to gather much more data before you make a decision. You might want to do this when you’re just testing the market for some new feature and you’re not ready to expose many users.

Can I run multi-armed bandits in LaunchDarkly?

Yes. There are two ways you can do this. One is automated and one is not. To run a multi-armed bandit in an automated way, you will need some code for the multi-armed bandit algorithm that talks to LaunchDarkly to retrieve updated variation values and to set the traffic for the best performing variation. Here’s an example of an epsilon-greedy, multi-armed bandit algorithm that works with LaunchDarkly. You can run that every day, week, or whatever time interval you want to automatically update your experiment’s traffic.

While we do not support running a multi-armed bandit algorithm from within LaunchDarkly (yet), multi-armed bandit algorithms use similar statistical methods as traditional experimentation. So you can use LaunchDarkly’s experimentation tool to run a limited type of multi-armed bandit, which is essentially a non-automated multi-armed bandit, without any additional code. The only significant difference from the automated example above is that you would adjust the traffic manually. In fact, methodologies like this are shown to guarantee much better results than traditional A/B testing and we recommend using it when appropriate. Here’s how you could use it.

First, we’ll recommend making the changes to traffic at one week intervals. This is to avoid day-of-week effects reducing the benefits you get from reallocating traffic. For example, suppose variation B does better on Tuesday but variation A does better on Wednesday. If after Tuesday you gave more traffic to B you would be doing yourself a disservice. By making changes in one-week increments, those day-of-week effects are properly weighted into the algorithm.

With that in mind, here is the approach you should take to run a multi-armed bandit on LaunchDarkly, supposing you have an experiment with three variations:

  1. Give each variation 5% of traffic and let the other 85% go to the default.
  2. After one week, see which variation has the best metric value.
  3. Change the default to the variation which has the best metric value.
  4. Go back to Step 2 and continue this process until you feel comfortable you have identified the best variation.

    A good rule of thumb for identifying the best variation is when you find the same variation winning every week for four-straight weeks. However, in the case where there is no difference between two variations, you may see that they swap the top spots frequently. It’s okay, from a regret-minimization perspective, to just keep running these and swapping them. Although, practically, you may want to pick one at some point. If they really are the same value, a literal coin flip is as optimal a decision-making process for picking between them as anything else. Though if you have more information from outside the experiment, you should use that in making your choice rather than leaving it to complete chance. 

In fact, this is the same approach taken in the code mentioned above that implements an epsilon-greedy algorithm using the LaunchDarkly APIs. 

What kind of multi-armed bandit is this?

Steps one through four above describe a type of multi-armed bandit algorithm called an epsilon-greedy algorithm. With this algorithm, you have some percentage of traffic that is always exploring all of the options. In the case above, this is the 5% to each variation, and you have some percentage of traffic that is exploiting the best option, that is the 85% above.

What about the fact that it is not re-allocating traffic for every data point? It turns out that many multi-armed bandit algorithms in practice do not reallocate traffic for every data point. Sometimes this is due to the cost of computing the value of each variation after every new data point comes in, but often it is to account for the same day-of-week effects we account for above. So this algorithm actually looks just like an epsilon-greedy algorithm you might see in production systems at large companies.

There is another common multi-armed bandit algorithm that uses Thompson sampling. You could also implement that with LaunchDarkly’s APIs, but for now we’ll leave that as an exercise for the reader.

Notice that the process above is really not much more effort than running a traditional A/B test where you would be checking the results frequently anyway. The real difference here is that on a weekly basis, you simply adjust the default variation to take advantage of what you’ve learned so far, whereas in A/B testing you wait until the end of the experiment to act on that information.

Learn how we're making it easier to run smarter experiments with new traffic allocation rules in our recent product update.