Trajectory

Champion versus Challenger

Simon Herron, LaunchDarkly

There is a myth that experimentation, or A/B testing, is only useful for small, incremental changes, and that we are only here to mine for winners. This is not true. Whether we're trying to optimize the total visitors to an application or the value of the code we just deployed, if we take the time to understand and learn, then we stand to gain more than we lose. In the next 10 minutes, we'll explore four key fundamentals of effective experiment design.


Simon Herron

Simon is a Solutions Engineer in London, UK. He had been designing, building, running, and analysing experiments for well over a decade before he arrived at LaunchDarkly. When not Solutionizing or Experimenting, he can be found being a Dad to his four-year-old son or poking around on Stack Overflow.

(bright upbeat music) - Hi, as the description states, my name is Simon Herron, and I'm a Solutions Engineer on LaunchDarkly's London team. One of my responsibilities on the team is acting as the subject matter expert for experimentation, which kind of stands to reason, as I've been a practitioner of experimentation for just over 15 years now. What I'd like to do is explore experiment design with you for the next 10 minutes, and we'll try to dispel some myths in the process. Now, if I said to you that I could build a full-stack application because I had made my best friend a website using WordPress, I'm pretty confident I'd get laughed out of the room, or at best get a few very skeptical looks from you. That's how I feel when I talk to people about their experience with experiments: there's this common misconception that experimentation is plug and play. Yes, tools can make the process of setting up experiments easier or faster, but if you aren't designing or setting your tests up correctly, you're gonna have a bad time. Experimentation, like feature flagging, gives us the ability to explore risk in a controlled way. Where feature flagging gives us a means to isolate the code we release, experimentation helps tell us what we should deploy. These are two strategies out of the same engineering toolbox.

So the question is: what constitutes a robust test design? Now, everyone knows how to write a solid hypothesis, right? Sadly, that's not been my experience. An idea or question is not the same as a well-thought-out hypothesis, and ensuring that it's robust and as discrete as possible is crucial to getting what you want out of your experiment and not being led by your assumptions; by that, I mean not creating something that appeals solely to your own personal bias. There's also this peculiar idea that a hypothesis needs to be summarized in a single sentence. While I'd safely say a one-page hypothesis may be overkill, providing a degree of detail on where your thinking originates and how you came to your hypothesis is only gonna make your post-test analysis clearer. A single sentence isn't going to maximize the opportunity for learning and decision-making, the two foundations of why we experiment in the first place.

On the left-hand side of the screen, you can see some examples of poorly (or subjectively poorly) written hypotheses, and on the right, how we can bolster them, how we can make those particular hypotheses better by adding additional information. My advice would be to think of the initial setup as: if I do X to Y, I expect to see Z. If you take that as a very small formula and then build up around it, you're more likely to be successful when it comes to putting together a robust and constrained hypothesis.

So what are our takeaways for this? The core considerations for developing a watertight hypothesis. Be specific: the more specific you are about the experiment and the expectations you have, the easier it is to determine whether or not you have a real effect, and thus what you need to do next. It also has to be measurable: there's no point in having a hypothesis that isn't testable at all, so know that you have robust KPIs, Key Performance Indicators, to work towards, and review them regularly. I have customers that will review them quarterly, half-yearly or yearly; that depends on you. It also needs to spawn other hypotheses: if you can generate further hypotheses from the one that you have, it means you should be able to build your next theory from whatever results you're likely to get on the current one, oftentimes before your current experiment has even started. And most importantly, if I could summarize this all into one thing, it should create value no matter what the result of the test. Whatever the result, you should be able to get some degree of value from it because of the way that you have written your hypothesis.
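To make that "if I do X to Y, I expect to see Z" framing concrete, here is a minimal sketch of how a hypothesis might be written down before an experiment starts. The structure, field names and example content are illustrative assumptions for this sketch, not part of any LaunchDarkly feature.

# A minimal, illustrative template for recording a hypothesis before an
# experiment starts. Field names and example values are assumptions.
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    change: str            # the X: what we will do
    target: str            # the Y: where, or to whom, we will do it
    expected_effect: str   # the Z: what we expect to see, and by how much
    rationale: str         # where the thinking originated (data, research, prior tests)
    primary_metric: str
    guardrail_metrics: list = field(default_factory=list)

example = Hypothesis(
    change="Show delivery costs on the product page instead of at checkout",
    target="all new visitors who view a product page",
    expected_effect="checkout abandonment drops by at least 5% relative",
    rationale="Session recordings show users exiting when shipping costs first appear at checkout",
    primary_metric="checkout_completion_rate",
    guardrail_metrics=["average_order_value", "product_page_bounce_rate"],
)

Writing the rationale and guardrail metrics down alongside the expectation is what makes the post-test analysis clearer, whichever way the result goes.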

Sample sizing. We run our sample-size analysis during the planning stage of our ideation. Sample size informs the time it will take to run the test, the number of variations we can run, and whether or not the test should be a priority, or where it should fall in our prioritization. By not doing this, chances are you're going to run into a problem called peeking. Peeking is effectively where you call an experiment early because you think you have one thing when actually you have another. Data is really noisy, and often something that looks like a result early on, positive or negative, because your sample size isn't large enough, pivots and then changes rapidly, often in the other direction. You're also gonna see a lot of false positives. A false positive is where you think you can see a winning variation, or a variation that's moving in the right direction for your experiment, but you're seeing an effect where there isn't one, either because you're too close to the data that you want to see, or you're too far away from it and therefore think there is no effect at all. Now, imagine telling your boss that you can improve revenue by five times because you called your experiment early, only to find out a couple of weeks later, should you leave the experiment running, that the data has pivoted and is now basically saying you're gonna lose the same amount. Nobody wants to be in that situation. There's a line of thought that experiments should run to completion within two weeks. My experience is this generally stems from a peer wanting or expecting to be able to exert their deadline on a practice that they don't really understand. Sample sizing will provide you with an estimate of when that experiment will be safe to check. Timeboxing an experiment, or setting a duration without actually checking your data, is going to leave you with a false positive or inaccurate information. Artificial deadlines do not lend themselves to success at all.
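As an illustration of what a sample-size analysis can look like in practice, here is a minimal sketch using the standard two-proportion power calculation. The baseline conversion rate, the lift we want to detect and the daily traffic figure are all made-up numbers for the example, not guidance from the talk.

# Minimal sample-size sketch for a control-vs-treatment experiment on a
# conversion rate, using the normal approximation. All inputs are illustrative.
import math
from scipy.stats import norm

def sample_size_per_variation(baseline_rate, minimum_detectable_lift,
                              alpha=0.05, power=0.80):
    """Visitors needed in each variation to detect a relative lift."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + minimum_detectable_lift)
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided significance threshold
    z_beta = norm.ppf(power)            # desired statistical power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil(((z_alpha + z_beta) ** 2 * variance) / (p2 - p1) ** 2)

n = sample_size_per_variation(baseline_rate=0.04, minimum_detectable_lift=0.10)
daily_visitors_per_variation = 2_000
print(f"{n} visitors per variation, "
      f"roughly {n / daily_visitors_per_variation:.0f} days per variation")

Running a calculation like this before the experiment starts is what gives you a defensible answer to "when will this be safe to check?", rather than an artificial two-week deadline.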

When setting up an experiment, the only thing that's actually within your control is the number of treatments, or variations, that you want to run. If you have the traffic, you should definitely try to run more than just a single additional variant. So rather than just A and B, try C, D and E. The worst thing, I think, about A/B testing is the fact that people call it A/B testing, so they automatically assume it's just a control plus one. Now, why is this such a problem? Well, think about it: the more shots you take, the higher the chance you'll hit the bullseye. More variations means a higher chance of discovering something valuable. More variants get you to think harder, be more creative and take more risks, and as you should know, risk is generally where the good stuff is.

Measurement is hard, and there are many pitfalls in this area. First and foremost, once again, only measuring a single metric for an experiment is far from ideal. Why? You're reducing the chance of discovery, as you're looking for an answer in only one place, on only one dimension. If you increase or decrease this primary metric, there's a considerable chance that you're affecting other metrics in the process, but how do you know? Choosing the entirely wrong measure for the test is also a common problem, and if you only have a single metric that you're measuring on, and that's the wrong one, you've effectively scrapped a week or more of work and you have to rerun your experiment to recapture that data. Look at your business KPIs and work backwards from there. What do I mean by that?
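One practical way to sanity-check "if you have the traffic" is to extend the earlier sample-size sketch: the total traffic an experiment needs grows with the number of variations, so the same numbers tell you whether C, D and E are affordable. The per-variation sample size and traffic figures below are again invented for illustration.

# Rough duration check: more variations need more total traffic. The
# per-variation sample size would come from a power calculation like the one
# sketched earlier; here it is hard-coded as an illustrative assumption.
import math

def estimated_duration_days(n_per_variation, num_variations, daily_eligible_visitors):
    """Days until every variation (including control) reaches its required sample."""
    total_needed = n_per_variation * num_variations
    return math.ceil(total_needed / daily_eligible_visitors)

n_per_variation = 40_000          # e.g. from the earlier power calculation
daily_eligible_visitors = 10_000  # traffic actually entering the experiment

for variations in (2, 3, 5):      # A/B, A/B/C, A/B/C/D/E
    days = estimated_duration_days(n_per_variation, variations, daily_eligible_visitors)
    print(f"{variations} variations: ~{days} days")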

Let's take this example: think about the smaller measures that contribute to your North Star metrics, your primary metrics. An easy example of this that I've chosen is e-commerce, and I chose it primarily because we all shop, so it's a little easier to consume. What is revenue? It's a product of the number of customers and the average customer spend. The number of customers is a product of user acquisition and user retention, and average customer spend is a product of the number of items per customer and the average cost of an item. And so each branch breaks down. Now, if we think about those: if I can increase my customer retention, then my revenue increases; if I can increase the number of items per customer, then my revenue will increase. By breaking your KPIs down into more digestible contributors, or micro-conversions, we can better map out and focus our attention on the things that matter.

Now, what should we take away from this? Well, you can add as many metrics as you like to an experiment. It's best to think about multiple metrics, likely up to about seven, somewhere between seven and ten, perhaps. While I did say that more measures means a higher chance of learning something, too many metrics can be overwhelming and lead you to information paralysis. Bear in mind that experiments are supposed to give you some degree of guidance on a decision, and if you spend too long on the analysis, chances are you may well have missed your opportunity to capitalize on whatever it is you've gotten from your experiment.

So as we move closer to the summary, let's just think about what experimentation is. It is, at its core, about learning; it isn't about wins and losses. I'll say that again: experimentation, at its core, isn't about wins and losses. Running experiments is about gathering information, analyzing and gathering insights, and then acting on those insights. Everything within an experiment design should drive towards a clear outcome, and towards the next iteration of that test, before your current test has started running. Think about questions such as: what happens if I get a positive effect? What do I do then? What happens if it's negative? Do I stop it, do I wait? What can I take away from that information? Most importantly, what happens if the experiment amounts to nothing? What if it is a statistical tie? How will I handle that situation? What will I tell my peers? There is value to be had in all of those scenarios, but it's about writing clear, specific hypotheses, knowing exactly what you're going to do in terms of reacting to the outcomes of that hypothesis, and, of course, choosing the right metrics to measure those outcomes.
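Returning to the e-commerce decomposition above, a minimal sketch of that metric tree shows how a lift in one micro-conversion propagates up to revenue. All of the figures here are invented for illustration.

# Illustrative metric tree for the e-commerce example: revenue broken down
# into the micro-conversions described above. All numbers are invented.
def revenue(new_customers, retention_rate, items_per_customer, avg_item_cost):
    customers = new_customers * retention_rate           # acquisition x retention
    avg_customer_spend = items_per_customer * avg_item_cost
    return customers * avg_customer_spend

baseline = revenue(new_customers=10_000, retention_rate=0.30,
                   items_per_customer=2.5, avg_item_cost=40.0)

# A 10% relative lift in retention, everything else unchanged,
# flows straight through to a 10% lift in revenue.
with_lift = revenue(new_customers=10_000, retention_rate=0.33,
                    items_per_customer=2.5, avg_item_cost=40.0)

print(f"baseline revenue: {baseline:,.0f}")
print(f"with retention lift: {with_lift:,.0f} (+{with_lift / baseline - 1:.0%})")

Mapping the tree out like this is what lets you experiment on a micro-conversion (retention, items per customer) while still reporting the effect in terms of the North Star metric.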

Okay, so to wrap up: the takeaways from my presentation are foundational, and I hope they provide some degree of insight into how a practitioner sets themselves up for success. I would like to leave you with one last thought, though. It's my opinion that engineers are effectively magicians in business. If you want to do something magical in today's age, then it's likely an engineer who makes it happen; they build impossible things from words and syntax. Just think about that for a second. If that's the case, then the possibilities for software experimentation are pretty limitless. Thank you. (bright upbeat music)