Michael Gillett, Solutions Architect, Win Technologies

How Betway Tests in Production: Hypothesis Driven Development

Betway has been following a "test in production" approach to building software for a few years. They test in production for two primary reasons: to validate business hypotheses and gain confidence in technical implementations. In this session, Michael will share tips, experiences, and lessons learned from testing in production.

Michael Gillett

Michael is a Solutions Architect at Betway and has worked across a number of the web teams at the company from Bingo lobbies to Sports homepages. He is also a Microsoft Windows Insider MVP and fan of all things Star Wars.

(bright music) - Hi, I'm Michael Gillett, and I work at Win Technologies as a Solution Architect. And today I'm gonna be talking about how we do testing in production at Betway and how we use hypothesis-driven development to do that. So I'm gonna start off by just talking a very little bit about what hypothesis-driven development is, and it is the idea that software is designed and implemented based off of hypotheses. But I question whether it should really be called hypothesis-driven engineering. The reason for that is development might be seen as more of a craft. Things are unique and bespoke, and a lot of software is certainly treated in that way. However, when you start introducing the idea of doing things from hypotheses, there's an element of a scientific approach there and some form of quality, and checks, and some repeatability, and therefore it feels more like engineering. But I could talk about that for a while, and we don't really have time for that. So we'll have a look at that in just a minute. Then I'll move on to what we've found with doing testing in production and what hypothesis-driven engineering has enabled us to do. And then building on that experience, I'm gonna talk about a new hypothesis framework that we've started to adopt at Betway, and how that's looking, and what are the things that we do within that framework, and how does it help us? So looking at hypothesis-driven engineering, the first and probably the most obvious is a business hypothesis. So this here is about validating an idea. Someone somewhere has come up with an idea that they think would be better than our current status on the website. And so what we need to do is prove, is that idea actually better? And the obvious one here is to do an A/B test where we create a hypothesis, we then do some work to test that hypothesis out. We can split traffic between the variations. Normally, it's kind of a 50-50 A/B test, but obviously it's possible to have multi-variations as well.
We log all of the information we possibly can that we think is gonna be relevant to that experiment and to either confirm or disprove that hypothesis. We then can analyze that data, understand whether that hypothesis was correct or not, and then form some recommendations off of that. And that allows us then really to understand whether that hypothesis was correct and whether we should look to really implement that feature. And I wanna just talk a little bit about one that we've done on our site, and quite a high-profile one 'cause this was our home page. And it's quite a busy home page, and that had worked for a while, but we started to wonder whether there were some issues here. And the reason being is these three links here are all actually trying to drive the user to the same place which is ultimately to register on our site as a sports user. But we've got three places on our home page to do that within the top half of the page. There's no scrolling going on here, but there's even other calls to action here for registering users. And it was quite noisy, and so a hypothesis was devised: if we just have a single link on our home page, that would result in more registrations, fewer distractions, much more focused on what we're trying to get the users to do on that page. That led us to create a brand-new UI for our home page which you can see is a lot cleaner, a lot simpler. Far fewer visual distractions. And so now we have done the work involved in order to actually test this out, which of these two home pages would convert better? So we're in a real A/B test here. So the way in which we can do this in production is we implemented a redirect on our old home page. So if any traffic went to Betway.com, we could, based on whatever rules we liked, redirect the user to our new design for our home page. And we initially started that with just being turned on for our devs and QAs who were building that new home page.
We wanted to make sure that it was working in our production environment. And we could do that from when we started with very little through to being ready to go live with that. We could constantly send individuals to our new home page. Once the devs and QAs were happy with that, we could then turn this on for key members of our business and key teams. Obviously, changing the home page is a pretty big deal. There's a lot of interest in making sure that the functionality, and the look, and SEO, and all of these kinds of things are considered and they are correct. And so once we had turned this redirect on for those teams and those individuals and got their sign off, we were then in a position to start rolling this out to get us to 50-50. We didn't wanna go to 50-50 straight away. So what we did initially, we just rolled this out to 5% of our UK and English language users. We wanted to gather some information from those users. Obviously not directly, but with the logs that we have we could gather information, understand what those users were experiencing, and hopefully they were experiencing a lovely experience. And once we were happy that they were, we then rolled this out to 25% of our UK and English language users. Again, this is about just making sure that everything works as we're expecting. Obviously, the more users we roll this out to, the more variations in device types, browsers, software versions, network speeds, all of this stuff kind of gets factored in. And then eventually we got to a 50-50 split between our old and new pages. It is at this point we could then run our experiment for a predefined amount of time and until we reach statistical significance. Now we're into the real A/B experimentation element of this. And what we found with this new home page was it was actually 25% better for our successful registration rate which is a considerable improvement over what we had before. And so what that means is that hypothesis was proven.
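As an illustration of what "reaching statistical significance" can mean for a 50-50 split like the one described above, here is a minimal sketch of a two-proportion z-test. The talk does not say which statistical test Betway used, so the choice of test is an assumption, and the visitor and registration counts below are made up purely to show a 25% relative lift.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both pages convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


def relative_lift(conv_a, n_a, conv_b, n_b):
    """Relative improvement of variation B over variation A."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    return (rate_b - rate_a) / rate_a


# Hypothetical numbers: old page 400 registrations from 10,000 visitors,
# new page 500 registrations from 10,000 visitors.
z, p = two_proportion_z_test(400, 10_000, 500, 10_000)
lift = relative_lift(400, 10_000, 500, 10_000)
print(f"lift={lift:.0%} z={z:.2f} p={p:.4f}")
```

A common convention is to treat a p-value below 0.05 as significant, but the threshold is something to agree on before the experiment starts, alongside the success metric.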
Awesome, really, really good, nice example of how an A/B test can work. But there are a few things that come along with the way in which we can do testing in production. So there are some nice technical accomplishments that we achieved by doing this in this manner. And the nice thing here is we have no rollbacks, and normally changing our home pages can be quite a big bang thing. You're changing your home page. But what we found was we had no rollbacks. And the reason for that is because we could target users as and when needed. So that initial 5% rollout, if we weren't happy with some of the things that we'd been seeing in our logs, we could put that rollout back down to 0% of our public. We have no urgent pressure to fix something or to roll back because we've just stopped our public getting to the new home page. So it's a really nice, safe way of doing that. We had no critical alerts, nothing kind of lit up in our alerting systems. And again, it was because it's this nice, safe way of going live with this. And a really nice thing was we found that all exceptions were decreasing over time. Now it might seem odd that we had any exceptions, but we were tracking the exceptions in a graph like this. And what this shows was each of these is a... Each bar here is a different day, but the colored sections of each bar is actually the numbers of exceptions for the different browsers on mobile devices. And as you can see, it decreases over time. And we could see this on a near real-time basis, and that allowed us to make very, very quick decisions around whether we wanted to continue to roll out or whether we wanted to actually revert back down to a lower kind of blast radius of the number of users who are experiencing this. And as you can see over the course of a few weeks, we actually got it down to zero browser exceptions. Now some of these browsers were older, on mobile devices we weren't able to test so easily in the office.
All these kinds of things, but it's a really nice, safe way of knowing that what we're rolling out isn't completely blowing up for our users, and we can make judgment calls as to whether we keep our rollout to that level or whether we start bringing that down a bit. So a really, really good way of being able to roll things out. So that was business hypothesis. Now we're shifting gears a little bit and moving on to technical hypotheses which we actually have seen in my previous example. Now the idea with our technical hypothesis is that we wanna validate an idea. Sorry, an implementation for an idea. So the idea being here is maybe we've refactored something. We've improved some caching. We've changed something that might impact performance. Well, we don't wanna just go live 100% with that because if we've really badly overlooked something, we might have some real problems for our users. So what we can do here and it's what we saw in our previous example, is we can target these roll-outs to certain groups or small percentages, so 5% an obvious one. We could target a particular country and just roll out a new feature to users within that country. We can target users on a certain device type, maybe mobile, or we can start doing more complex things. So we could start factoring in the domain that the user is on. And even whether they've previously logged in. Maybe we've changed something in the way that we handle returning users to our website. And now we wanna check that that new thing isn't going to break or offer a degraded performance and experience to our users. So we can now roll out features in very safe ways. It's different from a business hypothesis because we don't necessarily have a view on whether we're gonna test whether this is better or worse and ultimately decide to go with A or B. Rather what we wanna do here is just validate that we haven't broken things or made them worse. And if we have, then we'll rework it and go live. 
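The targeting described above (a small percentage, a country, a device type, a domain, returning users) can be sketched as a rule evaluated per user. This is a minimal illustration, not the API of any particular feature-flag product: the `User` and `Rule` types, the field names, and the flag name are all hypothetical. The hash-based bucket is an assumption about how a percentage rollout can be kept stable, so that a user who was in at 5% is still in at 25%.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class User:
    key: str          # stable identifier, e.g. a device ID or username
    country: str
    device: str
    domain: str
    returning: bool


@dataclass(frozen=True)
class Rule:
    percentage: int                     # share of matching users, 0 to 100
    country: Optional[str] = None       # None means "any"
    device: Optional[str] = None
    domain: Optional[str] = None
    returning: Optional[bool] = None


def bucket(flag: str, user_key: str) -> int:
    """Map a user to a stable bucket in the range 0..99 for this flag."""
    digest = hashlib.sha256(f"{flag}:{user_key}".encode()).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag: str, user: User, rule: Rule) -> bool:
    """True if the user matches every targeting attribute and falls in the rollout."""
    for attr in ("country", "device", "domain", "returning"):
        wanted = getattr(rule, attr)
        if wanted is not None and getattr(user, attr) != wanted:
            return False
    return bucket(flag, user.key) < rule.percentage


user = User(key="device-123", country="GB", device="mobile",
            domain="example.com", returning=False)
print(is_enabled("new-home-page", user, Rule(percentage=5, country="GB")))
```

Setting `percentage` to 0 turns the feature off for everyone without a deployment, which is the "put the rollout back down to 0%" move described earlier in the talk.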
But really it's about we know we're gonna end up with this being at 100%. We just need to do this in a very safe way. Then there's another element to this which wasn't immediately apparent to us, but we've adopted it for some things, and it's been really effective. And it's what we're calling kind of technical testing. And the idea here is once you're in a position where you can test business hypotheses and technical hypotheses, then you are in a... You're well placed to actually kind of emulate lots of different scenarios in your production environment, but doing so in a way that doesn't necessarily impact your production environment. So what we were doing here with some of our client applications, they make calls to downstream systems. But when we've got big sporting events on, we wanna know that our front end is gonna stand up. We also need to know that the backend and downstream systems will stand up, but we might wanna test each one of those systems separately to understand how they're gonna work. And so what we could do is we could load test our front end application but pass through a custom header, or a cookie value, or query string value. And eventually that would then mean it would hit a code block where we're using a feature toggle, and we can now actually determine whether we should be hitting our real production downstream systems or hit a mock system. And that allows us to emulate browsers and different platforms, different devices, different network speeds in our actual production environment with our production stuff, but it doesn't impact our downstream systems. And that's been really good at checking things like autoscaling or triggering as we need to make sure that our apps will stand up to the kind of sudden influx of users we might get for certain sporting events. So that's kind of what it looks like for us to do hypothesis-driven development or engineering. 
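The load-testing idea above, where a custom header steers a request to a mock downstream system behind a feature toggle, might look roughly like this. The header name `X-Load-Test`, the function names, and the shape of the downstream clients are all hypothetical; the talk only says that a custom header, cookie, or query string value is used.

```python
from typing import Callable, Mapping

LOAD_TEST_HEADER = "X-Load-Test"  # hypothetical header name


def is_load_test(headers: Mapping[str, str]) -> bool:
    """Detect the load-test marker, ignoring header-name case."""
    normalised = {name.lower(): value for name, value in headers.items()}
    return normalised.get(LOAD_TEST_HEADER.lower()) == "1"


def fetch_odds(
    event_id: str,
    headers: Mapping[str, str],
    mock_toggle_enabled: bool,
    real_client: Callable[[str], dict],
    mock_client: Callable[[str], dict],
) -> dict:
    """Call the mock downstream only when the toggle is on AND the request is marked."""
    use_mock = mock_toggle_enabled and is_load_test(headers)
    client = mock_client if use_mock else real_client
    return client(event_id)


# Stand-in clients for illustration only.
def real(event_id: str) -> dict:
    return {"event": event_id, "source": "real"}


def mock(event_id: str) -> dict:
    return {"event": event_id, "source": "mock"}


print(fetch_odds("match-1", {"x-load-test": "1"}, True, real, mock))
```

Requiring both the toggle and the marker means ordinary users can never be routed to the mock by accident, and turning the toggle off disables the mock path entirely.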
But there's been a lot that we've learned along the way, and I wanna share some of that experience with you as pointers and tips. And just to get you thinking about things 'cause once you're able to do this, there's a lot of things that we've learned that I think are really valuable for others to know as well. So one thing that we've done on a couple of teams and for a couple of our products is to adopt trunk-based development. So the idea here being is obviously if you've got lots of environments from your local and you have to go through those environments to get to production, that can actually be quite time consuming. A lot of effort and resource can be spent on making sure that those test environments are like production. There's no point in signing something off on a test environment if it isn't like production. It's fairly ineffective at doing that. And so what we can do with trunk-based development is we can develop something on a local machine and then push it to production within a toggle. Now that thing could just be literally a one-line code change and it can just be that toggle, but that toggle only gets turned on for that developer, or that QA, or maybe a product owner, or something. It's in the production environment. And now that developer continues to refine and improve that feature over time. But all they're doing really is pushing to the production environment to validate that that's where it's going to work. And it is a very effective way. We can even then, once we get close to a feature being ready, just turn that feature on for key stakeholders, get their sign off, get their feedback about it. And it's a very safe way of developing features knowing that you're developing against the environment it's ultimately gonna run on, very efficient and effective way of doing work. Another thing that we've learned is it's very important to track a device and not a session.
And what I mean by this is what you don't want, certainly with the home page experiment from earlier, you don't want a user to come back to the home page on a couple of occasions and see different home pages. We wanna offer the same consistent experience to that user whether they're in variation A or B. So they might clear their session. They might clear their cookies. So if you can track on a device, it does allow you to keep offering that same experience as a user kind of moves around over time. And we actually then adopted that as a shared cookie which can be passed around all of our applications. And so if there is a toggle that actually spans multiple apps or experiences, then we can make sure that that user or that device is being treated the same which is really good. And once the user's logged in, then we'll just resort to using the username to track them which is obviously a far better way of doing it. This is a new thing that we've started doing. And it's probably something I could talk about in depth for quite some time, but there isn't time for that now. But what I just wanna talk about here is obviously with automation tests, you wanna test many of your scenarios that a user might go through just to check whether it's working as expected or not. When you start introducing feature toggles to those experiences, you start quite quickly increasing the number of potential journeys or experiences the user might have. And when toggles end up being coupled together or you end up with multiple ones in a similar journey, you do have an exponential increase. So we've adopted within some of our brand-new systems, we've actually introduced the idea of automated automation tests. The idea being here that the code is gonna evaluate how many possible variations a user could experience through this application, through this experience. And it will programmatically figure out all of the automation tests that it needs to run.
So a lot less manual work needs to be done to run our automation suite which is really, really powerful for us. And we can now be confident that certainly in some of our mission-critical applications, we can run every possible scenario of all of the LaunchDarkly variations working together and be confident that it works every time. The next thing is less technical, but it's something that I'm calling scheduling complacency. It's probably not the right word, but the idea here being is if there's a number of teams that can do testing in production in this way, what we found was some teams might now work a little bit slower when working in a project because they are aware that another team can release stuff to production and just keep it turned off. So there isn't a real urgency perhaps for the other team now to race to meet them to a certain deadline. But that can result in a lot of wasted time and effort. If another team kind of cracks on with the work, and gets it to production, and then sits there for weeks until the other teams have caught up, that's a really fairly inefficient way of actually working. And so it's the idea that just because you can do testing in production, you need to make sure that teams are still kind of working towards similar deadlines and goals. Otherwise there can be these massive discrepancies between when work is actually going live which is pretty wasteful. This is an interesting one, and it makes use of permanent toggles which we try not to do. But the idea here being is once you're in a position to do a lot of testing in production, you can actually do interesting things that isn't really around testing, but is making use of the infrastructure you've now got. So we've got a debug mode on some of our apps where we ship both minified and unminified JavaScript. The debugging mode is off for users by default. They'll always get the minified version of the JavaScript file. 
But if we've got some problems and we want a dev to check out what might be going on with a more verbose, unminified version of JavaScript, well, we can turn it on for that user.
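A per-user debug toggle of the kind described above could be as small as the following sketch. The file names `app.js` and `app.min.js`, the log levels, and the idea of holding debug users in a set are all assumptions made for illustration.

```python
from typing import Dict, Set


def client_config(user_key: str, debug_users: Set[str]) -> Dict[str, str]:
    """Pick the script bundle and log level for one user."""
    debug = user_key in debug_users
    return {
        # Unminified bundle only for users with debug mode switched on.
        "script": "app.js" if debug else "app.min.js",
        # More verbose logging for the same users.
        "log_level": "debug" if debug else "warn",
    }


debug_users = {"user-42"}  # hypothetical: users a dev has flagged
print(client_config("user-42", debug_users))
print(client_config("user-7", debug_users))
```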

It could also be that a customer calls our call center and is having a problem. The call center could turn on debug mode for that particular user, and now we start getting a lot more logs coming through for just that one user. So it's a nice way of being able to kind of leverage the infrastructure of testing in production to better understand what a particular user might be experiencing or a particular bug that might be present. And then moving on to the actual experimentation with doing testing in production, it's really important to understand what success actually is because most people will have a slightly or maybe very different opinion of what success is gonna look like for an experiment. And it might not be clear when someone comes up with a hypothesis what success really looks like. And so it's very important that everyone involved understands what the goal is. So that needs to be set and understood, and the success metrics need to be understood. What number is it that is ultimately gonna determine if this was a success or a failure? And it's interesting, we've had some very good conversations around what does success look like for different types of hypotheses? And it's not often that we found that everyone is on the same page at the beginning. And it's important to do that before you run the experiment. Otherwise, you run the experiment, and then there are people questioning whether you actually even looked at the right stuff. And it's important to make sure that you're all on the same page before you kind of spend any time running an experiment. And then off the back of that as well it's also very important to understand the sample. Who are you actually gonna run this experiment on? It might be the obvious 50-50 A/B test, but maybe the person who's suggested the hypothesis thinks it's a fairly high risk. And they might actually think it's better just to target 20% of users and compare it against the 80% which is interesting.
And certainly conversations we've had have presented that in the past. But then you can target on all kinds of stuff that we've already spoken about here. What about the device type? What about the country, what about subdomains? All this kind of stuff, that needs to come out. You need to understand where is the experiment gonna be run? And make sure everyone who's involved is happy and understanding of that. And then this is a really important one that we potentially have only properly understood in the last couple of months. But the idea here being is you don't wanna approach a hypothesis with a particular result in mind. The idea being that if people think that the hypothesis is going to be successful, then we probably shouldn't bother running it as a hypothesis because if we think we're right and we are right every time, then you don't need to go through the extra overhead of doing it as a hypothesis with an experiment. But we know we're not always right. So we shouldn't assume that every hypothesis would be successful. However, that is exactly what we have been doing. And the way that we know that is if we found a hypothesis after we've run the experiment to be successful, we wouldn't ask any more questions. We would have a look at our data, go, "Yep, that's correct," and put it live. If however it hadn't passed our success metric, we would do a lot of questioning. Why isn't it passing? Was there a marketing campaign? Was there a particular event on in a country that has kind of skewed the data one way or another? And we had all of these questions, but they were only ever asked when a hypothesis was unsuccessful. And that told us that we had a massive bias, and only the successful ones did we ever think would actually... We thought all hypotheses would be successful which is a really bad way of doing it. It also meant that we weren't doing the minimum amount of work possible in order to prove the hypothesis. 
'Cause we always thought they'd be successful, so we actually baked in quality and did a more elaborate solution than is needed to just validate whether the hypothesis was successful or not. So a very interesting one to be aware of. And kind of following that and some reevaluation of how we're working with hypotheses, that has then led us to the hypothesis framework. And I wanna just share in the last few minutes of this talk what that framework looks like for us and how we're using it. So there are six steps to it. The first one being well, someone needs to submit a hypothesis. That looks like a form. We used to have a form like this. We've actually modernized it a little bit more, and it is now an automated form system. But the kinds of things that we're asking for here is we wanna understand the problem space that the person's coming from with their hypothesis. We wanna understand what the hypothesis is and what are the acceptance criteria of this piece of work and of this hypothesis? We wanna understand what the success looks like. Again, coming back to that feedback that we've learned of everyone needs to be on the same page and understand what does success and failure look like? And we need to know the sample. So we get this all up front now, and everyone can see this in this document and understand what does this hypothesis really mean? What does it look like? How are we gonna test it? From there, it then goes to our product team who will review that hypothesis. The idea here being we may have had similar hypotheses like this in the past that we have experimented with. And whilst we aren't ruling this out, maybe we will batch them up. Or maybe it's just not something we're ever gonna entertain. So product will review it. If our product team is happy with that hypothesis, it then comes to the dev team, the QA team to analyze that hypothesis and understand what is the simplest way of testing that hypothesis?
Once we have understood that, we can then go and implement that simplest way, and it could be as simple as adding an image of a button and recording the number of times that users click that button. And maybe we show a popup or something to the user to just say oh, this feature is coming soon. But we've started to gather insight as to whether that feature is even what the users want before we've even thought about designing and building the actual feature. Once we've done that simplest piece of work, we can then run and perform the experiment based on the sample and the kind of the test that we want to run. And once we've performed the experiment, after a certain period of time or once we've reached statistical significance, whatever it is that is the success metric, we can then analyze the experiment. And in the analysis, we'll conclude whether the hypothesis is correct or not, and we can form some recommendations. The recommendation could be this feature needs to be reworked now to be production quality and put live as soon as possible. Or it could be this was unsuccessful, but maybe the sample was wrong. Maybe the way that we approached it was wrong, or maybe we shouldn't reevaluate this for another six months because we don't see a reason why this is gonna become successful if we kind of rejig the sample or anything. So that's our framework that we're using. It's been defined in the last few months, and we're slowly expanding this out to more areas of the business to make use of, but we're finding it really, really useful. We've tried to automate it as well. We're a Microsoft tech house, so we're using the Office Forms system as our kind of actual hypothesis specification. And then that creates, once that's submitted, it creates a new Teams channel, and we keep the person who requested the hypothesis, we keep them up to date throughout the whole process. They can see very clearly what we're doing.
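To make the six steps and the submission form a little more concrete, here is a minimal sketch of a hypothesis record that moves through the framework in order. The field names follow what the form is described as asking for, but the class names, step names, and the example values are hypothetical; the real process runs through forms and Teams channels rather than code like this.

```python
from dataclasses import dataclass
from enum import Enum


class Step(Enum):
    SUBMITTED = 1            # someone submits the hypothesis form
    PRODUCT_REVIEW = 2       # product team reviews it
    ANALYSIS = 3             # dev and QA find the simplest way to test it
    IMPLEMENTATION = 4       # build that simplest thing
    EXPERIMENT = 5           # run it against the agreed sample
    EXPERIMENT_ANALYSIS = 6  # conclude and form recommendations


@dataclass
class Hypothesis:
    problem_space: str
    statement: str
    acceptance_criteria: str
    success_metric: str
    sample: str
    step: Step = Step.SUBMITTED

    def advance(self) -> Step:
        """Move to the next step, refusing to go past the final one."""
        if self.step is Step.EXPERIMENT_ANALYSIS:
            raise ValueError("already at the final step")
        self.step = Step(self.step.value + 1)
        return self.step


# Illustrative values only, loosely based on the home page example.
h = Hypothesis(
    problem_space="Home page has several competing registration links",
    statement="A single registration link will increase registrations",
    acceptance_criteria="New home page shows one call to action",
    success_metric="Successful registration rate",
    sample="50-50 split of UK and English language users",
)
print(h.advance())
```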
So it just leaves me to say that I think it's really important to use hypothesis-driven engineering. It makes things so much safer, so much easier. It takes a lot of pressure off, and it starts informing and baking in quality into the products that we build. So it's really valuable. And when you're able to do hypothesis-driven engineering, it's quite important to define a hypothesis process. And I hope some of the things that I've spoken around kind of give you some idea of what that process might look like. And it's probably different in each company, but I think it's important to have. And so finally, thank you very much for listening, (bright music) and I'll be answering questions in the chat.
