Testing Infrastructure Changes in Production
Mitigating risk with feature toggles, canary tests, and A/B tests
Rosemary Wang, a Developer Advocate at HashiCorp, presented at the November Test in Production Meetup in Berlin, Germany. She outlined some of her own experiences with testing infrastructure changes in production. She detailed the many ways in which she has seen teams use feature flags, canary tests, and A/B tests to control the blast radius when making network updates.
"...feature toggles or feature flags tell [you] if something's 'on' or 'off', right? And the thing about feature toggles in infrastructure that I realized is that 'on' or 'off' is a state. You want to preserve the state of your infrastructure."
Watch Rosemary's full talk below to learn from her experience. If you're interested in joining us at a future Meetup, you can sign up here.
Rosemary Wang: Okay, wow. All right, I'm the last one. I have this weird mic on because I have to have two hands for the keyboard, so if it disrupts things, I'm so sorry. As Tom mentioned, I'm Rosemary. I'm from HashiCorp. You may have used Vagrant, Packer, Terraform, Consul, Nomad, or Vault; those are all the open source tools under the HashiCorp umbrella. Today, I'm going to show a bit of Terraform, but it's going to be more high level in some ways. It's not going to be too much Terraform. So for those who don't know it, it's okay. There will be AWS, so please, Frankfurt, don't go down. Okay.
All right, so, I was a network engineer, and I tested in production a lot. It was a bit scary. The first time I had to make a major network change was for software-defined networking within my organization, and I did a BGP route wrong, and everything went to a black hole somewhere. I don't actually know where the production traffic went, but one of the other network engineers had to come in and debug production, there were 100 people on the call, and I was embarrassed. Yeah, that was my first major change to a network. But the point was that I learned I basically had to live dangerously, and I always had to test in production. And the reason why I always tested in production for infrastructure was because the users were developers, right? Everybody needed to push and deliver code at all times of the day. So no matter what I did, I had to make sure that a developer could push code, plain and simple, and that could be 24/7.
I work at HashiCorp. We're based out of San Francisco, but a lot of us are remote, so if something goes down and we can't push code, that means PRs don't get merged, so it's pretty disruptive. And the availability of applications depended on the system that I was helping to run, which is hard. It's not exactly the easiest thing to make changes when infrastructure kind of underpins everything else, when everything is built on infrastructure, right? So, I kind of sat there and puzzled over it for years, right? Which is, to answer this question, how do you change infrastructure without impacting applications? And what I would do is I'd go to my favorite search engine, and I'd look for changing infrastructure in production. And of course nothing came up, right? Nothing said, "You feature toggle your network this way." No one said, "You A/B test your network," either. So I was like, "All right, I'm going to learn some of these techniques from some developers that I know, and maybe apply them back to infrastructure."
And when I started to learn the techniques and start to think about them, someone threw a twist at me. In the past two years, it has always been, "How do we change infrastructure without impacting applications?" And then an addendum from someone else, "While not spending extra money on staging?" So literally, I was making changes in production because of a cost reason, and not necessarily knowing what was going to happen. So without the luxury of staging infrastructure, I couldn't really do anything. It's not like we could mock the entire network, right? And then just decide, "Oh, it's great. Now we can just move it to production." We had to do everything in production to save money.
So, I was like, "All right, we can't use our shift left testing approach," right? We can't test everything before it goes to production anymore. It's just not practical, it's just too much money. So then I was like, "All right, let's try some feature toggling, let's try some canary testing. Let's try some A/B testing. Let's see if we can take some of the techniques we know about in the software development space, and apply them to as low level networking and infrastructure as we can." So, I'm going to do all three. Some of them may go well, some of them may not go well. So we'll see what happens.
All right, so feature toggles. Someone mentioned before, feature toggles or feature flags tell if something's on or off, right? If a feature's on or off. And the thing about feature toggles in infrastructure that I realized is that on or off is a state, right? You want to preserve the state of your infrastructure. Yes, you're pushing changes, but the idea is you want to preserve that change to a certain degree. You don't want it to go everywhere. So, if you can preserve the state as much as possible, that's a good feature toggle. You want to inject with a roll forward mindset. This is actually important in Terraform. When you run a terraform apply, it just applies it forward. So just because you toggle it off doesn't mean you reverted the change, it means you had to roll forward and that also changes state. So it's a little bit weird in that respect, right?
The other thing that I learned, where I made a big mistake, was don't write toggles in Terraform or in any infrastructure as code from the start, because it became completely unreadable. There's no nice way that you can abstract it into a library and say, "Toggle here, toggle there." It was sort of written in the code, even with any of the YAML based tools, or any of the other kind of deployment manager infrastructure as code tools. So, what does this kind of look like in Terraform? Well, for those of you who are familiar with Terraform, this is an aws_subnet. I use the subnet because I did networking, right? And what was a common thing to do was optimize our network addressing, because it was a waste if we had a bunch of big networks and no one was using them. So we had to narrow them down, and making changes to them wasn't that easy, because the changes would create completely new networks.
So how do you do this, right? So we were like, "All right." We throw in a count, which is in Terraform, right? You count, you say, "If it is true, then create two, otherwise create one." One means the original, two means the two new networks, right? Yes, it's blue-green, it's not perfect. But the problem with networks, right, is that you can't actually alter the subnet address space very easily. So what we said was, "All right, we're going to do this, and we're going to insert this 'enable new network' toggle." And so if you actually want to do this in Terraform, the thing I like to do is I like to actually create a toggles variable file, and this just separates the toggles ... I'll increase that size. It separates the toggles away from your variables in Terraform, right? Because you don't want your toggles to live forever.
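The count-based toggle described above might look roughly like the following. This is a minimal sketch, not the configuration shown in the talk: the variable name `enable_new_network`, the resource names, the file name in the comment, and the CIDR ranges are all assumptions made for illustration, and the sketch keeps both subnets in a single VPC for brevity.

```hcl
# toggles.tf (hypothetical file name): toggles are kept apart from ordinary
# variables so they are easy to find and delete once the change has settled.
variable "enable_new_network" {
  description = "Temporary toggle: also create the narrower replacement subnet."
  type        = bool
  default     = false
}

# Assumed address ranges, for illustration only.
locals {
  subnet_cidrs = [
    "10.0.0.0/20",  # index 0: original, oversized subnet
    "10.0.16.0/24", # index 1: new, narrower subnet
  ]
}

resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}

resource "aws_subnet" "app" {
  # Toggle on: two subnets (original plus new). Toggle off: original only.
  count      = var.enable_new_network ? 2 : 1
  vpc_id     = aws_vpc.main.id
  cidr_block = local.subnet_cidrs[count.index]
}
```

Setting `enable_new_network = true` in a variables file and running `terraform apply` adds the subnet at index 1 and leaves the original untouched. Setting it back to `false` is another apply that destroys the new subnet, so turning the toggle off is still a roll forward that changes state.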
So here is just enable new network. Done. And let's say I want to create a new network, right? I'll just add it to my Terraform variable files, and I'll go to the [inaudible 00:35:23]. Sorry y'all. And so I'll just enable it true. That's it. And I'll just apply it, and it will ask me for my password in order to apply it to AWS, so Frankfurt, don't fail me now. All right, so what this is doing is it's refreshing existing states. So remember the feature toggle, you try to preserve as much state as possible, right? So all this is doing is adding nine of these things. It will add a new VPC, right? We have to kind of blue-green it again, but the interesting part about this new network is it has a narrower subnet range, right? So the idea is that when it's got a narrower subnet range, we want to migrate instances from the larger subnet to the smaller one, and that way we can tear down the old one and we can reclaim some address spaces. This is actually really common in switches, network switches and when you decide to just roll everything on premise.
So then this thing is going to create. You'll also notice I snuck in a little thing, which we'll actually proceed to next, which is the canary. So after this is done, I'll just let it run ... after this is done, what it will actually create is a lovely little canary instance. So this little instance is there as a smoke test before release. The way it goes, actually, as a longtime QA told me, is that if you plug it into the device and it starts smoking, you don't release it. So, this is what a canary is. So I've put an instance in my new network; if it smokes, or there's something wrong, I know I can't release the network and add new instances to it. So I smoke test before release, and then I actually just say, "All right, let me migrate some of the instances and we'll see how it goes," right? We'll do it slowly. It's just actually really cool. You could do it pretty easily with container architectures, and I'll go through a brief Kubernetes example for those who like Kubernetes.
So, canary tests, right? Canaries, I actually just do this toggle so I just say enable the new network, and then that's that. Right? And it will create this instance there ready for me to test, and when I'm ready to test I like to use Kitchen for networks, mostly because it creates kind of a test instance that will trigger and go back to canary, and it will actually call back and say, "Are you smoking, canary?" And then if the canary is smoking, then we all know something's wrong. So, Kitchen for those who have not seen it before, it's basically a quick instance that you can spin up anywhere, and it'll test back.
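A canary instance tied to the same toggle could be sketched as below. This is a standalone illustration under stated assumptions, not the demo's code: the resource names, the `canary_ami_id` variable, the instance type, and the CIDR ranges are hypothetical, and the `one()` function used in the output requires Terraform 0.15 or later.

```hcl
variable "enable_new_network" {
  type    = bool
  default = false
}

# Hypothetical input: an AMI to boot the canary from.
variable "canary_ami_id" {
  type = string
}

# The new network only exists while the toggle is on.
resource "aws_vpc" "new" {
  count      = var.enable_new_network ? 1 : 0
  cidr_block = "10.1.0.0/24" # assumed range
}

resource "aws_subnet" "new" {
  count      = var.enable_new_network ? 1 : 0
  vpc_id     = aws_vpc.new[0].id
  cidr_block = "10.1.0.0/26" # assumed, narrower range
}

# The canary: one small instance placed in the new subnet before any
# application instances are migrated there.
resource "aws_instance" "canary" {
  count         = var.enable_new_network ? 1 : 0
  ami           = var.canary_ami_id
  instance_type = "t3.micro"
  subnet_id     = aws_subnet.new[0].id

  tags = {
    Name = "network-canary"
  }
}

# Address a test harness can probe; null while the toggle is off.
output "canary_private_ip" {
  value = one(aws_instance.canary[*].private_ip)
}
```

A test tool can then read `canary_private_ip` and check connectivity to and from the canary before anything else is moved into the new network.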
So what this looks like is a little bit like this. So I have an old VPC, with all the apps on it, and then I don't release the new VPC until I know that the canary is good. So I spin up an instance, check the canary, check that outbound's good, inbound's good, all the other connectivity and peering are good. And then I can move the instances over and tear down the old one. Right, seems pretty simple. And it's actually this kind of nice line of testing, right? You're actually testing a network before you release it, which is unheard of sometimes, because sometimes you just can't test your networks.
So, in Kubernetes, if you were to do this, this is the fun part. This is for when you're managing your own nodes, not the control plane; sometimes in some of the managed services you have to roll out your own images. Let's say there's a patch that you need to apply to the underlying operating system. Let's say you have a Kubernetes node group that's insecure, and you've got all your Kubernetes apps on that. You create a new node group, and you can actually add a taint. So you just kubectl taint the nodes, the new nodes that you've created, and if the apps are internal, you can let them tolerate the taint so they schedule on the new node group. But if they're external, you keep them on the old one. Right? And so this is your canary, right? The internal applications when we did this had no idea that they were on a new node group, and then they would report, "Hey, there's something wrong." And then we'd roll back, right? We'd migrate them all back to the old node group, the one with the insecure operating system. So, that's just a quick, quick rundown on canary for Kubernetes.
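The toleration side of this can be sketched in Terraform with the Kubernetes provider's `kubernetes_deployment` resource. Everything specific here is an assumption for illustration: the taint key and value (`canary=internal`), the app name, and the image are hypothetical, and the talk does not say how the workloads were actually deployed.

```hcl
# Assumed to have been applied to the new nodes beforehand, for example:
#   kubectl taint nodes <new-node-name> canary=internal:NoSchedule

resource "kubernetes_deployment" "internal_app" {
  metadata {
    name = "internal-app" # hypothetical internal workload
  }

  spec {
    replicas = 2

    selector {
      match_labels = {
        app = "internal-app"
      }
    }

    template {
      metadata {
        labels = {
          app = "internal-app"
        }
      }

      spec {
        # Lets these pods schedule onto nodes carrying the canary taint.
        # External apps have no such toleration, so they stay off the new nodes.
        toleration {
          key      = "canary"
          operator = "Equal"
          value    = "internal"
          effect   = "NoSchedule"
        }

        container {
          name  = "app"
          image = "example.com/internal-app:1.0.0" # placeholder image
        }
      }
    }
  }
}
```

A toleration only permits scheduling on the tainted nodes; it does not force it. To make sure the internal apps actually land on the new node group, a node selector or node affinity on a node group label would be needed as well.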
All right, A/B testing. Like I said, I'm trying to keep this fast so you can get to beer. So, A/B testing, it's really interesting because this isn't something that I hear a lot in the infrastructure space, but I think it's actually really important. A/B testing is the idea that you release to two separate sets of users, unbeknownst to them, and you measure from a business value standpoint which one kind of gains more business value. So, if one set got this new feature and all of a sudden you're seeing more revenue from it, then you're like, "Oh, cool. That's the feature I want to go for." With A/B testing and infrastructure, it's not really that obvious, right? It's a little non-obvious because you're not able to fully map business value from a user standpoint, all the way to a developer, all the way down to an infrastructure standpoint.
So, what I sort of thought about that was, well, infrastructure does affect upstream service level objectives, right? And so there are certain hypotheses that if you make changes to infrastructure, they do affect how much business value you gain at the end of the day. So, these hypotheses are something like, "Does X batch process more quickly than Y?" The reason why I say this is that if you have a certain window of time, plus or minus a certain percentage, in which you're supposed to process the transaction before your user runs off and says, "No, I just want something else instead," you want to process it really quickly. So, "Does X architecture process a batch more quickly than Y?" And the other one, which I've heard most for the past two years, is, "Does X cost more than Y?"
So, this is actually a really fun example. I'm very sad I'm not allowed to show data behind this one, so I have to give you the theoretical, the hypothetical. So, there was a situation where there was a Data Lake, and we organically built a bunch of functions to retrieve stuff from this Data Lake, pulling the data, morphing it, transforming it, and then pushing it to a Kafka cluster. Applications would subscribe to the topics, right? And they would just pull stuff from Kafka to do processing. Great, awesome. Well, the problem was that we would get the bill at the end of the month, and the functions were expensive, right? Because we didn't optimize the functions, that's the first problem, and you're charged based on runtime, and we were spending a lot. I'm not allowed to say the number, but a lot of them.
So, the whole question was, "What if we move this to Spark?" The idea is that if we moved it to something that is made to transform data and we manage it ourselves, or we get someone else to manage it for us, would this be less costly than what we have now? So we're like, "All right, we're going to A/B test the applications." The applications have no clue, right, which Kafka stream they're processing from, if it's from the FaaS or the Spark one. So, we'll see which one is actually processing faster and how, in what conditions.
So with this hypothesis in mind, it was really interesting. What we found was a lot of the applications had pretty much the same behavior, right? They were all pulling in pretty much the same way, and there was a nice little inflection point ... the lambdas were actually cheaper to a certain degree. But, during certain user behaviors, the Spark one, the Spark architecture, actually ended up costing less, which was really bizarre, right? And the hint is that it happened during certain e-commerce windows. So, if that gives you an idea, right? At certain times when people buy a lot, the Spark architecture actually was cheaper over time. So we were like, "Well this is a weird conclusion," right? It's a very weird thing because we're saying, "Well, business value is lower cost, so what do we do?" And we actually ended up splitting, cutting over depending on the time of year, so that's actually the solution to it. But it was a really interesting test, and that was the closest we could get to an A/B test, right?
Anyway, the whole point of all these three approaches, really short, is that it really helps organize infrastructure blast radius, right? We didn't really know what we were expecting, and this forced us to think about it. We were like, "Oh, when we A/B test it this way, we understand the blast radius is probably this." And I think what really helped me, because I came from living dangerously and being blamed for a BGP routing accident, right, was I moved away from risk aversion to risk mitigation. Right? These are techniques that you say, "I'm going to mitigate the risk of making a change that will fail." And the other thing that I learned was that “Infrastructure-as-Code” is heuristic. Even when you look at Terraform, even when you look at these tools, it's close, but it's not ideal. Right? None of these software practices map one to one perfectly with infrastructure, so you kind of get as close as you can, and in some ways you get pretty close. You get enough that you get some really good practices out of it, and a better understanding of what your infrastructure's doing.
Anyway, if you want to talk more about Terraform, problems you're having with it, Vault, any of the other HashiCorp products, feel free to always reach out to me, and if you would like these slides, they're on that link right there. Thanks.