Testing Configs in Production

Jordan talks at TIP Meetup

At our Test in Production Meetup in September we heard from TR Jordan from Slack. He spoke about how his team from Turbine Labs tests configs in production.

“I spend my days rolling out Envoy in larger environments, and that’s mostly an exercise in writing a ton of config. Rolling it out to a huge number of servers…there’s some hard-learned lessons there, we ended up in a lot of conversations about how to safely roll out configuration, and how to really understand the behavior of this and build confidence in configuration files before going to production.”

Watch Jordan’s talk below to learn more about how his team tries to get configs as code, phases rollouts to prevent downtime, and moves what they can into machine-driven states so they can focus on important config changes. If you’re interested in joining us at a future Meetup, you can sign up here.

TRANSCRIPT:

Hey everyone, I’m TR Jordan, head of marketing and a bunch of other stuff at Turbine Labs, new startup life. And tonight, I want to talk about testing configuration in production. So I think there’s a lot of interesting things to talk about here. First, why should you trust me versus anyone else? So at Turbine Labs, we essentially sell a service mesh based on Envoy. So a lot of what I spend my days doing is working with customers, rolling out Envoy in larger environments, and that’s mostly an exercise in writing a ton of config, which is fine. And then rolling it out to a huge number of servers, which is mostly fine depending on how you do it. So there’s some hard-learned lessons there, and we ended up in a lot of conversations about how to safely roll out configuration, and how to really understand the behavior of this and build confidence in configuration files before going to production.

So configuration: I think there’s a broad spectrum of it. Really, anything that is going to influence the behavior of your application and is declarative is something I’d lump under configuration. It could be as simple as what resources this application needs to perform, whether it’s memory, or a number of processes for a Rails app, or EC2 instances, or any sort of configuration around where this pod should be placed in Kubernetes. It could be how to connect to various other parts of the application: where’s my database, where are my dependencies? Or it could be a direct configuration of the application’s behavior, like feature flags. And all three of these types of configuration tend to behave in a couple of different ways versus something like code.

So really, the bulk of this talk is going to be about how configuration is different from code, what that means when you’re trying to build confidence in it, how you actually take that to production, and what’s special about that approach. So there’s three things that we’ll talk about. Configs tend to be deeply repetitive, they vary by environment, and they are not always driven by human changes. So I think before we dive in: testing in production, I feel like explaining it is a little bit sign of the cross. I should say something about what I mean just to make sure we’re all on the same page. How many people have read Cindy Sridharan’s excellent article on testing in production?

Cool. Enough people. It’s fantastic. You should definitely check it out, and this is shamelessly stolen from that. And really, it’s not just about testing in production, but also pre-production testing. There’s a huge number of tools and techniques available to people who are trying to build confidence in their code as it marches from an editor on someone’s laptop to usage at scale. I think the thing to keep in mind is that it’s easy to get overwhelmed with all the different tools. If you have something very simple, especially in configuration land, it can be easy to say, “Well, this configuration only affects a small part of my environment. It’s simple, it’s easy to look at.” In that case, you don’t need a lot. This talk is not about that configuration. This is about the 7,000-line Nginx file that handled a million requests a second at the edge of a major retailer.

That sort of configuration is incredibly dangerous because it looks a little bit like simple configuration. So there’s a few techniques that we can use to try and attack this complexity, break it down, and make it something manageable to use. So config tends to be deeply repetitive. Why is configuration so deeply repetitive? I was talking with a company last week, and they literally told me that the Nginx configuration for their edge servers is 7,000 lines in one file that they push down, and that seemed fine most of the time, except when it wasn’t fine. And then it was very not fine. If you’re deep into microservices, a similar sort of configuration around how traffic gets between services can get really confusing, because if I have 500 services, 150 of them are written in Go, and 75 are written in Python 2 and Erlang. All the ones that are written in Go should probably look about the same, and how they communicate with each other should look about the same, but they don’t always have to look the same.

And that’s where you end up with a lot of repetition because 145 of those 150 Go services all need to run on two instances, except for the old monolith which runs on 1500. And that can be really challenging because when someone decides to update that, it’s easy to miss. This kind of config generates a lot of diffs that look like the same plus and minus line repeated 75 times, which is great if there’s only 75 instances. If there’s 76, then you’ve missed something. It can be easy to break. So how do we test this kind of configuration, how do we get our arms around it and make it easy to work with? We have techniques for testing code in production. Let’s try and make this look more like code. So for example, one of the stories that Lyft likes to tell about how they work with Envoy is their key insight around how clients and servers in microservices should interact with each other is that clients should not control how they talk to a server.

A server should control how clients talk to them because then if you have a behavior like requests that are over 500 milliseconds are just broken and should time out, and I’m never going to meet my SLA. If one client implements that behavior, that’s wonderful, except that there’s a whole bunch of other clients that don’t. So if you take the configuration, which has all 1700 clients, or all 50 clients, and you force everyone to update constantly, that’s prone to error. It’s going to be very, very difficult. By turning it into code, creating a DSL or using templates, you can actually build this into your build process, and you can build an artifact that looks a lot like whatever you’re pushing out to production. You can use all the same tools that you do now, either around phased releases or quick rollbacks in order to make things a little bit more robust.
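A minimal sketch of the templating idea, in Python. This is not code from the talk: the stanza syntax, the `GO_DEFAULTS` and `OVERRIDES` tables, and the function names are all hypothetical, and the numbers just echo the examples above (two instances for most services, 1500 for the monolith, a 500 millisecond timeout). The point is only that shared settings are written once and the exceptions are listed explicitly, so the rendered file becomes a build artifact rather than something edited by hand.

```python
from string import Template

# Hypothetical defaults shared by every Go service (values are illustrative).
GO_DEFAULTS = {"instances": 2, "timeout_ms": 500}

# Only the exceptions are written out by hand.
OVERRIDES = {"monolith": {"instances": 1500}}

# Made-up stanza syntax; a real system would target Nginx, Envoy, etc.
STANZA = Template(
    "service $name { instances $instances; timeout ${timeout_ms}ms; }"
)


def render_service(name, defaults=GO_DEFAULTS, overrides=OVERRIDES):
    """Merge shared defaults with per-service overrides and render one stanza."""
    settings = {**defaults, **overrides.get(name, {})}
    return STANZA.substitute(name=name, **settings)


def render_all(names):
    """Render the whole file in a stable order so diffs stay small."""
    return "\n".join(render_service(n) for n in sorted(names))
```

Changing the timeout for every service is then a one-line edit to `GO_DEFAULTS`, and the output of `render_all` can be linted, diffed, and released like any other build output.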

But you can’t turn everything into a little DSL by trading complexity back and forth. One of the big problems with configuration is it actually deliberately varies by environment. So if you have a piece of code that says, “If I’m in development, do the right thing. If I’m in production, do the wrong thing.” That will almost always fail and you will never get it right until it goes to production. And configuration can be this way. If your database is on localhost, cool. In staging, it moves a little farther away. And in production, it moves to a totally different place. You can’t test the configuration of the production database without connecting to the production database, so you’re forced to. And until you try it, you won’t notice that hosted DB has three D’s in it, and that’s not gonna work at all.

So if you have this kind of variation, which a lot of configuration does, the primary approach here is to phase the rollout. Not every client needs to have the same configuration, and I think that there’s been some wonderful work done in tools like Consul and the default behavior in Kubernetes, which encourages operators and engineers to put all of the configuration into one place, and it’s centralized and it’s easy to maintain and use, but it also means that everything is a big switch. Everything requires a full fleet change by default. By rolling out to servers incrementally, again, trying to treat your configuration a little bit more like code, you get this sense of being able to deploy a small amount and then look at your metrics for just that service.
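A phased rollout of this kind can be sketched roughly as follows. This is an illustration under assumptions, not anything shown in the talk: `apply_config` and `is_healthy` are hypothetical hooks the caller supplies (for example, pushing a file and checking latency for one server), and the phase percentages are arbitrary.

```python
def phased_rollout(servers, apply_config, is_healthy, phases=(1, 10, 100)):
    """Push config to growing percentages of the fleet, halting on bad metrics.

    apply_config(server) and is_healthy(server) are caller-supplied hooks
    (hypothetical); phases are cumulative percentages of the fleet.
    Returns (succeeded, servers_updated).
    """
    updated = []
    for percent in phases:
        # Always touch at least one server, even for a tiny fleet.
        target = max(1, len(servers) * percent // 100)
        for server in servers[len(updated):target]:
            apply_config(server)
            updated.append(server)
        # Look at metrics for just the servers changed so far.
        if not all(is_healthy(s) for s in updated):
            return False, updated
    return True, updated
```

When the function returns `False`, the second value is exactly the set of servers that need to be rolled back, which stays small if the problem shows up in an early phase.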

A story from one of my friends, and I think I can just say this is Amazon because it was like 15 years ago: they were deploying a new database. They were splitting a couple of data stores up, and they needed to change all of the clients so that, instead of using one giant database, they used the split-up data in A and A prime. The work had gone well on this, and they’d written all the code. It was behind branches and feature flags and it was all good, and they started to push this out to production, and they figured they’d make a config change. Okay, let’s make sure that all the traffic goes to the new database. At the same time, without telling them, the database team had moved the home of that data from the West Coast to the East Coast, and every machine that they turned on went from being able to serve requests in 200 milliseconds to five seconds because it was making 100 trips between the two coasts.

It didn’t take down Amazon because they didn’t have a problem across all of their servers. They only changed config on one server at a time. And because they were monitoring this based on how they were doing the rollout, it looked a lot like a standard “I’m gonna flip a feature flag, I’m going to drag a release slider or do a rolling deploy,” and it prevented a real outage. The downside of this, of course, is that if different parts of your application have different configs, now you have config drift. One of the things that we strongly recommend when deploying any sort of meaningful config is to make that a top-line metric. Make config drift, and config split brain, something that you look for actively. Most of the time, you should only have one version of config in production, and you should ask clients to report back what version they have, and basically graph how many unique versions you see. If you see two, that means someone’s doing a rollout. If you see three, maybe you’re doing a lot of rollouts. If you see five or six, something is broken.
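The drift metric described above could be computed along these lines. This is a sketch with assumed names: `reported_versions` stands in for whatever mechanism clients use to report their config version, and the status labels are made up. The thresholds follow the rule of thumb in the talk; the talk does not say what four versions means, so grouping it with three is an assumption.

```python
from collections import Counter


def config_drift(reported_versions):
    """Summarize which config versions clients say they are running.

    reported_versions maps a client id to the version string it reported.
    """
    counts = Counter(reported_versions.values())
    return {"unique_versions": len(counts), "by_version": dict(counts)}


def drift_status(unique_versions):
    """Map the unique-version count to a rough status label."""
    if unique_versions <= 1:
        return "steady"
    if unique_versions == 2:
        return "rollout in progress"
    if unique_versions <= 4:  # four is not covered in the talk; assumed
        return "many rollouts"
    return "investigate"
```

Graphing `unique_versions` over time gives the top-line number, and `by_version` shows how far along a rollout is.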

And then finally, the last bit of configuration, which is debatable, is that any operation can modify configs. Facebook has a fantastic post called Rapid Release at Massive Scale, and they’re Facebook, so they have unique problems, but there’s some interesting lessons in there. Their volume of incoming commits to the main code base is so high, you essentially can’t ask someone, “Can you look at this release and say it’s good?”, because there’s just too much going on. So what they do is constantly push commits in a stream into master, and then occasionally the build system picks up one version and auto-promotes it to employees only, then to a small number of live users, and then to all users.

In a lot of systems, if you were to sort of back this out and say, “Oh, that’s cool, I want this,” one place to start is, “I’m going to put in a configuration value, maybe in my deploy and release system, maybe in the code base itself. I’m going to put in some value that says how much traffic should go to the new version.” And the problem with this is now you have this enormous configuration surface area, and you’re suddenly tasked with monitoring what versions are out there, what percentage of teams or what percentage of users are seeing this, and whether anyone has changed anything recently. And you’ll see an enormous amount of churn, which basically makes anything that looks like a normal Git workflow feel really uncomfortable. Can you imagine reading a Git log that’s 99 percent “robot updated from 79 to 80 percent released,” and one percent “I pushed a change that rewrote the front end in React”?

It’s not a great experience, and it can be really hard to debug and really hard to understand what’s going on. So in this case, the primary goal in simplifying this is trying to separate what is human-driven, intentional config and what is application state. In Facebook’s case, the configuration is what levels I want to roll out to. Employees only is one. Two percent is another, 100 percent is the final one. Everything under that is just application state. It’s equivalent to how many users are in your database and how many of them are active or have their data being cleaned up. There’s a lot that you gain simply from saying, “I will only configure a certain number of operations, and I will only allow human control over these operations.” And it doesn’t all have to live in a Git repository, but it needs to be a deliberate configuration action. And with that, there comes this sort of simplicity of being able to say, “Most of the time, there’s no configuration change. Deploying constantly, it’s just what we do. Our monitoring takes care of doing rollbacks, and I get an alert on that.”
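One way to picture the split between human-driven config and machine-driven state is the sketch below. It is an assumption-laden illustration, not a description of Facebook’s system: the stage names, the `ReleaseState` class, and the choice to send an unhealthy build back to the first stage are all hypothetical. Only `STAGES` is something a person edits and reviews; everything else is state that automation advances.

```python
from dataclasses import dataclass

# Human-driven config: changes rarely and deliberately.
# Stage names and percentages are illustrative, echoing the talk's example.
STAGES = (("employees", 0), ("canary", 2), ("everyone", 100))


@dataclass
class ReleaseState:
    """Machine-driven state: advanced by automation, not by config commits."""
    build: str
    stage_index: int = 0

    @property
    def stage(self):
        return STAGES[self.stage_index]


def advance(state, healthy):
    """Promote a healthy build one stage; send an unhealthy one back to the start."""
    if not healthy:
        return ReleaseState(state.build, 0)
    return ReleaseState(state.build, min(state.stage_index + 1, len(STAGES) - 1))
```

With this shape, moving a build from one stage to the next never touches the config history, while changing the canary stage from two percent to five percent is a visible, reviewable edit to `STAGES`.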

And when someone changes the configuration to say, “Well, I think our canaries aren’t catching enough, we need to go from two percent to five percent canaries,” that’s a change that you watch closely and you watch the impact of, and it brings clarity to what needs to be paid attention to and where you need to actually build that confidence. So configuration as code. Yeah, it’s great. I almost put up a stock photo here, but I Googled configuration, I Googled “hopeless dream”, and the first thing was just a picture of someone’s dentures. Felt really uncomfortable doing that. Configuration as code is great because we have so many tools to test and build confidence in code in production, but it’s worth understanding that configuration has a unique set of features to it, and some of these features cannot be tested in the same way as code. So strip away as much of the complexity as possible, and you’re left with a much simpler system that you can build effective confidence in, because for a lot of configuration changes, there is nothing else other than testing in production.

So try and get to config as code. It’s great. Phase your rollouts in order to prevent downtime, and move everything you can into machine-driven state so you pay attention to the really important config changes, and you’ll be happier testing configuration in production. Thanks.