Feature Management for DevOps

Tim Wong talks about feature flags

At our September meeting of Test in Production, our own Tim Wong, LaunchDarkly’s Principal TAM and Chief of Staff, gave a talk about Feature Management for DevOps.

“You can get to a point where you are pushing configuration based on routes or other aspects of your instance. This is kind of poor man’s tracing in some ways or you can degrade functionality based on a particular route or a particular IP or a particular node set, particular method, particular account. You can provide different configuration sets for different types of things as long as you instrument them that way.”

Watch Tim’s talk below to learn more about ways you can manage operational risks with feature flagging. If you’re interested in joining us at a future Meetup, you can sign up here.

TRANSCRIPT:

I’m gonna talk about feature management for DevOps. That’s a bunch of really squishy words added together into a hopefully coherent sentence, maybe. When I wrote this talk, I was one of the Principal TAMs for LaunchDarkly. I don’t do that anymore, but I’m still gonna give this talk and it’s been converted a bit for this venue. So, hopefully I do it justice. So, who am I? I’ve been an engineer for 10 years. Before I was working at LaunchDarkly, I was working for a company that no one’s ever heard of called Atlassian. I spent a couple of years on the front lines and a lot of fires managed there. Come on, let’s play.

So, I’ve seen some things about how companies work, especially … my role at Atlassian was to kind of manage relationships between our teams and Fortune 100 companies, like people who were using our tools day in, day out and really needed as much availability as they could possibly want. So, what I hope to do in this talk is to talk through some of the new ways or ways that you can manage risk with feature flagging and hopefully stuff that isn’t obvious, hopefully not obvious. And, how this idea of using feature flags for nearly everything fits into the overall DevOps puzzle.

So the current state of feature flagging, it’s usually used by developers to control new customer visible changes or roll back if there’s a bug. That’s just really fancy ways of saying turn it on and turn it off. That’s not new, right? And, it’s not hard to understand and bonus points if you actually know what that is in the corner. That being said, the basic functionality of turn it on, turn it off is really freaking valuable. The idea that you can solve something like 90% of the possible failure modes shipped by toggling a flag that you can do in under a second, that’s great, that’s a dream, right? It’s pretty sweet, but there’s a ton of problems that you can’t actually solve that way.

Like, for example, there’s some things that you just need to redeploy for, like for example, configuration, you can’t … sometimes you can put your flags up, but not always or if you’re changing database schemas or changing the state of the world, you just can’t. And you taint your state, you taint whatever’s out there and the only way to recover from that type of problem is like here be dragons, right? It’s restore things and write new code and migration scripts and who knows, right? But, where we should be and where we think we should be, is that you can do more with flagging than turn it on, turn it off.

You can hopefully mitigate risk around some of those more difficult complex changes and do some operational things with feature flagging such that you can ensure availability or some proactive changes. And, I’m gonna try to talk through some of them. I have this whole section about what feature flagging is. I kind of hope everyone knows. I’m gonna blow through this. It’s a fancy if statement. I’m gonna skip these slides ’cause they’re really not that important, but basically, a feature flag is allowing you to choose your code path from outside of the runtime. Such that if you can turn the … do that top level NewFeatureEnabled thing and control that from outside the application you can get new behavior or old behavior, your choice, right? Without redeploy.
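The "fancy if statement" described above can be sketched roughly like this. It is a minimal illustration, not LaunchDarkly's SDK: the `FLAGS` dictionary, the `flag_enabled` helper, and the flag key `new-feature-enabled` are all hypothetical stand-ins for a flag service that lives outside the application.

```python
# Hypothetical flag store. In a real system these values would be fetched
# from a flag service or config store outside the running application.
FLAGS = {"new-feature-enabled": False}


def flag_enabled(key: str, default: bool = False) -> bool:
    """Look up a flag, falling back to a default if the flag is missing."""
    return FLAGS.get(key, default)


def old_behavior() -> str:
    return "old code path"


def new_behavior() -> str:
    return "new code path"


def handle_request() -> str:
    # The "fancy if statement": the flag picks the code path at runtime,
    # so changing the flag changes behavior without a redeploy.
    if flag_enabled("new-feature-enabled"):
        return new_behavior()
    return old_behavior()
```

Flipping `FLAGS["new-feature-enabled"]` to `True` switches `handle_request()` to the new path with no code change, which is the separation of deploy and release the talk goes on to describe.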

Anyway, I have even more builds in here that I forgot about, but it’s context-sensitive, dynamic configuration from outside the application. And it’s a way for you to separate what you deployed out there and when it’s available. Meaning that deploys are not free, they take time and if you ship something, that takes a little while to manifest and if you got 10,000 or 20,000 servers you can’t redeploy. I mean you can, but it’s gonna take a while, it’s not instant and meanwhile, you’re in this weird state. But, with feature flags, you can go to line speed or however fast your system is, right? To show that change. Deploying becomes just simply the act of putting something out there, whatever that is in your environment, and releasing is when you decide it to be.

So, I’m gonna talk about one of the ones that is kind of hairy, database migration and how do you do that. Well, I’m gonna come back to this slide because this is one of those things that’s like that two percent revert rollback, here be dragons thing. Like, good luck, right? Database migration, you need to change the schema, that’s not easy. So, how do you do it? Here’s the most basic way, you declare a maintenance window, you take down your app and you shut down. And, so you don’t taint your whatever it is, whatever the database is, and then, you do your thing and hope it worked. And, you turn it on and you pray, right? That’s how it works. How can you do this better?

Okay, well, we’re gonna get real complicated now. Here’s a feature flag way of doing it. You have two databases, you spin up another one. You have your old database and your new database, whatever it is, maybe it’s Mongo, maybe it’s … I don’t know, it doesn’t matter. Right? You push new application code out there that knows how to do dual writes and you have a flag controlling it. So, while that’s out there, you write a percentage of your data to your new data store. You gotta make sure that it can deal with the deluge of stuff you’re gonna put in it. What if it has some configuration problem that you didn’t foresee and it falls over? Well, so what? Fine, it falls over, that’s great. You have a new data point, no one’s hurt. You’re still reading from your old database.
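A flag-controlled dual write like the one described above might look something like this. It is only a sketch under assumptions: the two databases are plain dictionaries, the `dual-write-percent` flag and the `bucket`, `in_rollout`, and `save` helpers are hypothetical names, and hashing the record key is just one way to get a stable percentage rollout.

```python
import hashlib

# Hypothetical flag value: percentage of writes that also go to the new store.
FLAGS = {"dual-write-percent": 0}

# Stand-ins for the old and new databases.
old_db = {}
new_db = {}


def bucket(key: str) -> int:
    """Map a key to a stable bucket in 0..99 so the rollout is deterministic."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def in_rollout(key: str, percent: int) -> bool:
    """True if this key falls inside the current rollout percentage."""
    return bucket(key) < percent


def save(key: str, record: dict) -> None:
    # The old database stays the source of truth for every write.
    old_db[key] = record
    if in_rollout(key, FLAGS["dual-write-percent"]):
        try:
            new_db[key] = record
        except Exception:
            # If the new store falls over, nobody is hurt: reads still come
            # from the old database. A real system would log this failure.
            pass
```

Raising `dual-write-percent` from 0 toward 100 sends more and more writes to the new store while reads keep coming from the old one.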

As you get more confidence, you roll it up to 100%, great. Then you start reading the percentage back, right? And, this is another time that you get more data points about the robustness of your system, whether you can get enough throughput from your new system to satisfy the needs of the platform. You can then compare that data set that you’re getting out of your new database with the same old one that you’re getting live. Right? And then, you can make sure that you’re not getting corruption across your data set and eventually you get to a hundred percent. At this point, you can backfill.
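The read side of the migration, where a percentage of reads also hits the new database and gets compared with the old one, could be sketched as follows. The `read-new-percent` flag, the `load` helper, the `mismatches` list, and the sample records are hypothetical, and returning the old value on a mismatch is one possible policy, not necessarily what was done in the real migration.

```python
import hashlib

# Hypothetical flag value: percentage of reads that also hit the new store.
FLAGS = {"read-new-percent": 0}

# Stand-ins for the two databases; "user:2" has drifted in the new store.
old_db = {"user:1": {"name": "Ada"}, "user:2": {"name": "Grace"}}
new_db = {"user:1": {"name": "Ada"}, "user:2": {"name": "grace"}}

# Keys whose old and new values disagreed, kept for later inspection.
mismatches = []


def bucket(key: str) -> int:
    """Map a key to a stable bucket in 0..99."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def load(key: str):
    old_value = old_db.get(key)
    if bucket(key) < FLAGS["read-new-percent"]:
        new_value = new_db.get(key)
        if new_value != old_value:
            # Record the disagreement and keep serving the trusted old value.
            mismatches.append(key)
            return old_value
        return new_value
    return old_value
```

Each entry in `mismatches` is one of the "many chances" to catch a schema or data problem before the new database becomes the source of truth.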

You didn’t have to come up with the new way of or how to migrate your old data to the new data, you’re running a forward log of every transaction that’s in your system and you can backfill from your old database. And then, as you get more confident, you can roll it back. Nowhere in here do you have to do reconciliation, you don’t have to … if you find a problem in your data schema, you had many, many, many, many, many chances to catch it. Right? And, you don’t have to … you can move that two percent here be dragons into that 90% I just flip a flag.

We have a full write-up on this on our blog, so you can kind of breeze through that, but this was actually done for us. We migrated between MongoDB and DynamoDB. We have a full write-up as well as test code for you to play with. Something that has actually been battle-tested, we’ve done it. The next thing I wanna talk about is safety valves. And, this is a lot like circuit breakers, but not quite. It’s a long-term flag that can be used to degrade non-critical functionality to maintain availability. That’s just a lot of words. I’m gonna try to walk you through. So, for example, in our platform, we rely on all sorts of downstream providers like people that we buy SaaS services from and sometimes they go down, not to poke fun at any of them, it’s just something that happens and if your business is dependent on their business and they go down, you might go down too.

I mean, that’s just a fact of life. So, you can flag around that. You can say, “Hey, render that whatever element that we’re relying on them for or not. Degrade gracefully.” This is kind of like a circuit breaker pattern. But, what if you are utilizing some provider to provide functionality? Well, you can feature flag which provider you’re using, so perhaps you’re using Mailgun and maybe you’re not feeling that great about Mailgun anymore because they’ve had availability concerns. There’s no reason why you can’t pivot and move to a different system without having to implement it and push it. Right? You can do this in the back and you can build a switchover in your code, such that you can fall over gracefully to backup providers and then, control that behavior across your code.
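Flagging which provider you use, with a graceful switchover to a backup, might look roughly like this. Everything here is hypothetical: the provider names `primary` and `backup`, the `email-provider` flag, and the `DOWN` set that simulates an outage. No real email provider's API is being shown.

```python
# Hypothetical flag: which email provider to try first.
FLAGS = {"email-provider": "primary"}

PROVIDERS = ("primary", "backup")

# Providers currently simulated as being down.
DOWN = set()


class ProviderDown(Exception):
    """Raised when a provider cannot accept the message."""


def send_via(name: str, to: str) -> str:
    """Stand-in for a real provider client call."""
    if name in DOWN:
        raise ProviderDown(f"{name} is unavailable")
    return f"{name} sent to {to}"


def send_email(to: str) -> str:
    preferred = FLAGS.get("email-provider", "primary")
    # Try the flagged provider first, then fall over to the others.
    order = [preferred] + [name for name in PROVIDERS if name != preferred]
    last_error = None
    for name in order:
        try:
            return send_via(name, to)
        except ProviderDown as err:
            last_error = err
    raise RuntimeError("all email providers failed") from last_error
```

Changing `FLAGS["email-provider"]` to `"backup"` moves all traffic to the backup provider without a deploy, and the fallback loop covers the case where the preferred provider goes down before anyone touches the flag.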

Suppose that you have a service that sometimes gets degraded. Anyone who’s run Elasticsearch knows that it’s kind of garbage, but it’s great until it’s not. What if someone pushes some incredibly huge data set or some weird … you know something you didn’t come up with and your Elasticsearch starts failing? Well, I mean, your old options used to be, well, quickly push configuration out to the cluster or quickly add nodes to the cluster to bail it out. But, if it’s one bad actor, if you know that like, “Okay, we think it’s this one customer or this one data source that is impacting Elasticsearch,” then maybe you can just degrade that service gracefully. Maybe for that one customer you say, “Okay, we’re gonna return a 404 for you and you only and no one else. Everyone else is protected.” And, this is how it’s different than a circuit breaker.
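The per-customer safety valve described above can be sketched like this. The `search-disabled-accounts` flag, the account names, and the `run_search` stand-in are hypothetical. The point is only that the flag targets one account instead of tripping for everyone.

```python
# Hypothetical long-lived flag: accounts for which search is switched off.
FLAGS = {"search-disabled-accounts": set()}


def run_search(account_id: str, query: str):
    """Stand-in for the real call to the search cluster."""
    return 200, [f"result for {query}"]


def search(account_id: str, query: str):
    # Safety valve: degrade search only for the accounts named in the flag.
    if account_id in FLAGS["search-disabled-accounts"]:
        return 404, []
    return run_search(account_id, query)
```

Adding the one bad actor to the flag returns a 404 for that account only, and removing it again is the "remember to turn it back on" step.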

A circuit breaker is all on or all off, here you can decide and make a decision about how you want your service to degrade. When using these things you have to, have to, have to understand exactly how they work and also, remember to turn it back on when you fix it. The last thing I want to talk about is dynamic configurations, this is a lot more big words, but what I mean by this is this is very similar to the talk previously. How do you know your configuration is good? You may make some guesses based on your platform, based on your knowledge about how your platform behaves. Like, “Well, we want 500 millisecond long requests, they’re garbage anyway. We can’t meet or escalate with that.” Sometimes you know and sometimes you don’t. Right?

Sometimes you decide your assumption about those requests is wrong. Well, you don’t necessarily have to accept that as set in stone and what I mean by this, and I’ll give a few concrete examples here, is how about caches? Caches are kind of a black art in terms of tuning. How do you do it? What’s the correct cache TTL? How long do you hold onto things? Maybe it’s too long, maybe you’re running out of memory, maybe … who knows, right? And a lot of the times you’re deploying these things kind of carefully, you’re monitoring the heck out of it to make sure that you’re not impacting your overall CPU usage or your memory usage based on the type of system you’re running.

You can push, you can store your configuration in a feature flag and decide what value you should put out there. You may start with the value that’s out there in the wild right now and then, roll, as you get understanding, to a new configuration. More differently, you can get to a point where you are pushing configuration based on routes or other aspects of your instance. This is kind of poor man’s tracing in some ways or you can degrade functionality based on a particular route or a particular IP or a particular node set, particular method, particular account. You can provide different configuration sets for different types of things as long as you instrument them that way.
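Storing a cache TTL in a flag, with different values per route, might be sketched like this. The `cache-ttl-seconds` flag, its `default` and `by_route` shape, the routes, and the `TTLCache` class are all hypothetical. The one deliberate choice is that the TTL is looked up on every read, so a flag change takes effect without a redeploy.

```python
import time

# Hypothetical flag value: a default TTL plus per-route overrides, in seconds.
FLAGS = {"cache-ttl-seconds": {"default": 60.0, "by_route": {"/search": 5.0}}}


def ttl_for(route: str) -> float:
    """Return the TTL for a route, falling back to the default."""
    config = FLAGS["cache-ttl-seconds"]
    return config["by_route"].get(route, config["default"])


class TTLCache:
    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}

    def put(self, route: str, key: str, value) -> None:
        # Store the value together with the time it was written.
        self._entries[(route, key)] = (value, self._clock())

    def get(self, route: str, key: str):
        entry = self._entries.get((route, key))
        if entry is None:
            return None
        value, stored_at = entry
        # The TTL is read from the flag on every lookup, so changing the
        # flag changes expiry for entries that are already cached.
        if self._clock() - stored_at > ttl_for(route):
            del self._entries[(route, key)]
            return None
        return value
```

You could start with the TTL that is in production today as the default, then roll a new value out to one route at a time while watching memory and CPU.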

Now, how you do that in your platforms, this is going to be an adventure, understanding how you might do this, that’s gonna be up to you. But, it’s something that you can do with feature flags without redeploying. So, hopefully, as I spent the last 10 minutes up here, I have accomplished my goal of helping you think about … hopefully, these were not completely obvious like, “Well, of course, that’s just on two,” ways of using feature flags and possibly a different way of fitting it into your DevOps puzzle, however you decide to do that.