Using Feature Flags to Avoid Downtime During Migrations

Optimizing for the cloud means continuously migrating and rearchitecting. These infrastructure changes can be difficult and filled with unknowns, which can make migration painful. Maintenance windows aren’t an option because testing happens in production. What if there were a way to migrate safely while mitigating risks and headaches?

Back in November 2021, LaunchDarkly’s own Mike Zorn and Justin Caballero traveled to AWS re:Invent 2021 in Las Vegas to present a live session on techniques for painless reinvention. At re:Invent, they demonstrated how to convert a streaming event architecture and migrate production databases with zero downtime using gradual, reversible, and verifiable processes controlled through feature flags.

Below is a quick recap, followed by the full transcript, of what was covered in Mike's talk, which focused on how LaunchDarkly handled a recent migration of its own. We'll soon have a recap of Justin's talk, about how to pull off a seamless database migration in an application using LaunchDarkly—without your customers even noticing. If you’d rather watch both presentations in full (which we strongly recommend), you can do so by clicking here or watching below.

First, a proper introduction

Mike kicked off his talk with a clever analogy using the repair of the San Francisco-Oakland Bay Bridge as an example of downtime during a migration. 

He then summarized how feature flags work for anyone in attendance who might have been unfamiliar:

“In principle, these are just a tool that lets you deploy your code when you want and then release it when you're ready—the idea of decoupling your deployment from your release. So, you can ship an unfinished feature to production, test it out in your test account, and if it all looks good, turn it on for your user base.”

Of course, a demonstration was in order, so Mike walked the audience through a simple example of feature flag usage in JavaScript. He showed how one could change the runtime behavior of an application without any other modification, while also showcasing use cases that low-latency updates enable.

Event processing data in the early days of LaunchDarkly

Once Mike had demonstrated simple feature flag usage with LaunchDarkly, it was time to show how the team facilitated the migration of a very high throughput data processing system. 

Because everyone loves a good visual, Mike displayed an extremely primitive stick figure drawing that represented what LaunchDarkly’s event processing data looked like around 2017. With this infrastructure, you would send data about a feature flag evaluation back to our events API, which has a load balancer and a web application behind it. The web application would then fan that data out to many different databases that powered several features in our application.

However, this architecture struggled as the application grew because memory demand exceeded what the architecture could support. (The full video explains what exactly was going wrong and why.)

So, we needed a new architecture that would minimize data loss. 

A (somewhat) simple solution

Mike then outlined the solution we used, which was to put a Kinesis queue inside of our web application. Then we moved our data fan-out into AWS Lambda and did the processing in Lambda instead of this web application.

For this to work, we had to transition from the old architecture to the new one. So we added these two new pieces: sending the data to the Kinesis queue and processing the data in Lambda instead. We then created two different feature flags for those two pieces to allow us to do these migrations separately. However, this created its own set of problems that needed to be addressed before rolling it all out (which Mike also details thoroughly in the video).
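The two-flag setup described above might look roughly like this (a runnable sketch with stand-in flag names and actions; the real system evaluates these flags through the LaunchDarkly SDK):

```javascript
// Two flags gate the two new pieces of the architecture independently.
// A plain object stands in for the flag store so the sketch runs on its own.
const flags = {
  'send-to-kinesis': true,    // piece 1: also write events to the Kinesis queue
  'process-in-lambda': false  // piece 2: process events in Lambda, not the web app
};

function handleEvent(event) {
  const actions = [];
  if (flags['send-to-kinesis']) {
    actions.push('wrote to Kinesis'); // stand-in for a real put-record call
  }
  if (flags['process-in-lambda']) {
    actions.push('processed in Lambda'); // new path
  } else {
    actions.push('processed in web app'); // old fan-out path
  }
  return actions;
}

console.log(handleEvent({ flagKey: 'my-cool-flag' }));
// ['wrote to Kinesis', 'processed in web app']
```

Turning each flag on separately lets you verify the Kinesis write path before any processing depends on it.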

Rolling out the new architecture

Once we had smoothed everything out and were ready for the rollout, a typical day looked like this: we would turn on the new architecture for maybe 30% more users, watch a new graph climb in our monitoring tool, realize that we had uncovered a new bottleneck in the new data processing system, and then switch those users back to the old processing path. Then, we would figure out why that graph went up.

We repeated this process often. And although we had a system that couldn’t quite handle the load, we were able to roll back when needed without anything bad happening. This was the perfect moment for Mike to bring the San Francisco-Oakland Bay Bridge repair analogy full circle.

Again, this was a broad recap of Mike’s full talk on using feature flags to safely migrate our streaming event ingestion pipeline. To hear all the details, you can (and should!) watch the full video here. The second half of this talk features Justin Caballero’s presentation on how to pull off a seamless database migration in an application using LaunchDarkly—without your customers even noticing. 

You can view the full video with both presenters, "Using Feature Flags to Avoid Downtime During Migrations," from AWS re:Invent 2021 by clicking here.



Transcript:

Hello, my name's Mike and I'm here with Justin. We're both engineers at LaunchDarkly, and we're going to tell you about some examples of using feature flags to avoid downtime during migrations. So, before we talk about those examples, let's just talk about a really complicated migration that we can all maybe relate to.

Here's a picture of the San Francisco-Oakland Bay Bridge. If you're from the Bay Area, you've probably seen it. It was damaged in the 1989 earthquake and needed to be replaced, and it took 11 years to do that. Sometimes you build something and think, "Oh, we'll never have to rebuild that," and this, I think, is a great counterexample. It was a kind of Wonder of the World-scale project at the time it was built. It actually consumed 10% of a year's worth of steel production for the entire country, so it was a pretty massive thing. And even this massive thing needed to be rebuilt. When it was rebuilt, they actually had to do a migration. The new bridge was built, and then you had to get the cars onto it. The question is, "How'd they get the cars over there? All the cars were going on the old bridge." Well, the answer is that they closed the bridge for five days, reconfigured the lanes during those five days, and then reopened the bridge.

So, as a software engineer, do you remember the last time you took down critical infrastructure for five days? Probably, the answer is no. And the reason for that is we have different SLAs as software engineers. If you work for a SaaS company, you probably have an SLA of three or four nines. Four nines is 52 minutes of downtime a year. But, on the other hand, we're probably not building systems that operate just fine for 50 years, because cloud infrastructure changes pretty fast. There's a new EC2 instance type, normally, every couple of years. So, those factors create this scenario where software engineers are migrating more often, and we have less tolerance for unavailability than a civil engineer would. But the good news is that we have a lot of advantages and tools that somebody that's building a bridge doesn't have.
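As a sanity check on that number, the four-nines arithmetic works out like this (a quick back-of-the-envelope calculation, not from the talk):

```javascript
// Allowed downtime per year at 99.99% ("four nines") availability.
const availability = 0.9999;
const minutesPerYear = 365 * 24 * 60; // 525,600 minutes
const downtime = (1 - availability) * minutesPerYear;
console.log(downtime.toFixed(1) + ' minutes'); // ≈ 52.6 minutes per year
```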

One of those tools is feature flags. In this talk, we're going to talk about some examples of using this one tool, feature flags, to make migrations a lot easier. We just went through the introduction. We're going to talk briefly about what feature flags are, and then we're going to go through two examples of some larger migrations we've undertaken at LaunchDarkly. One is a high throughput data processing system that I'll tell you about, and the other is a migration of our core data store that Justin's going to talk about.

But first, feature flags. In principle, these are just a tool that lets you deploy your code when you want and then release it when you're ready. So the idea of decoupling your deployment from your release. So you can ship an unfinished feature to production, test it out in your test account, and if it all looks good, turn it on for your user base. So, what does that look like?

Here's a very simple example in JavaScript. We just have an if statement, we have our feature flag here, my cool flag, and if my cool flag gets turned on, then our application will log out "very cool." This is actually a little more powerful than you would think. In addition to decoupling deployment from release, it's changing the runtime behavior of our application without any other modification, which is an interesting property. And there's a second property of feature flags. If you're using a product like LaunchDarkly, the updates are actually very low latency. So if you click the feature toggle button in the LaunchDarkly web UI, within much less than a second, your change will take effect on your servers or your mobile clients, what have you. If you're just doing a simple on/off, that's probably overkill. If you're just releasing a new feature, you probably don't need it released in the same second everywhere in the world.
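The snippet Mike showed was along these lines (a runnable sketch; a plain object stands in for the flag values the LaunchDarkly SDK would fetch for you, and the flag key is illustrative):

```javascript
// Stand-in for flag values the LaunchDarkly SDK would normally provide.
const flags = { 'my-cool-flag': true };

// Evaluate a flag with a fallback value, similar in spirit to the SDK's
// variation() call.
function variation(key, fallback) {
  return key in flags ? flags[key] : fallback;
}

if (variation('my-cool-flag', false)) {
  console.log('very cool'); // new behavior, released without a redeploy
}
```

Flipping `'my-cool-flag'` changes the runtime behavior with no code change or deploy.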

But this low latency enables some additional use cases. One is the ability to turn something off if it's not working, and you want that to happen as fast as possible. Another thing we've done internally is to use feature flags to change your monitoring fidelity. So if a customer is seeing a higher error rate, you can turn the feature flag on to enhance the logging for that customer and get additional information without blowing your monitoring vendor budget. And the use case we're going to talk about today is migrations.
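Before moving on to migrations, the monitoring-fidelity idea can be sketched roughly like this (the flag targeting and customer IDs are made up for illustration; in practice the per-customer decision would come from a LaunchDarkly flag):

```javascript
// Customers for whom a "verbose logging" flag has been turned on.
const verboseLoggingCustomers = new Set(['customer-123']);

// Pick a log level per customer: debug detail only where it's needed,
// warn everywhere else, so the monitoring bill stays under control.
function logLevelFor(customerId) {
  return verboseLoggingCustomers.has(customerId) ? 'debug' : 'warn';
}

console.log(logLevelFor('customer-123')); // debug
console.log(logLevelFor('customer-456')); // warn
```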

We went through the introduction, what feature flags are, and now we're going to get into our first migration: the migration of a very high throughput data processing system that we run at LaunchDarkly. Here is the LaunchDarkly event processing infrastructure circa 2017. If you evaluated a feature flag on, say, a phone, you would send some data about that feature flag evaluation back to our events API, which has got a load balancer and then a web application behind it. And that web application would then fan that data out to a bunch of different databases that powered a bunch of different features in our application.

This architecture was fine for a small, scrappy startup, but it caused a big problem as the application grew: every time you added one of these new data stores to power a new feature based on the event data, you would have to add a new database, and that database would create a new failure mode for the whole events API, because there's no back pressure in this design. If writes to one of these databases slowed down, writes would just queue up in the memory of this web application, and once you ran out of memory, everything crashed, and the on-call person got paged. We realized we needed a new architecture. We wanted one that would minimize data loss; we didn't want an application crash causing us to lose several gigabytes of data. And we wanted it to be scalable to more features; we wanted to stop adding a new failure mode that could cascade to everything else every time we created a new feature.

Here is what we ended up with. Pretty simple. We just put a Kinesis queue inside of our web application. The idea here was you're still sending the events to our load balancer, it sends it to the web app, but now the web app, instead of sending it out to all these different databases, we just put it into a Kinesis queue. Then we moved our data fan out stuff into AWS Lambda and just did the processing in Lambda instead of this web application.

It worked great. But before it worked great, we had to go from the old architecture to the new architecture. So we added these two new pieces: we were sending the data to the Kinesis queue, and we were now processing the data in Lambda instead. So we created two different feature flags for those two pieces, and that would allow us to do these migrations separately. We could first start writing all the data to Kinesis and test that, making sure that our mechanism to batch the writes into Kinesis was operating well before we started doing processing based on those Kinesis writes, so we could test that stuff independently.

But that created a bit of a problem that I'll talk you through now. So we have those two flags, and they're evaluated at two different times: one when you write to the Kinesis queue, and one when the data is processed. This creates a bit of a problem if you do it that way: what if you made a change to the flag state in between those two evaluations for a single payload of event data? If that happened, your data would end up just getting lost, because you would put it into the queue, and then when you try to take it out, you say, "Oh, okay, I don't need to do this work." That is not a great scenario.

So if you need to do a migration like this and you want to use feature flags, what you need to do is basically put the feature flag evaluations into the same spot in your code, and then attach the second flag evaluation, the one that controls the data processing, as metadata in the event. This way, you do both your flag evaluations at one instant in time. And even though this person got removed from the new style of data processing, their event is still processed in the new way, because it wasn't going to be processed in the old way, given when the flag evaluation happened. So we're kind of just serializing the read.
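A sketch of that fix, with illustrative flag names: both flags are evaluated together at ingestion time, and the processing decision travels with the event as metadata, so the processor never re-evaluates the flag and a mid-flight flag change can't strand a payload:

```javascript
// Stand-in flag store; the real system reads these via the LaunchDarkly SDK.
const flags = {
  'send-to-kinesis': true,
  'process-in-lambda': true
};

function ingest(event) {
  // Both flag evaluations happen here, at one instant in time.
  return {
    ...event,
    processInLambda: flags['process-in-lambda'] // decision rides along as metadata
  };
}

function processFromQueue(enqueued) {
  // No flag lookup here: honor the decision made at ingestion time.
  return enqueued.processInLambda ? 'processed in Lambda' : 'processed in web app';
}

const enqueued = ingest({ flagKey: 'my-cool-flag' });
flags['process-in-lambda'] = false; // flag flips while the event sits in the queue
console.log(processFromQueue(enqueued)); // still 'processed in Lambda'
```

The flag flip only affects events ingested after it; everything already in the queue is processed consistently with the decision made when it was written.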

So now that we have this mechanism to evaluate flags so that we process data consistently in our old system and our new system, we're ready to undertake this migration. Here is a typical day of rolling out this new architecture. We would turn on the new architecture for maybe 30% more users, and then we would see a new graph go up and to the right in our monitoring tool, and realize that we had uncovered a new bottleneck in the new data processing system, and then we would switch the users back to the old way of being processed. Then, we would need to figure out why that graph went up.
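The percentage rollout Mike describes can be sketched like this (LaunchDarkly handles the bucketing for you; this hand-rolled hash only illustrates the idea of deterministic, instantly reversible routing):

```javascript
// Deterministically bucket each user into 0-99; not cryptographic,
// purely for illustration.
function bucket(userId) {
  let h = 0;
  for (const ch of userId) h = (h * 31 + ch.charCodeAt(0)) % 100;
  return h;
}

// Users whose bucket falls below the rollout percentage get the new system.
// Rolling back is just lowering the number; the same users move back together.
function useNewArchitecture(userId, rolloutPercent) {
  return bucket(userId) < rolloutPercent;
}

console.log(useNewArchitecture('user-42', 50));
console.log(useNewArchitecture('user-42', 0)); // always false after rolling back to 0%
```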

This was a process we repeated often. But what was really cool here is we had this system that couldn't quite handle that load, and we were able to roll it back without anything bad happening. You can contrast this with the bridge example. If you were on a bridge and 30% of the people went onto the new bridge and then you said, "Oh wait, actually don't go onto the new bridge," it would just be chaos. Feature flags are this perfect way to route traffic without any real overhead. And what was really nice is that it gave this migration some really good properties. It allowed us to avoid data loss.

But the real big thing was that we were able to deliver incremental value sooner. One approach we could have taken here would be to do an exhaustive load test and figure everything out up front. That's how you would want to build a bridge. But for us, it actually is better to have some people using the new system, even when it's not scalable enough or not scaled enough to process all the load, because the customers who were able to move onto the new system get the additional value of the durability.

Because one of the things here was that we were making newer features that required more durability. We had some things where, if the data was missing when it came out the other end, people would notice, and that would be bad. So, for customers where those features were more important, we were able to migrate them onto the new durable system sooner.

And we were able to make this an iterative process. Instead of having to figure out all of these problems at once, we were able to come across these edge cases as we saw them. Sometimes some customer would be sending us some kind of surprising data and then we would take them out and still keep processing all the data for everybody else in the new system. So, in those ways we were able to iterate through our problems while we did this migration.

That was our high throughput data processing system. I'm going to hand it over to Justin now. He can tell you about how we've been migrating our core datastore.
