In May we hosted the Test in Production Meetup at the Microsoft Reactor in San Francisco. This time we wanted to hear how the culture of failure (avoiding failure, recovering from failure, and learning from failure) has an impact on how teams test in production.
Ben Woskow, Engineering Manager at LaunchDarkly, talked about why failures aren’t exclusively negative events. He led a discussion around how teams can make the most of failures, and turn them into valuable learning experiences.
“When you really instill this notion in your team, the results can be quite dramatic because suddenly, individuals are no longer worried that the vulnerabilities and the bugs that they’ve introduced to their code will be exposed and that they’ll be at fault, but rather, if you’re playing as a team and you find these vulnerabilities, and you identify them, then you can resolve them before they ever happen.” – Ben Woskow, Engineering Manager, LaunchDarkly
Watch Ben’s talk to learn more about how failures can offer teams the opportunity to improve their products and gel as a unit. If you’re interested in joining us at a future Meetup, you can sign up here.
Ben Woskow: All right. I’m here to tell you about how failure can improve your team and product. First, who am I? Well, I just got introduced, but my name’s Ben Woskow and for the past three months I’ve been an Engineering Manager at LaunchDarkly, managing all of our SDKs. Prior to that, I spent almost nine years at Atlassian managing most recently the Marketplace Team.
Before I begin, I want to take a quick poll from the audience. How many of you have worked on a product, or a project, that’s experienced some sort of failure? All right. Some of you are in denial I think. I didn’t get everybody raising their hands, but truthfully, we’ve all had some sort of failures. In this particular talk, I want to focus on one particular kind of failure and that would be production incidents. Production incidents can happen whether or not you’re testing in production, but they happen. They’re inevitable. We can imagine that we live… we can imagine that we… I don’t know how to space bar double-click. Darn.
While we’d love to think that our products and teams are perfect and that we’re living in some sort of paradise, like this, not like the next slide, that’s unfortunately not reflective of reality. Do our developers occasionally make mistakes? Yeah. They’re human. Are we dependent on third-party services or libraries that might have some sort of instability or vulnerabilities? Yeah, absolutely. Even if all of your software, all your dependencies are perfect, is there always a chance that a natural disaster, a power outage, or some other widespread event will knock out your underlying infrastructure, making your Cloud application completely inaccessible? Yeah, it’s happened to several of us, I’m sure. Myself included.
Instead of living in some sort of paradise, we could live in this. Because disasters always happen or can happen, and incidents can always happen, we need to be prepared for them and we need to be able to manage them appropriately. With that, I want to introduce you to incidents. Again, everybody has incidents, but let’s say that we’re having a hypothetical incident right now. What’s the first thing that we would do? Anyone?
Audience members: Call everybody.
Ben Woskow: Call everybody. Yeah. You know what? You’re actually pretty close. We might start screaming and panicking and saying, “Oh, my God. What do I do, what do I do?” Hopefully, this phase won’t last very long because we’re not going to get anywhere there, so we’ll stop because we’re good engineers and we remember what we learned at this Meetup.
The first question that’s often asked instead is can we roll back? Right? Maybe we deployed a change just now or maybe there’s an application configuration that just got rolled out and that coincided exactly with the start of our incident. Great. Let’s roll back. Unfortunately, as we all know, that’s not always possible, so what do we do if we can’t roll back? Well, how do we roll forward? Let’s implement a fix, let’s get it out there. How long does this take, though? We have to identify the fix, implement it, test it, get it through code review, wait for our continuous integration to complete, deploy it. This takes some time.
In my experience, at both LaunchDarkly and Atlassian, it’s the third question that we often ask. That question is can we flip a feature flag? Super powerful, super quick. If you flip a feature flag, what that allows you to do is your application can immediately recover without having to deploy any changes of any kind. No deployments. Here’s an example of a feature flag on LaunchDarkly. Let’s say that we’re gating access to multiple features on our application with feature flags. Hopefully, it will come as no surprise to know that LaunchDarkly does exactly this. In fact, we pretty much feature flag everything in our application and why not? It’s best practice. We’re big fans of following feature-flag-driven development and I think there’s a book back there actually where you can learn more about it if you’d like from our CTO and co-founder.
Here’s an example from our application about how we can turn off a feature on our application. If we, let’s say we’re having an incident where our audit log functionality is causing instability across the rest of our site, all we need to do is go to our feature flag and disable the targeting and boom. That functionality will no longer be available publicly, the rest of the site is stable. Our incident has been resolved. Let’s say instead of it being our own code that’s problematic, maybe there’s a third-party dependency that we’re using that has become unstable. The solution’s actually no different.
Once again, we can go to our feature flag, we can flip targeting to off, and we can instantly resolve our incident. In this case, the feature flag is to manage our connections to an upstream dependency and if we disable the feature flag, then that’s severed and resolved. Or you might notice that there’s actually a prerequisite. Maybe there’s some sort of, I don’t know, disaster that’s affecting multiple of our dependencies and we want to be able to disconnect all of them at once. Then we can do that here. It’s really powerful.
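As a rough illustration of the kill switch and prerequisite described above, here is a minimal Python sketch. It does not use the LaunchDarkly SDK: the `FlagStore` class, the flag keys, and the `render_nav` helper are all hypothetical stand-ins, and a real flag service would deliver flag changes to running processes rather than hold them in memory.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Flag:
    """A boolean flag; `on` mirrors the targeting on/off toggle from the talk."""
    key: str
    on: bool = True
    prerequisites: List[str] = field(default_factory=list)


class FlagStore:
    """Hypothetical in-memory flag store, standing in for a real flag service."""

    def __init__(self) -> None:
        self._flags: Dict[str, Flag] = {}

    def add(self, flag: Flag) -> None:
        self._flags[flag.key] = flag

    def set_targeting(self, key: str, on: bool) -> None:
        self._flags[key].on = on

    def is_enabled(self, key: str) -> bool:
        flag = self._flags.get(key)
        if flag is None or not flag.on:
            return False  # unknown or switched-off flags fail closed
        # A flag is only served when every prerequisite flag is also enabled.
        return all(self.is_enabled(p) for p in flag.prerequisites)


# Assumed flag layout: one flag per feature or upstream connection, plus a
# shared prerequisite that can disconnect every upstream dependency at once.
store = FlagStore()
store.add(Flag("audit-log"))
store.add(Flag("external-dependencies"))
store.add(Flag("upstream-billing", prerequisites=["external-dependencies"]))
store.add(Flag("upstream-search", prerequisites=["external-dependencies"]))


def render_nav(flags: FlagStore) -> List[str]:
    """Build the navigation menu, hiding the audit log when its flag is off."""
    items = ["dashboard", "flags"]
    if flags.is_enabled("audit-log"):
        items.append("audit-log")
    return items
```

In this sketch, calling `store.set_targeting("audit-log", False)` hides the audit log on the next request with no deployment, and switching off `external-dependencies` disables both upstream flags in one step.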
Now, when you think of feature flags, you might think about just turning things on and off, but that’s actually not all that they’re capable of. Here we have an example that actually saved us in a real-life incident about, I think it was three weeks ago, where one of our customers started hammering us with just an incredible number of requests and our services became unstable. A common answer to that is, well, you should be rate limiting. The way that we do rate limiting at LaunchDarkly, you guessed it, is by feature flagging. Here in this animation, you can see what that feature flag looks like. We have a bunch of different variations where you can specify the different rates at which they would be flagged and within each of those you can specify how you want to target.
In this case, we targeted one specific account because it was that one customer that was causing us to be unstable, but we could also target based on HTTP method, we could target based on authentication method, their IP address, however we want and through this method, we were able to resolve our incident within minutes. The longest part was just identifying which customer it was. If we think back to these three questions, can we roll back, how do we roll forward, can we flip a feature flag? They all boil down to the same root question, which is how do we fully mitigate the customer impact because that’s what’s most important at the end of the day.
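The rate-limiting flag described above can be sketched roughly as follows. This is not LaunchDarkly's implementation: the variation names, the `Rule` and `RateLimitFlag` types, the fixed one-minute window, and the account identifier are assumptions made for illustration. The idea it shows is that each variation carries a rate and targeting rules pick the variation for a given request.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Rule:
    """Serve `variation` when a request attribute equals `value`."""
    attribute: str  # e.g. "account", "http_method", "auth_method", "ip"
    value: str
    variation: str


@dataclass
class RateLimitFlag:
    """A multivariate flag whose variations are requests-per-minute limits."""
    variations: Dict[str, Optional[int]]  # None means unlimited
    default: str
    rules: List[Rule]

    def limit_for(self, context: Dict[str, str]) -> Optional[int]:
        for rule in self.rules:  # first matching rule wins
            if context.get(rule.attribute) == rule.value:
                return self.variations[rule.variation]
        return self.variations[self.default]


class FixedWindowLimiter:
    """Counts requests per account per minute and enforces the flag's limit."""

    def __init__(self, flag: RateLimitFlag) -> None:
        self.flag = flag
        # Old windows are never evicted here; a real limiter would expire them.
        self.counts: Dict[Tuple[str, int], int] = {}

    def allow(self, context: Dict[str, str], now: float) -> bool:
        limit = self.flag.limit_for(context)
        if limit is None:
            return True
        key = (context.get("account", "anonymous"), int(now // 60))
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key] <= limit


# Hypothetical incident response: throttle one noisy account, leave the rest alone.
flag = RateLimitFlag(
    variations={"unlimited": None, "throttled": 2, "blocked": 0},
    default="unlimited",
    rules=[Rule(attribute="account", value="acct-noisy", variation="throttled")],
)
limiter = FixedWindowLimiter(flag)
```

Adding a second `Rule` on `http_method` or `ip` would target by those attributes instead, which corresponds to changing targeting in the flag rather than shipping code.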
One argument against feature flagging could be that we’re not actually solving the problem. That customer out there that was hammering us with requests, they’re still hammering us with requests, we’re just blocking them, but if we disable the feature flag, guess what? We will have another incident and they’ll continue hammering us and we’ll continue being unstable. What a feature flag does for us is it stops the customer impact, so we can then, at our own leisure, we can investigate, identify the root cause, and fix it without our customers having to continue to feel that pain in the meantime.
Our incident has been resolved, great, what’s next? Are we done? No. After an incident, you should always have a post-incident review or PIR. PIRs are a process that allow teams to introspect on what happened, so that they can improve on it for next time. Note that in addition to PIRs, this process could actually happen before the incident happens, so you could have something called the pre-mortem where you go through the same exact steps, the same process as you would for a post-incident review, but before the incident happens, you would identify a potential problem. Then before it happens, you would figure out how to resolve it and then you do so.
Here’s some sample questions that I like to ask during PIRs and the outputs from those questions. The simplest question that you start with is what happened. From this, you get your timeline of events. This is the foundation for everything else that you discuss during your PIR. You have the question of what went wrong? From this, you’ll get the identification of your root and proximal causes. What was the lowest level reason that your incident occurred and what are all the secondary reasons that happened related to that? Why did this happen? It’s a pretty basic question, but from that you’ll get your root cause analysis, a deep level understanding of what happened, so that it can be fully addressed. How did we find out that there was an incident? From that you’ll do your detection analysis. Do we have monitoring and alerting in place? Great, we do, but did that monitoring actually work, did it actually detect the error such that we were alerted?
Lastly, how long did it take for us to recover? From this you’ll do a response and resolution analysis. What part of our process was the fastest or slowest? Is there a part of our process that could be faster, so we could resolve our incidents faster next time? From all these questions come the answers, and most important among those are your improvement actions. When you do your PIRs, your post-incident reviews, it’s most important that you come out with improvement actions because otherwise, it’s just talk. If it’s just talk and no action, you’re not going to improve and you’re going to have the same incident again or maybe a worse incident, so constantly be improving.
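One way to keep those questions and their outputs together is a small record per review. The sketch below is only an illustration, not a template from the talk: the field names, the `owner` field, and measuring both durations from the start of customer impact are assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List


@dataclass
class ImprovementAction:
    description: str
    owner: str  # assumed field: who is responsible for the follow-up


@dataclass
class PostIncidentReview:
    # "What happened?" -> timeline of events (reduced here to three timestamps)
    started: datetime
    detected: datetime
    resolved: datetime
    # "What went wrong?" and "Why did this happen?" -> root and proximal causes
    root_cause: str
    proximal_causes: List[str] = field(default_factory=list)
    # "How did we find out?" -> detection analysis
    detected_by_alerting: bool = False
    # The output that matters most: improvement actions
    actions: List[ImprovementAction] = field(default_factory=list)

    def time_to_detect(self) -> timedelta:
        return self.detected - self.started

    def time_to_recover(self) -> timedelta:
        # "How long did it take for us to recover?"
        return self.resolved - self.started

    def has_improvement_actions(self) -> bool:
        # A review with no actions is "just talk".
        return len(self.actions) > 0


# Hypothetical example data, not taken from a real incident.
pir = PostIncidentReview(
    started=datetime(2019, 5, 1, 10, 0),
    detected=datetime(2019, 5, 1, 10, 12),
    resolved=datetime(2019, 5, 1, 10, 40),
    root_cause="Unbounded request volume from a single account",
    proximal_causes=["No per-account rate limit configured"],
    detected_by_alerting=True,
    actions=[ImprovementAction("Add a default per-account rate limit", "sdk-team")],
)
```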
With these incidents and with these PIRs, if you’re already doing PIRs, that’s great, but even if you’re not, there’s one other thing I want to talk about because I think it’s really important, which is incident values. Much like how businesses have company values by which they operate, it’s really important to have values when managing your incidents and your post-incident reviews because you want to make sure that you’re handling your incidents responsibly and getting the most value out of them, such that it’s a useful process.
First, I want to call out my former colleagues at Atlassian, there’s a few Atlassians here, so hello. You already know these people, but these are based on Atlassian’s incident values from Patrick Hill and Jim Severino. They’ve been generalized for external consumption here. First of all, know before your customers do. This relates to the detection phase of an incident and you should always make sure that you have monitoring and alerting in place, so that you find out about your incidents before your customers do. No customer has ever been happy when they file a support request and realize that they are the first ones to tell the company that they’re having an incident. Yet, we’ve all been there.
Two, escalate, escalate, escalate. This relates to the response phase of an incident and you should never worry about waking someone up in the middle of the night because the team that you’ve assembled feels like they don’t have the resources they need to resolve an incident. Resolving your incidents is the most important thing and if that’s the case, escalate. Wake someone up in the middle of the night if you need to.
Third, recover quickly. We’ve talked about this a bunch, but often fixing, or even identifying the root cause takes a lot of time, so you should focus instead on resolving the customer impact first and then secondly, resolving the root cause because at the end of the day, your customers mostly care about how long your service is unstable and less about why it was unstable. That’s really important to remember.
Fourth, incidents are always blameless. We’re all on teams. There are no individual players out there. Secondly, incidents are inevitable. They happen to all of us and we all cause them at some point in time or play our part in them. I know I have, so because we all cause them, we have to recognize that incidents are no individual’s fault. When you really instill this notion in your team, the results can be quite dramatic because suddenly, individuals are no longer worried that the vulnerabilities and the bugs that they’ve introduced to their code will be exposed and that they’ll be at fault, but rather, if you’re playing as a team and you find these vulnerabilities, and you identify them, then you can resolve them before they ever happen. When your employees no longer feel vulnerable, they’ll be much more open to sharing about all they know about your systems.
Lastly, never have the same incident twice. This relates to the improvement phase. Find the root cause and fix it. Identify the proximal causes, fix those, too. When you come up with your improvement actions from your post-incident reviews, don’t just come up with action items and then put them in your Jira instance and forget about them. They’re on the backlog, we’re good. No, actually assign due dates to those tasks and make sure that they get done. Make sure that you’re continuously improving.
I want to circle back to this image of paradise really quickly because, yeah, we don’t live in paradise. We just got that. Just because we don’t live in this paradise doesn’t mean that we can’t make our world a bit better. If we resolve our incidents quickly through the use of feature flags or some other means, if we go through our PIR process, and if we adhere to our incident values, then we’ll have fewer incidents, our incidents will be resolved faster, and our teams will be working better together. All in all, that will make your customers much happier. I’ve seen all of those things make huge impacts on my teams and I hope they do for you, too. Thank you very much.
Host: Ben, thank you so much. We’d like to open it up to questions. Raise your hand again. We’ll bring you a microphone. Awesome. Kim’s going to come to you. Over here first. Thanks.
Audience member: Oh, over there or here? Okay. I definitely understand the importance of a feature flag, especially from an incident response perspective having calm operations people is definitely always a good idea, but from like a UI perspective, would you still put up a status page if say you remove something from the customer’s UI or how would you communicate that the service didn’t just magically disappear, that it will eventually come back?
Ben Woskow: Yeah, great question. In my first example, my first example incident was where our audit log was misbehaving, so we had to just hide that functionality from the site. I would definitely put that on our status page because while the rest of our site was stable and behaving correctly, one of our pieces of functionality was still missing and that’s still impacting some of our customers, so I would want to let them know. The severity that I would be posting on the status page would probably be different because it would be just one piece of functionality instead of the entire site, but I would still share that with my customers and be transparent.
Host: All right, Ben, over here.
Audience member: Hi. I really liked your presentation and I like how you broke down the different steps. I was just wondering if you have any recommendations or suggestions if multiple incidents are happening at the same time and they may or may not be related to each other? How would you deal with that?
Ben Woskow: Yeah, that sounds like a nightmare. What I would say is have multiple teams looking at… Bring multiple teams onboard, such that each team could look at each incident. Don’t have multiple incidents necessarily being looked at by multiple teams unless you’re pretty sure that they’re related because they might not be, but definitely you would want to make sure that the teams are sharing information with each other. That’s paramount because otherwise you’re going to have them doing the same investigation, looking at Honeycomb at the same pages or trying to look at the same feature flags in LaunchDarkly. I think that it’s probably too much for any individual team to manage on their own, but you would need a larger staff and make sure that they’re all communicating with each other really well.
Host: All right. We have a question over here.
Audience member: My question is, obviously, when you flag everything or you said you flag a lot of things at LaunchDarkly, I imagine that at some point you forget what a certain state can be if you flagged something like two years ago. Can you talk about the process that you guys use whenever you do decide to turn a flag off that was implemented maybe about a year ago and what that process is like, at least following up with the customer to make sure it didn’t introduce like another issue?
Ben Woskow: Yeah, definitely. This Meetup is called Test in Production, but that doesn’t mean that you can’t also test in other environments. First things first, when you’re flipping a feature flag, I would strongly encourage you to try to reproduce the problem in staging or some other similar environment. Then flip the feature flag and see if that resolves the issue. Check not only whether that resolves the issue, but also, as much as possible, what other parts of the site it impacts. That’s the main thing.
In terms of making sure that you don’t have a long list of feature flags that are outdated and irrelevant, there’s some best practices around that. Flag cleanup is almost as important as flag creation. You want to make sure that when flags are truly no longer relevant, and you know that you will never disable the flag or return to the prior state, then you want to remove that part of your code from your application. For example, if you want a feature flag to remain in place as a kill switch, that’s a great use case to leave in place permanently, but sometimes you’ll release some new functionality or a complete site redesign and it’s been out for a year, it’s highly unlikely that you’ll be returning to your old site design, so you can probably remove that feature flag at that point.
Host: All right. A question here.
Audience member: The multiple incident thing kind of reminded me of, I had flashbacks to, we have a multiple layer problem. The best practices in sort of the network space where there were multiple parties involved, was to be very good at pointing the finger somewhere else. Right? Like, well we’ve isolated… the problem is not in our part of the solution, so we’re not at fault. I’m wondering if… Fortunately, that was probably a decade ago. I’m wondering what are the cultural ways to reduce the likelihood that people will focus first on, well let’s make sure it’s not our problem, as opposed to, let’s all work together to figure out what’s the real problem and fix it?
Ben Woskow: Yeah, so let’s take an example where we’re working at a company, we have multiple microservices, those microservices are dependent on each other, and it’s some other microservice that all of our products, all of our teams depend on, but it’s still managed within our company. In that case, we can still help investigate and feed information to that team and say, “This is what we’re encountering,” because any information that we can give, that we can provide while working with them, will still be helpful to them. It’s still information that they get and time that they don’t need to spend themselves. If they’re the team that is truly responsible, that 10 years ago we would have been pointing our finger at them saying, “Ha, go fix it,” they should be spending their time identifying the root cause with all this information that we’re providing, and then fixing it.
At the end of the day, you have a common goal of resolving the incident, whether you are the upstream or downstream service that’s affected by the incident, it’s not fun for anybody and you still have a common customer base, so culturally you still want to instill a notion that you’re working together as a team, that you have a common goal, and that your end customers are your most important asset there for both sides.
Audience member: Would you share some of your best practices for flag cleanup? I think Honeycomb loves LaunchDarkly, but we tend to rip flags out as soon as we feel comfortable doing so in order to avoid stale code and stale flags. I’d love to hear what we should be doing instead to be able to use flags as a kill switch?
Ben Woskow: Well, we have something on the roadmap that’s going to make it even better, so I… I’m going to defer that question, which is kind of a cop out, but look for something in LaunchDarkly very soon.
Host: All right, Ben. I think we have time for one more question.
Audience member: This is a question not regarding the presentation itself, but the title of the presentation. How do incidents and failures make a better team or improve the team?
Ben Woskow: It’s all about working together and trusting each other. It’s mostly about how to improve your product, but the part where it improves your team is really about having blameless incidents and working together as a team because when you make a change, you might have been the developer, but someone else on your team probably reviewed the code or someone else on your team probably did the technical design with you or filed the spec or filed the issue in your issue tracker or maybe someone else pushed the deployment button. You’re all collectively at fault, in the same way that when you have a huge success, you’re all collectively successful. When you start thinking about working together as a team, it helps in your incidents, but then it also helps just generally working together in all ways.
Audience member: Having like cultural norms or expectations around on calls and mitigations and post-mortem reviews?
Ben Woskow: That’s part of it, but that’s not all of it. It’s just instilling a culture that no one is at fault, but that it’s also that when you succeed, everyone succeeds, when you fail, everyone fails as opposed to any individual being solely responsible for this.