Planning for the Outages You’re Going to Cause

Ramin Khatibi speaking at the Test in Production Meetup

Ways to shrink the blast radius of failures in your software

At the November Test in Production Meetup in San Francisco, Ramin Khatibi, a Site Reliability Engineer (SRE) and infrastructure consultant, laid out strategies software teams can employ to minimize the impact of failure.

Watch Ramin’s full talk to learn the benefits of releasing small changes often, preemptively looking for failure, communicating with your organization about the changes your team is making, and more. If you’re interested in joining us at a future Test in Production Meetup, you can sign up here.

FULL TRANSCRIPT:

Ramin Khatibi: I’m Ramin. I’ve worked on 24/7 systems since about 1996. It’s been a mix of everything from small startups to ISPs to web apps, video streaming, and things with billions of pages. And I really only say that to show that I’ve actually broken all of that at some point. I’ve pretty much caused an outage everywhere. Sometimes they’re very small and sometimes, you know, they end up on CNN. That’s a little bit freaky when that sort of stuff happens. There’s always a lot of conspiracy theories about how things broke, and the reality is we screwed up most of the time. So, fun stuff. And really, in that case, it was, “Hey, our positive test worked, and the negative test we forgot to run.” It turns out most of production was the negative test, and there you go. So, I changed the name of this to part one of three, because I was going to do before outages, during outages, and after outages.

But really, we just have time for before the outages, and I wanted to start off on a good note: you know, I’m fallible and often overbearing. So, I found this cool poem which really talks about talking down to people and thinking that, just because you know a couple of things, you might know the secret to everything. That is not true. Your system is your system, and the way you run it needs to be a decision you make. Please don’t go back to work tomorrow and go, “This guy stood on the stage and told me things.” I will certainly tell you the things, but whether they’re valid or not is a good question.

All right. This talk is preparing for the outages you’re going to cause, because you’re going to cause outages. It’s pretty much impossible to have an outage-free career. I’d say absolutely impossible. Things like chaos theory and the second law of thermodynamics will tell you, you just can’t win. The other thing is, there are two names on here at the bottom, Sidney Dekker and Erik Hollnagel. They pretty much wrote the book, literally, on safety, human factors in safety, and resilience engineering. If you want to know more about this, these are definitely the people whose books I would go read. Sidney Dekker is a little bit easier to wrap your head around. Erik tends to be a little bit more academic, I think is a good way to put it. Sidney Dekker wrote Just Culture and Drift into Failure. Both of those are fairly easy reads. You can kind of read a chapter at a time and get quite a bit out of them.

The second thing I wanted to do was pull in yet another book. This one is Thinking in Systems by Donella Meadows. I see some people, so well done. This really changed the way I think about systems engineering. And what I really liked about what she said was you can’t tell the system what to do. You can really only enhance the flows that get you to the goal you want. And I thought this was a good way to put it, because I can’t go into work tomorrow and announce we will have no outages. It just doesn’t work. So, you’ve got to build systems and flows that help you get to reasonable outcomes. And in this case, we can’t stop outages, but we can perhaps starve them of resources. And I think that’s the way to think about it.

Okay. Boundary conditions, because I don’t think certain outages are interesting. If you do a release and cause an outage, that’s interesting. If you didn’t update your SSL cert and cause an outage, that’s not interesting. So, one of these is the outage of inaction, and that’s forgetting to do your maintenance or not making reasonable plans. Your outages of action, I think, are the interesting thing. And we can think about those in a way that lets us move forward. The other thing is, when you break something through action, you pretty much know what caused it. You’re like, “I did a thing, I changed those two lines. My site broke. I bet I know where the problem is.” When you have these outages of inaction, you’re like, “We didn’t do something for six months, and now the site broke. I don’t even know where to start.” I mean, it could be anything. And really, at that point all you can do is go, “All right, call the person who’s been here the longest, with the most context,” and hopefully they know what that thing is.

And that’s a terrible thing to do in this economy. Like, we’re going to hire a super-duper 20-year engineer. They know stuff. Hopefully it’s the random stuff we need to fix this problem today. And that’s really why we want systems that you’re having a conversation with: you’re doing a lot of releases and you’re breaking that system down into better components, better separation, nice interfaces, things where you don’t need somebody with a crap ton of context to come in and go, “It was that thing.” As cool as it is sometimes to be the person who says, “I know what that is,” it’s not the way your team wants to run a system.

So, the other thing is it really depends on where in the stack you are. The further down the stack you are, the larger your outage is going to be. And being a terrible network engineer in the 90s, that’s honestly why I don’t do it anymore. Also, the tools were terrible. But I do work in infrastructure, and when I screw up, I can break entire sites for a good two hours. If you’re on the front end you can certainly break the site, but it’s a lot easier to recover from. If you break DNS you might break, you know, all the Hadoop and all the CDN and all of the sites. It’s a lot of fun. So, it is something to be aware of. Depending on where you are in the stack, you may operate differently, you may assess risk differently, and that’s okay. Just be aware of what your blast radius is.

Right. Actually, I’m going to make this smaller. I can read it. All right. So, the best way to get good at releases and not have big outages in your releases is to do releases. But you want to be a little bit smart about it. So, in high school, I was a terrible gymnast because I thought that for certain strength moves, the more you did them, eventually they would just work because you’d get it right. The reality is somebody needed to lift weights for six months, and then you could certainly do a lot of these strength moves. But nobody really explained this to 14-year-old me, and I was kind of an idiot. So, releases are like that too. If you do them over and over again, you get better, but you actually need some coaching.

You need to sit down and go, “Where are our weaknesses?” And really, going through that cycle, the virtuous cycle of releasing and deciding what didn’t work and what did work, is how you get good at it. The other thing is I like releases. I like releasing every day. Probably the best I ever did, I did 30 releases to production a day for like a month. I was trying to move to a new system and that was just the way forward. And I see companies talk about managing risk, but really a lot of times that’s code for avoiding risk or, as I like to call it, stockpiling risk for later, when somebody needs to actually go get something done. Things like blocking releases, Friday moratoriums, artificial gates on releases: I think all of these are terrible ideas. Your team knows how to release its software and how to gate that.

I see people try and start committees for releases when there are problems. And you know, it’s cool, like, “All right, I’m going to proxy all our outgoing traffic through CGNET.” What does that mean to a front-end developer or a backend developer or a DBA or anybody who’s not a network engineer? All of a sudden, you’re trying to get sign-off from teams with no expertise. And I don’t think there’s any value in that. However, there is value in trusting the teams. So sometimes the committees are okay, if they’re just like, “Do you have a plan? Is it written down? Is it automated? If I click this button, does it go? If I click this button, does it reverse?” That kind of stuff’s good, but generally, “I don’t believe this work should be done” is not, I think, something useful.
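The review questions in that paragraph (is there a plan, is it written down, is it automated, does one button deploy, does one button reverse) can be captured as a small checklist. The following is only a sketch of that idea: the `ReleasePlan` record, its field names, and the `missing_items` helper are hypothetical and not taken from the talk.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ReleasePlan:
    """Hypothetical record of the answers to the readiness questions."""
    written_down: bool
    automated: bool
    one_step_deploy: bool
    one_step_rollback: bool


def missing_items(plan: ReleasePlan) -> list[str]:
    """Return the readiness questions that are still answered 'no'."""
    checks = {
        "plan is written down": plan.written_down,
        "release is automated": plan.automated,
        "deploy is one step": plan.one_step_deploy,
        "rollback is one step": plan.one_step_rollback,
    }
    # Dicts preserve insertion order, so gaps are reported in question order.
    return [name for name, ok in checks.items() if not ok]


if __name__ == "__main__":
    plan = ReleasePlan(True, True, True, one_step_rollback=False)
    print(missing_items(plan))
```

Note that the check only asks whether the team has done its own preparation. It does not ask another team to judge whether the work should happen at all.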

All right. The other thing is releases exercise your mental model. When you’re going to do a release, you’ve probably got a hypothesis about something that’ll make the system better. You send it to production, and either it makes the system better, or it does not, or it does nothing, but you’re having this conversation with your mental model: I saw this, I wrote this, I released it, this happened. And the more releases you do, the higher the resolution. I kind of view it as sonar: I’m sending out lots of pings, so I have a good idea of how the system is.

People talk about haunted parts of the system where nobody has released in years. And it’s like, “Oh, don’t touch that, nobody knows how it works anymore.” Versus things in your main path, where you’re constantly hitting on them. You’re like, “Everybody knows how that thing works. We released that 50 times this week.” And the other thing is, because you’re doing so many releases, they tend to get smaller, and high-resolution images have smaller pixels. And this is just a better analogy than I had planned, so run with it.

And that kind of brings us to small batch sizes. A small batch size is really the secret weapon. If you’re trying to do big, large releases, there’s honestly no way to be agile that way. You’ve got a ton of stuff, it goes out, and people only sort of understand it. If I have to review your thousand-line change, I’m going to go, “Are you going to sit there while it goes? Great, fine.” I mean, I’ll read it, but can I actually get through that in less than three days and get my own work done? Probably not. If you have a 50-line change, I’m like, “Problem, problem, problem, problem. That’s awesome. Cool. Fix these, you’re good to go.” And really, it just takes fewer human resources.

The smaller your changes are, the easier they are to reason about. They’re easier to test and easier to run through your CI/CD pipeline. Small changes tend to be easy to reverse. Though, you know, you can’t always go back. Sometimes you can only fall forward to a new state, depending on what you’re doing, but at least you’ve broken the problem down into something reasonably sized.
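One way a team might keep batch sizes visible is to count the changed lines in a diff before asking for review. The sketch below assumes the change is available as unified diff text. The function names and the 50-line default are illustrative: 50 is borrowed from the size of change the talk calls easy to review, and is not a recommended limit.

```python
def changed_lines(unified_diff: str) -> int:
    """Count added and removed lines in a unified diff.

    File header lines ("--- a/file", "+++ b/file") are skipped. This is a
    simplification: a removed line whose own text begins with "-- " would
    look like a header and be missed.
    """
    count = 0
    for line in unified_diff.splitlines():
        if line.startswith(("+++ ", "--- ")):
            continue  # file headers, not content changes
        if line.startswith(("+", "-")):
            count += 1
    return count


def is_small_batch(unified_diff: str, limit: int = 50) -> bool:
    """Return True when the diff touches no more than `limit` lines."""
    return changed_lines(unified_diff) <= limit


if __name__ == "__main__":
    sample = (
        "--- a/app.py\n"
        "+++ b/app.py\n"
        "@@ -1,2 +1,2 @@\n"
        "-old\n"
        "+new\n"
        " unchanged\n"
    )
    print(changed_lines(sample), is_small_batch(sample))
```

A check like this works best as a prompt to split the change, not as a hard gate on releasing.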

All right. The other thing is engineering is not just for features. It’s for actually getting your release out the door, which is a feature. I’ve seen lots of teams go, “We don’t have time to fix this. Can our SRE or somebody come in and figure out how our software works? Figure out how it interacts with the system? And then fix all that?” And the answer is really, as an SRE, I will product manage that. I will tell you where your problems are, and I will put tickets in your sprint. And I will come back next week and go, “Where are these?” And if they aren’t there, then I will go talk to your product manager and go, “Cool, what features are you dropping? Because your feature of releasing your software safely is missing, and I’m not going to get involved with your software if you’re not invested in it.”

And I’ve seen teams do really well here, like an eight-person team. You’re on call for a week, you do release-related work, you do cleanup, and over time you actually fix a lot of the problems. And you’re like, “Okay, now maybe I’ll schedule half my week for feature tickets.” But, you know, sometimes you have a bad week and you’re like, “Oh, the new system is causing us problems.” You’ve already got this space in your sprint to take on problems like this.

So, we’ve got: do a bunch of releases, do small releases, spend time making our releases good. You need to tell the org why you’re going to do a bunch of releases. Particularly if your org is not moving at this pace. If you’re a high-volume, small-change team, other teams that release once a week are going to go, “What the hell are you doing?” I’ve literally had this conversation where they’re like, “Why did you do 200 releases this week?” And you’re like, “Well, that’s how we work. Works pretty well.” And they’re like, “Well, we don’t know why you’re doing this. Why would anybody do this?” And you’re like, “Well, I fixed a line and I pushed that out. And I fixed another line and I pushed that out.” And they’re like, “But what are you doing?” So, it’s actually really important to think about press releases, like, “This is our roadmap.”

This is why we’re doing things. I certainly wouldn’t get down to, “I’m fixing regex matching in our syslog today,” blah. Nobody cares. But you’re like, “Hey, I’m making sure Splunk or whatever works on all our machines, and this is a whole blah, blah, blah number of tickets that’ll be done this month.” People are like, “Oh yeah, you’re making things better.” It kind of gives you this cover, particularly in an org where you might be a fast-moving team and other teams are not. Also, when you go into a post-mortem, people are like, “Oh yeah, that thing was important. Let’s go through the post-mortem.” People get weird when they don’t think your work is important. The post-mortem tends to go poorly in my experience, and I don’t like it. I don’t think that’s healthy for the org, but it can happen.

All right. The other thing is, let’s see. Okay, so we’ve done the press release. We’re going to look for failure before we release. This is pretty much the actual conversation I’d start with my team. I’m getting a code review, or we’re just reviewing a ticket: “My current goal is [thing],” because that tells them where you’re going. Like, I want to get to here, and sometimes that’s the wrong thing. You know, “Okay, you shouldn’t be going there, you should be going here.” So, you’re already looking for correction on direction. Then you’re saying, “My change to get here is this,” and now they can go, “Cool, I understand your goal. Your change will or won’t work.” You’re bringing them along with your thinking: goal, how I’m going to get there. Then you move on to, “Here’s where I think my problems are going to be.”

“So, this is the work I’ve done to make sure there are no problems.” And then you’ve brought them along in your thinking, because if you just throw work over the fence, even to somebody on your team who has a lot of context, they’re not always going to get it with you. The other thing is, “What am I missing?” You’re inviting them: “Hey, I’ve thought about failure. I’m pretty sure there’s some failure in here. What is it?” I mean, you’re not going, “My thing is perfect. Just say yes.” If you ask for this in a way that goes, “No, no, really, I want you to think with me,” it really brings your team along with you.

And the last thing is, do we have signal? And I will say I have worked on teams where you’ve released something to canary and people are like, “Well, we’re going to wait three days.” What? Cool, how do we know it worked after three days? How do we know it didn’t work? And they’re like, “Well, nobody will complain.” Eh, maybe something a little bit more positive. I want to point to a thing. Literally, there are times, particularly in infrastructure, when we don’t have good insight all the way out to the edges. This may be the only thing you can do. I don’t recommend it. But have both positive and negative signals, however you need to get there. And if that’s, “Hey, we’re going to go to a representative number of machines and check them by hand,” it’s something you can do. It’s not a fun thing, but it certainly means we are going to get signal, because we don’t know how to get it any other way.

All the observability stuff, people have given tons of talks here. Go get as much observability as you can. Have checks like, “Okay, we’ll see this, we won’t see this, this graph will get better, this graph will get worse.” Whatever it takes to get there, have signal.
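The idea of stating up front which graphs should get better, which should get worse, and which should not move can be written down before the release and then checked afterward. The sketch below is a minimal illustration of that. The metric names, the `Expectation` record, and the 5% tolerance are all assumptions made for the example, and the before and after values would come from whatever monitoring the team already has.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Expectation:
    """Hypothetical pre-release statement about one metric."""
    metric: str
    direction: str  # "up", "down", or "flat"
    tolerance: float = 0.05  # relative change treated as noise (assumed value)


def observed_direction(before: float, after: float, tolerance: float) -> str:
    """Classify the relative change between two readings."""
    scale = abs(before) if before != 0 else 1.0
    change = (after - before) / scale
    if change > tolerance:
        return "up"
    if change < -tolerance:
        return "down"
    return "flat"


def evaluate(expectations: list[Expectation],
             before: dict[str, float],
             after: dict[str, float]) -> dict[str, str]:
    """Compare each expectation with what was actually observed."""
    results = {}
    for exp in expectations:
        if exp.metric not in before or exp.metric not in after:
            # A missing reading is its own finding: there is no signal.
            results[exp.metric] = "no signal"
            continue
        seen = observed_direction(before[exp.metric], after[exp.metric], exp.tolerance)
        results[exp.metric] = "as expected" if seen == exp.direction else f"unexpected: {seen}"
    return results


if __name__ == "__main__":
    expectations = [
        Expectation("error_rate", "down"),
        Expectation("latency_ms", "flat"),
        Expectation("new_path_hits", "up"),
    ]
    before = {"error_rate": 0.10, "latency_ms": 200.0}
    after = {"error_rate": 0.05, "latency_ms": 204.0, "new_path_hits": 30.0}
    for metric, verdict in evaluate(expectations, before, after).items():
        print(metric, "->", verdict)
```

Reporting "no signal" separately from "unexpected" keeps the two cases the talk distinguishes apart: the change did something other than planned, or nobody can tell what it did.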

So, in summary, there’s like six things there: small releases, lots of releases, signal. It just takes time to build these processes, and to build this trust between teams that you have a process. And it takes time to treat your software as releasable. I’ve seen teams that had this organic, Byzantine config file, and it made their automation incredibly complex, and it rarely worked, and they owned that code. You’re like, “Well, maybe we should just change that to something we can automate.” And they’re just kind of like, “Well, we can’t do that.” You’re like, “Well, you own the code. Change it to something that’s releasable.” Sometimes that’s upstream and it’s really hard, but open those tickets with upstream. I’ve seen teams like Redis go fix their configurator, and they got it out in like two weeks after people had been complaining to the guy for six months. Like just make configures.

But these things can happen if you push for them. You really have to engineer for them. And teamwork, that’s the biggest thing for me. I was a small team of one to two people for years and years. An eight-person team is excellent because there’s always somebody who’s slightly more of an expert than you in different spots, and they will see the things you didn’t. So yeah, go forth, do that work, and hopefully the outages you have will have very small blast radiuses, and you will have your least worst outage. So, that’s pretty much it.

Matt DeLaney
Matt is a Product Marketing Manager at LaunchDarkly. He's written about emerging technologies for various SaaS companies over the last several years. Matt also worked in sales at Tableau. He holds a BA in History from Cal Poly, San Luis Obispo. Matt has become noticeably less interesting with age.