Building resilient systems in the face of inevitable failure
Heidi Waterhouse, a Senior Developer Advocate at LaunchDarkly, spoke at last month’s Test in Production Meetup in Berlin, Germany, on how to build software in a way that avoids disaster.
Failure is inevitable. All our systems fail, everything is flawed, humanity is doomed to imperfection. But disasters are not inevitable. Build your systems for flawed success.
Watch Heidi’s full talk to learn three keys for controlling failure when building software. If you’re interested in joining us at a future Meetup, you can sign up here.
Tom Fashola: So, our first talk is going to be held by Heidi, and Heidi works for LaunchDarkly, and it’s going to be called How to Lose a Launch. Play on words there. So, Heidi, please come to the stage?
Heidi Waterhouse: Thanks Tom, and good evening, Berlin. It’s nice to be here, and thank you to Contentful for hosting us. We need the remote. All right, so let’s talk about how to lose a launch. You’re here because you would like to know how to really screw up your launch and make your software go poorly, and it would be great if your company also looked kind of dumb. That’s what you’re here for, right? That’s why we come to meetups.
Heidi: So, I got you. I’m on your side. We’re going to do this together. I know how I can help you. What you’re really looking for is a dumpster fire of release, the kind of thing where you’re like, “Help, help, let’s put it back the way it was, oh God, we can’t.” Right? We’ve all had that day, and that day is often on a Friday, because God hates us. So, dumpster fire. So the first thing you want to do … hey. If you’re going to make a launch that really fails is make it really, really big. What you’re looking for is something massive, like the Titanic. It’s unsinkable, it’s not going to go wrong, we’re going to set a speed record across the North Atlantic, we’re going to put a bunch of people on it. It’s going to be great. Big launches fail best.
Heidi: All right? The next thing is you want to make it all one piece. If there are any little parts that could succeed, you want to pull those out and just make sure that they’re all rolled in together. Everything should be a tarball, right? Yeah. You’ve done that release too, like a scanner/printer/fax machine. Whose parents bought one of these? Yeah, right? And then they’re like, “My scanner doesn’t work anymore.” I’m like, “Yeah?” “But the printer still works fine, so I’m not going to get rid of it. It’s just going to be the printer/fax machine/not scanner.” So once you’ve made something all one piece, any failure is a failure of the whole.
Heidi: Make it irrevocable or horrible to reverse. Make it so that you can never undo this. Make it a database, right, that you don’t keep any of the old data for … yeah, yeah, I see those winces. You know that face. Yeah. You know that one tattoo? That one that you got in college when you were drunk? That kind of thing. Or, make it glitter. I’ve been told that people do not fly out of Berlin on Monday morning because that’s when all the party people leave, and the plane is just aerosolized glitter and drunkenness. I can neither confirm nor deny that I’ve ever experienced these things, but glitter, I will speak as a parent, is forever.
Heidi: So, what have we learned about failing? How are we going to make this launch an utter failure? We want to get fired. Yeah? Yeah. Okay. So what we’ve learned is we want to make it big, we want to make it monolithic, and we want to make it irreversible. Those are the elements of a failure. But, I’ve been informed that some of you may still want to keep your jobs and sleep at night, and feel comfortable with where you are in the world. So, what is the reverse of that? Well, we want to make it smaller, and faster. The smaller your release is, the faster you can go. This is the whole point of Nicole Forsgren’s book Accelerate. If your releases are small and less dangerous, you can do them faster and your cadence can be faster and you can keep going.
Heidi: I have an analogy that works great for this in the United States where we are savages who teach our children to ride our bikes all at once with the pedaling and the steering and the balancing. In Europe, you’re all more civilized and you teach the kids on Strider bikes. Right? But in the US, what you do is you take the kid and you say, “If you pedal fast enough, you won’t fall down.” And then you fling them at the pavement, and the kid’s like, “This seems terrible, I do not want to go fast, pavement hurts.” What I know as an adult is that if you go fast enough, you won’t fall down. But what they know as a kid is going fast seems really dangerous. This is true about software release too; the faster you go, the safer you are because everything is smaller and more incremental. You want to make it decoupled. You want to make it so that if one thing fails, not everything fails, just some of the things failed. That’s okay. We can keep going, we can fail forward. And you want to make it reversible, you want to have a way to back everything out, and you want to not make a deploy. Because if you make something a deploy, then you have to redeploy it, and if you have to redeploy it, the odds of introducing a problem are more than zero.
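One way to make a change reversible without a redeploy is to ship both code paths and choose between them at run time with a flag. The sketch below is a minimal illustration under stated assumptions: the in-memory `FLAGS` dictionary, the flag name `new-checkout`, and the `checkout` functions are all hypothetical and do not represent any particular vendor's SDK. A real system would read flags from a source that can change while the process is running.

```python
# Hypothetical in-memory flag store (an assumption for this sketch).
# In practice the flag value would come from a service or config source
# that can be changed while the process keeps running.
FLAGS = {"new-checkout": True}


def flag_enabled(name: str, default: bool = False) -> bool:
    """Look up a flag at call time so a change takes effect immediately."""
    return FLAGS.get(name, default)


def legacy_checkout(total: float) -> str:
    """The old, known-good code path."""
    return f"legacy:{total:.2f}"


def new_checkout(total: float) -> str:
    """The new code path being released."""
    return f"new:{total:.2f}"


def checkout(total: float) -> str:
    # Both paths ship in the same deploy; the flag picks one per call.
    if flag_enabled("new-checkout"):
        return new_checkout(total)
    return legacy_checkout(total)
```

Turning the flag off (`FLAGS["new-checkout"] = False`) sends the next call back to `legacy_checkout` with no new build, which is the "reversible, not a deploy" property described above.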
Heidi: So, if something is too big to fail, what do you do? Observability. You keep track of what’s going on. This ship is part of the North Atlantic Iceberg Watch Patrol, which is still a thing that exists, even in the age of satellites, and they have iceberg forecasts for around Greenland so that ships do not run into icebergs anymore. There’s something like that in the South Atlantic too. Basically, all the time, there are people out there saying, “Is that an iceberg? Is that an iceberg? Would we know if that was an iceberg?” It’s like the observability of things that kill ships. If it’s something that’s too integrated, what do we do? We break it up into services. What if the scanner and the fax machine were different? And then what if we get rid of fax machines forever because all they do is scream and use paper? I have politically refused to learn how to use a fax machine for years, because if I know how, people will make me do it.
Heidi: And what about permanence? I can’t help you with glitter, just no glitter ever. Our CTO has a policy for our children, they are just not allowed to own anything with glitter. I don’t think that’s going to work out for him forever, but points for effort. If something is permanent, figure out how you can make it less permanent. Figure out how you could do it more gradually. Figure out how you could have not a blue-green deployment, but a teal deployment, where you’re doing it all at once rather than dumping any information.
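Doing something "more gradually" is often implemented as a percentage rollout: each user is assigned a stable bucket, and the change is enabled only for buckets below a threshold that is raised over time. The following is a minimal sketch of that idea, not a description of any specific product. The function names, the flag name, and the choice of SHA-256 for bucketing are assumptions made for illustration.

```python
import hashlib


def bucket(user_id: str, flag_name: str) -> int:
    """Map a user to a stable bucket in the range 0..99.

    Hashing the flag name together with the user ID keeps a user's bucket
    fixed for one flag while spreading users differently across flags.
    """
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def in_rollout(user_id: str, flag_name: str, percent: int) -> bool:
    """Return True if this user falls inside the current rollout percentage."""
    return bucket(user_id, flag_name) < percent
```

Because the bucket is deterministic, raising `percent` from 10 to 50 only adds users: anyone who already had the change keeps it. Lowering `percent` back toward 0 reverses the rollout.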
Heidi: So, we think about successes, because we have the technology. Like Y2K, is anyone here old enough to remember working in Y2K? Excellent. It was … we weren’t making it up. I realize that everybody who started work after that feels like there’s sort of a lot of drama about this, but it really was that bad, and we spent a lot of effort fixing it, and I personally plan to retire before 2038, because that’s not going to go well either. But we know how to fix that problem, right? We fixed it. It was a miracle. The world kept ticking. Or the Mars Rover. This is Curiosity [Opportunity]. This thing was a golf cart-sized rover designed to travel 1,000 meters, and operate on Mars for 90 days. It traveled 45 kilometers, and logged 5,000 days as of February 2018. That is a success. That is a stellar success. That is so far beyond its mission parameters that we just can’t even see them, they’re that far back.
Heidi: But, we’re really probably not going to launch the Mars Rover that lasts multiples of how long it’s supposed to manage. Instead, what we’re going to have in our life is partial success, semi-success, successful failures. The best example of a successful failure that everyone knows about is Apollo 13. They launched this mission, they’re going to go to the moon, they’re going to go check out the far side of the moon, they’re going to do a bunch of things, and it blew up. But it didn’t blow up all the way. It only kind of blew up, and there were very scary things happening on that ship at the time, and they were running out of carbon dioxide filters, and they were getting poisoned. And people kept innovating, and kept running to stay ahead of these crises, and they managed to get all three astronauts back to Earth safely, despite the fact that, let’s be clear here, their spaceship had a big, honking hole in it. Not an ideal condition for spaceships, because space is kind of not great for us. Humans are very bad at tolerating space, we’re not water bears.
Heidi: We’re prone to think of success and failure as binary, a test passed or it didn’t, service is up, or it’s not. But that’s not how our systems actually work. They’re always in some state of brokenness. They need upgrading, they’re running slowly, they’re not parsing all the data, this one isn’t a little bit perfect; we have dependencies that might be left-pad. We are not in control of our systems. This is the talk that I’m giving on Wednesday, it’s everything is imperfect, and the only thing that we can do about the fact that we are part of a complex and connected world is make sure that we have built in some safety devices for ourselves so that we can make sure to keep going, to come back to Earth eventually, even if somebody has blown a hole in our spaceship.
Heidi: One of the real life examples for this is circuit breakers and ground fault interrupt. So, a circuit breaker exists because humans are vulnerable to electricity and also not as fast as electricity. So, before we do something that sets the house on fire or kills us, we would like to induce a failure. A circuit breaker is a designed failure that just makes things stop happening. They’re like, “Oh, you’re using electricity improperly, I think I’ll just shut that off now. Right now. Yeah, none for you.” Because having a human with our slow reflexes and our need for reaction speed is not going to work out well in a life critical situation. I would rather have the whole house go dark than have somebody die because they stuck a fork in the socket.
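The same idea is commonly applied in software as the circuit breaker pattern: after repeated failures, calls to a struggling dependency are cut off automatically instead of waiting for a human to react. The sketch below is one minimal way to express it. The class name, the thresholds, and the injectable `clock` parameter are assumptions for illustration, not part of the talk or of any specific library.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    """Stop calling a failing dependency until a cooldown has passed."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before tripping
        self.reset_after = reset_after    # cooldown in seconds
        self.clock = clock                # injectable so tests can control time
        self.failures = 0
        self.opened_at = None             # None means the breaker is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Designed failure: refuse immediately instead of piling on.
                raise CircuitOpenError("circuit open; call skipped")
            # Cooldown elapsed: allow one trial call. A single failure
            # during this trial trips the breaker again.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success resets the failure count
        return result
```

A caller wraps each request, for example `breaker.call(fetch_feed)`, where `fetch_feed` is a hypothetical function, and treats `CircuitOpenError` as the signal to fall back to a degraded mode: the software equivalent of the house going dark rather than catching fire.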
Heidi: So, when we’re thinking about success and failure, I don’t want you to be thinking, “Heidi said we can never be perfect, so it’s not worth trying.” I want you to think, “How can I be the most perfect and most resilient possible given the fact that I live in a terrible, imperfect world? And how can I design my systems around that, around the fact that sometimes my feeds won’t come in, and sometimes they won’t go out, and sometimes somebody will cut a line with their backhoe, and sometimes … all sorts of things happen. How can I design around that going badly?” Because that’s really what I want for you. I want you to have problems, and not disasters.
Heidi: A disaster is a collection of problems, a set of things that go wrong. It’s never just one thing that goes wrong, it’s the combined efforts of a team of people and materials and decisions screwing up. When you think about every disaster you’ve read about, it’s never one thing went wrong. It’s never, “The ship was not designed to hit icebergs.” It’s, “And the captain was going very fast, and it was dark, and there was a steam engine explosion.” It’s all of these things together that really make something a disaster. So when you think about a disaster, what you’re trying to do is put in firebreaks so that things don’t continue to spiral out.
Heidi: So, if this talk was too long and you read Twitter instead, here are the things I want you to take away. Failure is inevitable. All our systems fail, everything is flawed, humanity is doomed to imperfection. But disasters are not inevitable. As you can tell from the fact that we come here by jet planes, and we get on trains, and most of the time we don’t die, disasters are not inevitable. Build your systems for flawed success. Build your systems in the understanding that things will go wrong, and you need to be ready for that. And because I like torturing my marketing team, if you would like a free t-shirt, which we did not bring because who wants to travel with t-shirts, you can visit this URL, and we will mail you a t-shirt, or we have stickers and books on the back table. Thank you very much for your time and attention, and here’s Tom.
Tom: Thanks, Heidi. All right, actually I did bring a t-shirt or two with me, so we might have to do one of those raffles or something.