If the goal is delivering higher-quality software with less risk, it may sound counterintuitive to release faster and more frequently.

And yet, releasing at a quicker cadence can actually make your process safer, according to our Principal Developer Advocate, Heidi Waterhouse.

So if you're skeptical of how concepts like shipping faster and testing in production can actually improve your end result and strengthen your process, watch the presentation below or dive into the full transcript.


Transcript:

Helen Beal:
Okay, I think it's probably time to get started. So hello everyone, and welcome to this InfoQ webinar, "Safe and Sensible Deployment and Launch with Reduced Risks" with feature management experts at LaunchDarkly. I'm Helen Beal, and I'm your moderator today. Our speaker is Heidi Waterhouse, the Principal Developer Advocate at LaunchDarkly. Heidi is going to present her thoughts to us for around half an hour and then we'll be having a conversation and taking your questions.

So when they occur to you, put them in the question box on your control panel and we'll answer them after Heidi's talk. Heidi, you and I have some things in common like having degrees in English literature and language. What else should the audience know about you?

Heidi Waterhouse:
Uh, I am in Minnesota where it is really, really cold right now. So I'm happy that I did not have to get out and go to the airport this morning because it was nine below zero Fahrenheit when I got up this morning. Um, I sew all of my conference clothes, even now, which gives me a place to put the, uh, battery pack for the mic in the before times and now it just gives me a good place to clip the mic. And I really find DevOps is an interesting transformation because I've been doing software since the extreme programming days and I think it's fascinating to watch how it's developed.

Helen Beal:
That is just perfect, that's genius, making yourself little pockets. I love it. So tell us all about being safe and sensible.

Heidi Waterhouse:
All right. I'm excited to have you here today. I hope that you get something out of this presentation. Please remember that you can ask questions at any time, and we'll be happy to answer them afterwards. So I put together this presentation on being safe and sensible because I've been spending a lot of time talking about feature management and feature flags, and I get a lot of pushback about how dangerous they seem and how creepy it is for somebody who's doing enterprise level work to talk about going faster because it feels really unsafe.

Heidi Waterhouse:
So why is there so much emphasis on speed in the CI/CD world? Doesn't anyone understand that it's important to be sure to release only sound software? We're working on insurance and banking and healthcare. We're not Facebook, we can't, we cannot move fast and break things. It's deeply unappealing to anybody who's working in a regulated industry or who does operations because the things that break are seldom a developer's problem. They're almost always an ops problem.

So why am I in front of you today to talk about going faster if that's not what is going to be appealing to you? Am I just a hot shot vendor spouting nonsense that doesn't make any sense for your environment? Am I like Kelsey Hightower explaining to you that you shouldn't use Kubernetes in production because it wasn't ready yet? Do you know how many people adopted Kubernetes after that talk? It was appalling. He was appalled.


I might be a vendor who's just spouting nonsense, but I think I have something valuable to offer you, a way to think about being faster and safer. This talk is not a book report on "Accelerate," but it could be. "Accelerate" is a book about how deploying more often is intrinsically safer, and leads to happier teams. Dr. Forsgren goes into the research in this book, but I have some illustrations that might help you understand why going faster is not counterproductive to safety, but in fact, increases it.

So you should see a thing pop up on your screen that is asking you a question. How often do you deploy? Is it once a day or even more? Is it once a week, once a quarter? Don't ask how often I deploy, it's kind of embarrassing. Go ahead and answer that poll because I think it will be useful for you to like, stop and think about how a- often you deploy.

So when I'm talking about going fast, what I'm talking about is continually integrating and testing your code and maybe deploying it. It's not about waiting for long release cycles, or major revisions, at least internally. External releases and internal deployment don't need to be the same thing. In fact, in a lot of industries and a lot of software, it's a terrible idea to keep moving people's cheese, software experience wise, but that doesn't mean that we can't be doing things behind the scenes to get ready for a big change.

There are good reasons why external changes are complicated. The US Army estimates that every time Word does a major release, they spend something like $6 million retraining the people who use Microsoft Word in the US Army to use the new thing. And that's just one example of how much it can cost to make a big external change. And that's one of the things that constrains us from doing changes often, but we don't need to be hung up on that.

And then safe. Safe is about making sure that changes don't cause harm, and ideally, add some value. We worry about changes because we want to be sure that we're doing the right thing. Every time we change an experience or a behavior, there is a risk that it could go wrong. If you make small changes, there's less chance of enough going wrong that it's going to break something. This is the theory behind fixing a typo. Unless you're using certain versions of, again, Microsoft Word, fixing a typo isn't going to break a text document, even though it might break a code document. You want to make small changes.


I think of it this way. It takes real effort to lose $1,000 at penny slots because you're making tiny, tiny bets. Odds are, you'll need a break before you can feed that many pennies into a slot machine. But if you're playing at the $50 blackjack table, it might not take you long at all. You want the same thing for all of your deployments, for your bets to be so tiny, that losing some of them is small in the overall scope of what you're doing. You want to nudge the wheel just a little bit, you want to do small course corrections, no sudden jerks unless there's an emergency.

I'm from Minnesota, and like I said, it's really, really cold here this week. It's actually cold enough that the ice on the road sublimates, it doesn't even melt, it just evaporates. But when it's icy on the roads, we know that it's better to be in slightly the wrong place than to make a sudden move and break our traction lock with the road such as it is. Being able to gently nudge your software gives you a lot more safety than needing to make a sudden 90-degree turn.

So when you're thinking about why going faster with deployment might be safer, we're also thinking about how making smaller changes more often could be less dangerous. And you want to be able to answer honest questions. A team at Microsoft told me that their code doesn't count as finished until it's tested, released, and returning metrics. I thought that was such a great definition of done.

So often we stop thinking about code as our problem once we've released it, but returning metrics gives us the ability to understand how things are performing in the real world, in production, and how we need to be adjusting and changing how it operates in production.

Nothing is safe if we don't know what it's doing. It's just not complaining loudly enough for us to hear it. But we've all had silent failures and we all know that those are the worst kind of failures because we don't know that anything is wrong. So let's talk about the answers from your first poll. Can you put those up?


All right, we are very evenly split. About 23% of you deploy daily or more. About a third of you deploy weekly, 23% monthly, 23% quarterly. Fortunately for everybody, there's no one in the don't ask category, but I assure you I do talk to people who are like, yes, we r- release every half year. And it's a major production and it takes like a month to get ready for and everything is terrible. So this deployment speed is something that the "Accelerate" book talks about: when you deploy faster, it becomes less of a big deal.

Uh, when you're learning to cook the first time you make a fried egg, it's kind of hard. And the second time you make a fried egg, it's kind of easier. But if you're a short order cook, you don't even think about making fried eggs, you just lay the eggs out and eggness happens. And that's what we want for our deploys. We don't want to have to think about it. We want to just be able to do it.

So here's your second poll question. Are deployment and release the same in your organization? But I'm not going to define those yet because I want you to answer first. 

Lots of people use these interchangeably and I think it's a, an interesting problem. All right. Deployment is not release. Remember what I said earlier about how deployment could go on behind the scenes and not affect release, and that we could all make small changes in deployment that didn't translate to user experience?

In order to do continuous integration, continuous deployment, continuous delivery, you need to really internalize that getting your code into production and letting users see it are two entirely different questions. They're not the same thing. Deployment is putting things into production, and release is letting people see it. Deployment is a software problem, a software organization problem: it's about who codes things and who does ops. Release is a business problem: you release when users want to have it. They're entirely separate areas of control.
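To make that separation concrete, here is a minimal sketch in Python, with an in-memory dict and invented names standing in for a real feature-management service, of code that is fully deployed to production but stays dark until a release decision flips a flag:

```python
# Hypothetical sketch: deployment ships the code path, release is a flag flip.
# The dict stands in for whatever flag service you use; all names are invented.

FLAGS = {"new-checkout-flow": False}  # deployed and dark until the business says go

def is_released(flag_key: str, default: bool = False) -> bool:
    """The release decision: changeable at runtime, no redeploy required."""
    return FLAGS.get(flag_key, default)

def legacy_checkout(cart):
    return {"total": sum(cart), "flow": "legacy"}

def new_checkout(cart):
    return {"total": sum(cart), "flow": "new"}

def checkout(cart):
    # The new code is in production (deployed), but users only see it
    # once the flag flips (released).
    if is_released("new-checkout-flow"):
        return new_checkout(cart)
    return legacy_checkout(cart)

print(checkout([10, 20]))          # {'total': 30, 'flow': 'legacy'}
FLAGS["new-checkout-flow"] = True  # "release": flip the flag, no deploy involved
print(checkout([10, 20]))          # {'total': 30, 'flow': 'new'}
```

In practice the flag state would live in a flag service rather than a dict, which is exactly what lets the business flip it at release time without anyone redeploying.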


And when you separate those out, when you start thinking of them as different, you start to see why continuous deployment makes sense even if you can't be in an industry where continuous release makes sense. So let's go ahead and look at your answers for that. I got a sneak peek at these and I'm not surprised. For 62% of you, deployment and release are the same thing. Wow, what if we can decouple those? What if I said to you, "You can deploy on Friday, and let everything soak in production, but nobody sees it." Nobody knows that it's there. It's not affecting anybody.

And then when you're ready to release in a week, or at your quarter, or whenever it is, then you turn it on. And everybody has the same experience. It's a revolution in how you think about CI/CD. Let's talk a little bit about what we mean by the difference between experience and function. All right? Operational is not the same as perfect. Have you ever been on a flight, do you remember flights? I remember flights.

Have you ever been on a flight that was delayed for something that seemed completely trivial? Like a broken latch on a bin or a seatbelt? That's because planes are made of hundreds of thousands of parts. And lots of them are critical for health and safety. And there are some things that force an immediate grounding, but there are some things that exist in an error budget. A set of things that could go wrong without directly endangering anyone, but shouldn't occur in large numbers because it reflects the state of maintenance of the plane.

If a plane exceeds its error budget, even with just a broken bin latch, it gets grounded until the number of broken items dips back below the error budget. All right, we're thinking about that. Every plane you've ever gotten on is broken, it's just not fatally broken. And every release you've ever made is probably broken. It's just not fatally broken.
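As a rough illustration of the error-budget idea, with a made-up threshold and severity model rather than any real airline or SRE policy, the decision reduces to counting tolerated small failures against a limit:

```python
# Hypothetical error-budget check with invented numbers.
# Critical failures ground the plane immediately; minor ones only in aggregate.

ERROR_BUDGET = 5  # how many minor open defects we tolerate at once

def should_ground(open_defects):
    if any(d["critical"] for d in open_defects):
        return True  # some failures are never acceptable
    # Too many small failures reflects the overall state of maintenance.
    return len(open_defects) > ERROR_BUDGET

print(should_ground([{"critical": False}] * 3))  # False: broken bin latch, still flyable
print(should_ground([{"critical": False}] * 7))  # True: over budget, ground it
print(should_ground([{"critical": True}]))       # True: grounded immediately
```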

So let's think about networks and network capacity. We've learned that computers and networks operate really poorly as we approach 100% capacity. That's because the break point is at 100%.

It's much better to over-provision a network and run it at 80% so that if there is a sudden spike, we don't break and we can keep running. When we apply that network thinking to software, we can think of designing our systems, our microservices, our known fragile parts, to work effectively even if they can never be perfect. We can build some slop in, we can have some redundancy. A perfectly optimized system is very brittle because it does exactly what you need it to do and nothing more, and we don't live in a world where nothing more is ever demanded.

Sometimes things go wrong. Sometimes we need something to happen. This talk is about feeling safe. I really want you to leave this talk inspired to make your workplace feel safer for you. And one of the ways that we do that, and again, it comes out in the "Accelerate" book, is psychology. It turns out that being safe on our teams, feeling safe as humans, lets us make better software. Avoiding errors is not actually the best way to code something that's solid.

In fact, we need to embrace the fact that there will be errors in order to code something that's resilient. So here's the third poll question. What makes deployment psychologically difficult for you? What makes it scary? What's the hardest part of how deployment goes for you? And remember, I'm talking about deployment, not release. What is the hardest part? Is it testing? Is it getting all of those tests in and also the test from QA who makes your software do things that it was never intended to do?

Is it compliance? You have to fill out all the ticky boxes and they're very important ticky boxes, but you have to fill them all out? Is it risk? Is it like if we launched this, we could take everything down? Tell me what you're scared of. We'll talk about that in a bit. What makes you feel safe? You want to have a trustworthy team, you want to have management that's trustworthy and you want to believe that deploying faster is going to be safer.


The way I explain it is this: when you're an adult, a parent who's teaching a kid to ride a bike, the thing you say to them is "pedal faster and you won't fall down." When you're a kid who's learning to ride a bike, you're like, "I've fallen down before, I'm pretty sure concrete is hard. I'm not sure how going faster is going to make that better." It seems like that would hurt worse. As an adult, you understand that physics means that going faster means that you have more angular momentum, and blah-blah-blah you're going to stay upright.

As a kid, you don't have that faith until you can experience this. This is why so many of us had a parent run along behind us and push so that we could get up to speed and start to feel, start to experience how much safer it is to go faster. So that's one aspect of psychological safety. 

Another aspect is that your team communication needs to be blameless. My therapist, my kids' counselors, and the character in a cartoon I have over my desk all say the same thing: everyone's doing their best given their circumstances.

People really don't maliciously fail or if they do, it's because they're reacting to structural issues. If they're doing things that seem harmful to us, it's because they're maximizing for something that we don't see or don't care about. I know it feels like people are jerks on purpose sometimes. What they are doing is maximizing something you don't care about. I think that a lot of the mask dialogue is about this. It's like I care about being able to breathe freely, or I care about being able to go out in public.

Like they're- they're maximizing different things. So what that means on a team doing software is we need to not talk about people like they meant to screw up because they didn't. The system has brought them to a place that they can't compensate for. Their skills are insufficient to deal with the broken system that they're trying to work in. It's an interesting thing to think about on your teams because I think we've all failed in our teams and we've all felt bad about it, but it takes a special kind of safe place to be able to say, "I failed, and here's the thing that our culture, our environment did that contributed to that failure."

It needs to be non-catastrophic. Mistakes are going to happen by the very nature of humanity. We can work to minimize the frequency, but our systems are both complex and involve humans and we can't get around that. So let the mistakes be small. Something you can correct with a nudge or a rollback or a flag flip. Something that isn't about downtime measured in hours, or costs measured in millions. It's a lot easier to be blameless when the whole incident report from report time to recovery takes less time than the post-event retro.


Make your changes small, use steering techniques appropriate to your speed and vehicle. The smaller the bet, the less catastrophic the outcome. So if you think about somebody again who's- who's betting in Vegas: if you're betting your- your pocket money, your allowance, it's one thing; if you're betting the mortgage, it's another thing, and one of them is going to get you a lot more blame at home. The last aspect of psychological safety is curiosity. You can't be curious if you're feeling unsafe, you can't really ask good questions if the main question in your head all the time is, am I going to get fired?

So when we're psychologically safe, we have room to be curious. Why did it happen that way? What in the system caused that? What could we do differently? There's something we're not thinking about. 

My wife used to do tech support at a university and she had this great story about a monitor that gave one of her users all sorts of trouble going out intermittently, but it never made sense. It wasn't timing, it wasn't software, it wasn't hardware. She finally got the idea to walk around and look at what else was going on and it turns out the monitor was against a wall that had an electron microscope in it.

This makes sense in a university, but it's unlikely to happen to most of us. The thing is, the problem was solved because she was curious about what else could cause the outages besides the usual things. This is what the observability people are talking about. High-cardinality data, and the ability to interrogate it, gives you the chance to ask questions you couldn't predict needing to know about. Logging is about asking questions that you can predict. Observability is about asking questions about the unknown unknowns.
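A minimal sketch of that difference, with invented field names and values: a log line answers a question you predicted, while a wide structured event carries the high-cardinality detail that lets you ask questions you didn't know you would need:

```python
import json
import time

# A traditional log line answers a question you knew to ask ahead of time:
print("ERROR: checkout failed")

# A wide, structured event per request keeps the high-cardinality details
# (user, build, flag state, duration) so you can slice by things you never
# predicted mattering. Field names and values here are illustrative only.
event = {
    "timestamp": time.time(),
    "service": "checkout",
    "build_sha": "a1b2c3d",
    "user_id": "u-482913",      # high cardinality: effectively unique per user
    "region": "us-east-1",
    "flags": {"new-checkout-flow": True},
    "duration_ms": 842,
    "outcome": "error",
}
print(json.dumps(event))  # ship this to whatever event store you query later
```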

And I think making a psychologically safe team makes observability make more sense. In an unsafe team where there's a lot of blame going on, observability is harder to sell. One of the ways that we get psychological safety is by giving ourselves guardrails. If we know that there's something dangerous, we can add layers that prevent the bad thing from happening. Are you excited about the idea of Apple bringing back MagSafe connectors for laptops? Have you ever actually completely destroyed a laptop by tripping on the cord?

But even if you haven't, you're going to feel safer knowing it won't happen because of the MagSafe connector. It's a psychological safety that you find valuable. So there are two aspects to this. The first one is risk reduction. Risk reduction is trying to make sure the bad thing doesn't happen. It's things like vaccination, and anti-lock brakes and train gates. They prevent the bad thing from happening. They are guardrails that prevent us from going off the road.

Risk reduction is a really important part of how we design our systems, but we also have harm mitigation. Harm mitigation doesn't matter until something goes wrong. No matter how much we try to avoid or mitigate risk, the truth is that bad things do happen. That's why we wear seat belts and life vests and why we install smoke detectors and sprinkler systems and doors that swing outward. Ideally, we wouldn't have house fires in the first place, but if we do, we can design our buildings so we don't die in them.

Things like seat belts and building codes and RAID arrays exist because we know the bad things are going to happen. Drives do fail, buildings do catch fire, car accidents do happen. So if that's going to happen, how can we make this less catastrophic? How can we make bad outcomes less bad? Well, let's talk about the beautiful future of where I hope we're heading. In the beautiful future, a lot of our work has been taken by robots or automation. They're not smarter than we are, but they never accidentally add an extra space in Python.

As we climb up layers of abstraction from machine language, we get closer to telling programs what to do instead of how to do it. After all, a computer is only slightly dumber than we are, but it's more patient than we are. This is the airbag construction from the Pathfinder mission to Mars. This is a thing that NASA and JPL launched and they didn't know how it was going to land. But they knew it was going to land safely because it was going to bounce. All of these games like "Angry Birds" that have physics engines, nobody exactly sat down and calculated what each pixel's deviant path was going to do.

We built an engine and then we said to the engine, take care of figuring this out. We want to test the experience of what people are- are having and not what we intended to do. It's really easy for us to think of tests and test results as real, but they aren't. They're symptoms of the living, evolving organism that is software. We have so many layers of abstraction that we don't actually know what's going on. We just believe we do, and the sooner we understand that we're dealing with a system and not just a program, the more we're going to learn to ask the right questions.

When we go to the doctor, the doctor doesn't ask us if our thyroid is feeling good because we don't know. The doctor asks if our hair is thin, if our muscle tone is bad, if our fingernails keep breaking, and if it seems like we might have a thyroid problem based on those symptoms, the doctor does a blood test and says, "Your thyroid-stimulating hormone is very high; your body is trying to get your thyroid to kick in and it's not working very well."


We can't directly test thyroid hormone, we can only test the thyroid-stimulating hormone that tells us what our body thinks is going on. I think that's an important thing to remember when we're thinking about how testing in software works. And this quote is like the way to sum all that up. You can have as many nines as you want in your service. But if your users aren't happy, it doesn't matter. It doesn't matter if you have uptime if you don't have an experience that people are finding useful and reliable.

I want you to do all your things in production. Remember how I said that deployment was getting things to production and release was getting things to people? If you get things to production, you're going to be able to test them. Maybe you do some staging, but you can't replicate the level of traffic that you're getting on your service in any way other than production. You're not going to find the edge cases and problems without testing at that amount of traffic. We have enormous difficulty replicating the complexity, scale, interoperation and generalized weirdness of production.

Are you going to pay for licenses for all of those APIs that your production server is linked to? Are you going to create artificial data that is weird and, uh, unnormalized on your test server? No. Test it in production, just don't show people that you're testing it in production. Test it in production, but keep it away from the general public, because when you think about it, we're all testing in production.
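One common way to test in production while keeping it away from the general public is to target the new code path only at internal users. Here is a hedged sketch using a hypothetical email-domain rule, not any particular vendor's targeting API:

```python
# Hypothetical targeting rule: the new path runs in real production, against
# real traffic and real integrations, but only internal testers ever see it.

INTERNAL_DOMAIN = "@example.com"  # assumption: staff accounts share an email domain

def sees_new_search(user_email: str, flag_enabled: bool) -> bool:
    return flag_enabled and user_email.endswith(INTERNAL_DOMAIN)

def search(query: str, user_email: str) -> str:
    if sees_new_search(user_email, flag_enabled=True):
        return f"new engine results for {query!r}"    # testing in production
    return f"current engine results for {query!r}"    # the public is unaffected

print(search("winter boots", "dev@example.com"))     # internal user gets the new path
print(search("winter boots", "customer@gmail.com"))  # everyone else sees no change
```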

We all find things in production that cause outages or poor experience, just some of us are doing it on purpose and some of us are finding out about it the hard way. So let's talk about the answers to question three. 


What makes your deployment hard? Almost all of you said it's not testing, we have automated testing for most of it. And it's not compliance, we know how to fill out the ticky boxes. It's risk. A thing that makes deployments hard is risk and the thing that makes deployments risky is making them too big and too slow and too massive.

If you make them smaller, it's going to be less risky, I promise. So what can we do about this state of constant breakage and unreliability? How can we make our teams feel like it's possible to try new things without taking risk? How does going faster matter? Well, it depends on how crucial your product is. If you're in a life critical software system, almost none of this talk is directly relevant. Some of it's indirectly relevant, but you have different constraints. But how crucial is your product, and how many people are using it?

If it's a news service, how spike resistant is it? Because the times that you get spikes are because people need the news. 

So think about your product and your audience and figure out how much failure is tolerable. Reduce the potential for catastrophes. A catastrophe is when a bunch of errors pile up together. Humans are slow in reacting to emergencies, that's why we have circuit breakers. When something goes wrong, the circuit breaker makes sure the system fails to a safer state rather than causing harm.

So preventing cascade failures is part of how we make sure that our deployments are less risky. Putting stops in to make sure that failures are contained to one area of the software means that you can keep working at reduced capacity. And adding control points to your system allows you to react flexibly instead of just turning it all on or all off. Maybe you're having page load problems and you could take off like the- the user ratings at the bottom and still have your core business functionality.
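Here is a small sketch of such a control point combined with a circuit breaker, using an invented ratings feature and failure threshold: when the flaky third-party call keeps failing, the breaker opens and the page degrades to its core functionality instead of failing outright.

```python
# Hypothetical circuit breaker around a non-critical, third-party-backed feature.
# If the ratings call keeps failing, we stop calling it and render the page
# without ratings: the failure stays contained, core business functionality survives.

class CircuitBreaker:
    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failures = 0

    def call(self, fn, fallback=None):
        if self.failures >= self.max_failures:
            return fallback          # breaker is open: fail to a safer state
        try:
            result = fn()
            self.failures = 0        # a healthy call resets the count
            return result
        except Exception:
            self.failures += 1
            return fallback

ratings_breaker = CircuitBreaker()

def fetch_ratings(product_id):
    raise TimeoutError("ratings service is having a bad day")  # simulated flakiness

def product_page(product_id):
    ratings = ratings_breaker.call(lambda: fetch_ratings(product_id))
    return {"product": product_id, "ratings": ratings}  # page renders either way

print(product_page("sku-123"))  # degrades gracefully instead of erroring out
```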

Think about where you could add controls to bypass third-party integrations or known flaky servers. So here are the things that help reduce catastrophes by breaking your risk up into smaller chunks: circuit breakers, isolation, control points, the ability to do rapid rollback, layered access so that people can't just YOLO things out to the public, and using abstractions to manage your change, because failure is inevitable.


It is the condition of everything we do, every complex system has a lot of failure modes, but catastrophes are almost never the result of a single failure. Instead, they're a collection of small failures that systems of safety protect us from most of the time. It doesn't matter that you put the lid on your travel mug wrong until you start to grab it, uh, drop it and grab it by the lid. Most of the time, it's not going to matter until it does.

So if this talk was too long, and you read Twitter instead, let me say if you want to release often, and you do, make it safer, and less scary by making it smaller, and more often. So thank you for your time. I'm happy to a- answer your question. It's going to be great.

Q&A Session

Helen Beal:
Oh, it was fantastic. Thanks so much Heidi. We have some feedback from Eric in the audience: almost a religious fervor when you were talking about psychological safety and particularly the element of being blameless. He said he's really concerned about this for himself, um, and others. Um, I've done some research into blame and actually one of the things I've discovered in some of the reports I read is that blame is actually addictive in the brain, so it kicks off your addiction circuit, so we really like it as humans.

I assume that that's because we're kind of making ourselves free and it feels good that we've- we've- we've now got that stressed away from ourselves. But I wonder- wondered what tips you have for the audience on how to move from a culture of blame into a culture of accountability, what- what works?

Heidi Waterhouse:
So Etsy did a lot of pioneering research on how to do blameless, uh, PagerDuty has certainly continued their ... It- like, I realized they're not all doing incident response, but their discussion about how to make a culture more, less- less, less blameful and more investigative and researching. Uh, those are the two places that I'd look to start, but I think it's useful to understand why you're being rewarded for blaming people.

Like I did a talk a couple years ago about dog training, and humans. And honestly, all of the best management books I've read are dog training books because once you take the like human interaction element out, you can really see that we do things that we are rewarded for. And we avoid things that we are punished for. But actually, the reward stimulus is much stronger. So when we want to make a more blameless culture, what we need to do is start rewarding people for doing the right thing.

So I've worked at companies where people get an award for taking down production. They're like, "Hey, you found a new way to take down production, you get, you know, ad-, like, we fixed it now, you get a dinner out for finding that flaw. And now we fixed it, and it's never going to go down that way again." That makes it a lot less terrifying to report that you've taken down production if you're going to get a treat for it than if you're going to get fired for it.


I really like the stories about AWS where like if S3 goes down, we all have a bad day, right? Um, and somebody said, so the- the guy who pushed the button that took S3 down, I think it was like five years ago, uh, did he get fired? And they're like, "First of all, we're not telling you who- who it was because that would lead to blame. And second of all, no, we just spent millions of dollars teaching him never to do that again (laughs), and teaching our organization that we needed a guardrail there."

It's not human error, it's a system failure. There's also a bunch of stuff, uh, Paul J. Ree- or J. Paul Reed, uh, does a bunch of stuff on how to do blameless culture and risk and- and analysis. Um, I'm trying to remember the name of the conference, it was really great. I want to say it was REdeploy, uh, where we just talked about failure. So talk about failure in normal life, and when you make it smaller, it's a lot easier to not get completely bent out of shape.

Helen Beal:

I- I love that dog training idea and I love that you- you really get how the brain is working. So what's happening there, kind of from a neurology perspective, is right at the back in our amygdala. Um, we've got our kind of fight, flight response, our kind of basic lizard brain. So we've got, um, our avoidance response when we're afraid of something and then as you've said, um, the novelty, um, drives our approach response and that is stronger. So if we can switch people from, um, fear and avoidance to novelty and approach that really, really helps, but, I think [crosstalk 00:35:11]

Heidi Waterhouse:
Oh sorry.

Helen Beal:
That's okay.

Matt Stratt