Chaos Engineering and Continuous Verification in Production

On April 30, Casey Rosenthal, CEO and Co-Founder of Verica.io, spoke at our Test in Production Meetup on Twitch.

Casey debunked the myths of reliable systems, explaining how many of the intuitive steps people take to make systems more reliable are actually counterproductive. Moreover, he explained how Continuous Verification can help software engineers avoid such pitfalls.

Watch Casey's full talk.

FULL TRANSCRIPT:

Yoz Grahame: April the 30th 2020. My name is Yoz Grahame. And I am a developer advocate for LaunchDarkly. Joining me today is Casey Rosenthal, the CEO and founder of verica.io. Hi, Casey, thank you for joining us.

Casey Rosenthal: It's my pleasure. Happy to be here.

Yoz Grahame: Thank you. And sorry we started a bit late today because I was trying ... I suddenly remembered that I had to get my waistcoat on. Somehow I was without my waistcoat. And in times like these, I think it's all important to keep the semblance or mirage of appearances.

But so, Casey, you are the man who literally wrote the book on chaos engineering or co-wrote the book, which is recently out from O'Reilly.

Casey Rosenthal: Yup. Cowrote it with Nora Jones. And we had a bunch of contributing authors, which was great. We have perspectives from companies like Google, Slack, Microsoft, LinkedIn, Capital One, stuff like that.

Yoz Grahame: Cool. I had a brief skim through it before and there's some fascinating stuff, which I believe you're about to go over now. So, let's take it through to the slides.

And there will be questions and answers from the audience afterwards. So, if you're listening, please join in the chat. And pose any questions. Casey will be talking for about 20 minutes or so. And then we will open it up. So, take it away, Casey.

Casey Rosenthal: Great. And yes, let's make sure there are at least questions and hopefully answers. Just to verify, you can see my title slide here?

Yoz Grahame: Yeah.

Casey Rosenthal: Okay, perfect. So, I'm going to talk about a couple different mental models and how those relate back to doing things in production and some of the evolution that we're seeing in, for lack of a better word, DevOps and best practices across the industry.

So, I work at a company called Verica, not going to talk about that though. I'm going to try something a little bit different. I'm going to start by presenting a couple of myths about reliable systems, about what makes systems reliable. And this is partly to stir up controversy, but also I just want to see how people react to me starting off with these instead of leading people into them.

So, one of the myths of reliable systems is that you can make a system more reliable if you remove the people who cause accidents. So, this assumes that there are people who are accident prone or less careful. It's sometimes referred to as the bad apples management principle. And the data just overwhelmingly shows that this is not the case. Unfortunately, the medical system in the US has occasionally tried things like this.

For example, something like 80% of the malpractice lawsuits are associated with like 5% of the doctors in the US, which on the face of it sounds like, oh, well, clearly those are bad doctors, get rid of them and the malpractice cases will go down. Unfortunately, it's just the opposite: doctors who are particularly good at handling high-risk cases tend to congregate in that area.

And so, that's why they're associated with the higher numbers of malpractice suits. So, if you get rid of them, you're getting rid of the people who actually have the most expertise to handle those kinds of situations. So, the same applies for your system. If you have a team or a person who seems to be coincidentally associated with reliability issues, getting rid of that person is not likely to make the system more reliable.

Second myth is that documenting best practices and runbooks will make a system more reliable. Not true. I'm not saying that you shouldn't document things, particularly if that's an effective way to communicate within your organization. However, most critical outages are unique. So, there are no best practices for that. I mean, let's be honest, there are no best practices in industry. There are just common industry practices. We have no idea what would constitute a best practice in a complex system.

And runbooks again, while they can be an effective way for somebody to codify their own personal knowledge or communicate that, in order to really have a reliable system, in order to move the needle on reliability, you need people with experience. Runbooks don't suffice. Humans can't download experience to each other in that way. Particularly not the kind of experience that is required for improvisation and adaptation.

Third myth, if you identify prior root causes and defend against them, you'll make your system more reliable. This is particularly appealing. Again, all these myths are intuitive, right? That's why they're easy to believe. But there's a couple things wrong with this. One is if you just identify vulnerabilities and defend against those in a complex system, again, you're not necessarily preventing the things that generate new unique vulnerabilities.

The other is that root cause doesn't exist in a complex system. And in the interest of being mildly controversial, I'll go a step further and say, if you are doing root cause analysis, at best, you're wasting your time. Because root cause doesn't exist in complex systems, and the practice of root cause analysis is a bureaucratic exercise to assign blame in a particular narrative that doesn't actually benefit the system or make it more reliable. But I'd love to dive more into that in the Q&A part of this meetup.

The fourth is that you can enforce procedures. This assumes that somebody knows better and that the people doing the work actually just aren't following the rules. That's almost never the case in a system that needs improvement for reliability, particularly in a complex system. Usually, that's more indicative of a mismatch between how work is imagined by people higher up in the hierarchy and how work is actually done on the frontlines.

Myth five is that if you avoid risk, you'll make the system safer. This seems very intuitive. Don't do the risky things. We hear this in the enterprise a lot now, the phrase guardrails. If you just put guardrails in and avoid risk, then you make the system safer. So, guardrails do two things that are counterproductive to making a system more reliable. The first is that they prevent people from gaining the experience necessary to adapt and improvise and innovate to make a system more reliable or to remediate an incident as it's happening.

And the other thing that guardrails do is they tend to become obstacles for the people best positioned to remediate an incident while it's happening. So, again, as intuitive as this sounds, it turns out to be counterproductive in practice.

Sixth myth is that if you simplify your system, it'll become more highly available or safer. With complex systems, the problem is the complexity, so just take the complexity out of it and you'll have a better system. Again, a couple of things are wrong with this. One is that empirically, we know that more complex systems are safer or more available than simpler systems. So, the data just backs that up. But complexity also tracks with business success. So, your customers are paying you for complexity. One way to view your job as a software engineer is that you're adding complexity to a product. So, if you remove complexity, you're actually lowering the ceiling for how successful your products can be.

Another way to look at this is that accidental complexity is always going to grow as a process of doing work in software, and essential complexity, that's complexity put in there on purpose, that's feature development and stuff like that, is also going to grow. So, if both forms of complexity are always going to grow, we don't have a way as an industry of sustainably reducing complexity. So, that is not going to make your system more reliable.

And the last myth that I'll mention here is that redundancy equals better reliability. And this is really interesting. There's a lot more research to be done here. Because the relationship between redundancy and availability in particular isn't as well understood as we would like. But we can certainly point to many examples where redundancy was a contributing factor to a failure, to an incident. This is true in software, but famously, it's true in hardware, in aviation, in rockets, the Challenger incident, where redundancy is, again, actually a contributing factor to an outage or an incident or some sort of system failure. So, at best redundancy is orthogonal to reliability.

So, I'm going to start by presenting those and hopefully, there's something there for everybody to disagree with. But now, I'll present a couple of mental models, frameworks to think about things. And we'll tie it in with, well, with feature flags, but also with continuous verification and some other stuff.

So, here's one model that provides a framework to think about how engineers make the design decisions that they make every day that we don't necessarily talk about. So, in this model, engineers are basically in the middle between these three poles. The poles are economics, workload and safety. And you can imagine the engineers attached by rubber bands to each of these three poles. And if they stray too far from one of these poles, the rubber band snaps and it's game over.

So, implicitly, engineers know that they have an intuition of how far they are from the economics pole. You probably do not have a rule at your company that new engineers are not allowed to spin up a million instances of AWS EC2. Why don't you have that kind of rule? Well, it's kind of common sense. Engineers implicitly understand that things cost money, they cost money, their team costs money, and they know they're not supposed to spend more money than they have. So, they try to balance that out as they make decisions throughout their day.

The same is true of workload. Engineers implicitly understand that their servers can handle some amount of work. They as people can only type so fast, can only create a certain number of features per day, or have a certain amount of feature velocity, so they intuitively understand the relationship to that pole and they make sure that they don't drift too far from it or they know their servers will be overloaded, the resource will be overloaded or they'll burn out.

The same is true with safety. The difference is they have no intuition as to how close they are to that safety pole. And I'm comfortable making that generalization across the industry because we still have incidents. The reason we have outages and security breaches and things like that is because there are surprises. If we knew what was about to come as an engineer, we would stop what we're doing and change what we're doing to make sure that that doesn't happen. We would implicitly change our behavior.

So, this is one of the models, one of the frameworks that provides a foundation for understanding chaos engineering. And we have the definition here at principlesofchaos.org. This is my favorite definition. The facilitation of experiments to uncover systemic weaknesses. So, what chaos engineering does is it teaches us about that safety margin. It helps us generate an intuition of how far we are from that safety pole. And just knowing that implicitly changes the decisions we make as engineers, which is quite powerful.

The other thing chaos engineering does here, here are some of the advanced principles, you can see right there in the middle, one of the advanced principles is run experiments in production. And we can talk about this, I hope there are questions about this. But this isn't to say that running experiments, chaos engineering experiments to uncover systemic weaknesses isn't also valuable in staging or nonproduction environments. It is. I provide plenty of examples of where learning there has been really useful.

But ultimately, in a complex system, you want to study the actual system you care about. And so, the gold standard, the highest bar for chaos engineering is running those experiments in the actual production system. So, that's one model.
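To make the idea of an experiment that uncovers systemic weaknesses concrete, here is a minimal sketch that is not from the talk. It assumes a toy `Service` class with a replica count, a steady-state metric (request success rate), and a hypothesis that losing one replica does not push that metric below a threshold. All names and numbers are hypothetical; a real experiment would target the actual system and its real metrics.

```python
from dataclasses import dataclass


@dataclass
class Service:
    """Toy stand-in for a real service backed by several replicas."""
    healthy: int = 3

    def handle_request(self) -> bool:
        # A request succeeds as long as at least one replica is healthy.
        return self.healthy > 0


def success_rate(service: Service, requests: int = 1000) -> float:
    """Steady-state metric: fraction of requests that succeed."""
    ok = sum(service.handle_request() for _ in range(requests))
    return ok / requests


def run_experiment(service: Service, threshold: float = 0.99) -> dict:
    """Hypothesis: losing one replica keeps the success rate at or above threshold."""
    baseline = success_rate(service)
    service.healthy -= 1  # inject the failure: take one replica away
    try:
        observed = success_rate(service)
    finally:
        service.healthy += 1  # always restore the system afterwards
    return {
        "baseline": baseline,
        "observed": observed,
        "hypothesis_held": observed >= threshold,
    }
```

Running `run_experiment(Service(healthy=3))` leaves the hypothesis intact, while `run_experiment(Service(healthy=1))` disproves it. The second result is the useful one: it tells the engineers something about the safety margin that they did not know before.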

I'll introduce another one here called the economic pillars of complexity. And again, this is just to frame how we think about that evolution of technology. So, in this model, there are four pillars: states, relationships, the environment and reversibility. And the quintessential example of this model is Ford in the 1910s, where they were able to control the number of states in their manufacturing process by saying you can have whatever kind of car you want as long as it's a black Model T. And so, their ability to control one of these pillars helped them navigate the complexity of their manufacturing or production service.

They were able to also control relationships. Other car companies had teams of people who built a car from the ground up. Ford moved to assembly-line manufacturing and scientific management and Taylorism to basically limit the relationships between the parts but also between the people. That helped them navigate the complexity of their marketplace.

They also controlled their environment or affected their environment by first breaking up the automotive trust in the US, which allowed them room to do business. And ultimately, they became somewhat of a monopoly themselves. So, they were able to control their environment that way.

Which brings us to the fourth one, reversibility. And here's where Ford couldn't really do anything because while you can put a car in reverse, you can't really reverse the manufacturing process. Reversibility was not something that Ford could optimize for.

So, how does this model apply to software? Well, states, and we don't just mean data, but the functionality of applications, are usually increasing. Usually you're paid to add features as a software engineer, so there's not much you can really do to control that pillar.

Relationships: unfortunately, we're in the business of adding layers of abstraction. So, whether we want to or not, the relationships between the moving parts are growing. Now more of us are working remotely, which adds a different dimension of complication to human relationships, and explicit architectural changes, things like microservices, let us decouple the delivery cycle of features. There's not much we can do in software to control the relationships; that's expanding.

The environment: most of us are not monopolies, so we can't affect our environment too much. Amazon, maybe. Which brings us to the last one, reversibility. Here's where software shines. And we can see, obviously, it's software not hardware, but we can see this to great effect first in the move from waterfall to XP and agile, where waterfall said, "We'll plan the whole thing, we'll build it, we'll deliver it to the customer." After a year, the customer says, "No, that's not what I wanted." We'll say, "Tough," and then keep moving forward.

And so, XP said, "We'll just put something in front of the customer after a week or whatever." And the customer says, "No, that's not what I wanted." Okay, it's easy to reverse that decision. We can throw away a week easy enough. And then the second week, we'll put something else in front and they say, "Kind of," and hopefully by the third week, we're tracking with what the customer actually wants or needs.

That was a sophisticated implementation of reversibility as a way to navigate a new complex manufacturing production process. But we can explicitly make architectural decisions that improve reversibility as well. And there are few better examples of this than feature flags.

So, feature flags help us navigate a complex system of manufacturing, producing software, by explicitly giving us this architectural reversibility decision, and that's important. Other things add to that: source control, automated canaries, blue-green deployments, chaos engineering as part of the feedback loop, observability as part of the feedback loop. These things all improve reversibility, and that pays dividends for navigating complex systems.
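As an illustration of reversibility as an architectural decision, here is a minimal sketch that is not from the talk. It assumes a hypothetical in-memory `FlagStore` and a made-up flag name, `new-pricing`; it is not the API of LaunchDarkly or of any real flag service. The point is only that the new code path ships dark and can be switched on or off at runtime without a redeploy.

```python
class FlagStore:
    """Hypothetical in-memory flag store; a real system would query a flag service."""

    def __init__(self) -> None:
        self._flags: dict[str, bool] = {}

    def set(self, name: str, enabled: bool) -> None:
        self._flags[name] = enabled

    def is_enabled(self, name: str, default: bool = False) -> bool:
        # Unknown flags fall back to the default, so the old path stays active.
        return self._flags.get(name, default)


def checkout_total(prices: list[float], flags: FlagStore) -> float:
    """Compute an order total, with the new pricing rule guarded by a flag."""
    if flags.is_enabled("new-pricing"):
        # New code path: apply a 10% discount.
        return round(sum(prices) * 0.9, 2)
    # Old code path: plain sum.
    return round(sum(prices), 2)


flags = FlagStore()
before = checkout_total([10.0, 5.0], flags)    # old path
flags.set("new-pricing", True)
during = checkout_total([10.0, 5.0], flags)    # new path, enabled at runtime
flags.set("new-pricing", False)
after = checkout_total([10.0, 5.0], flags)     # decision reversed, old path again
```

Both code paths stay deployed side by side, so reversing the decision is a data change rather than a rollback of the release.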

The last thing I'll mention here with respect to software and mental models, and sorry for the block of text here, is: "The chief merit of software engineering is its technical efficiency, with a premium placed on precision, speed, expert control, continuity, discretion and optimal returns on input." A mouthful, but a pretty good definition of software engineering. So, they accidentally did this. The actual quote is, that's the chief merit of bureaucracy.

And I just want to make that point. Software engineering is in some circles known as the bureaucratic profession. And that sounds like a negative thing. And it probably is. So, why would somebody make that claim? Well, no other industry has done a better job of separating who decides what needs to be done from who decides how it needs to be done, from the people doing the actual work. That is an idealized bureaucracy.

So, chief architects, product managers, project managers, people managers, tech leads, all of those roles exist to take responsibility, decision-making responsibility, away from the people doing the actual work. And this all goes back to scientific management, Taylorism, from the 1910s, and we can discuss historically why that's the case. But we should understand that this is the wrong model for software engineering if you consider software engineering to be knowledge work.

It's the right model, or it's a model that's acceptable, for manufacturing widgets. Most of us are not manufacturing widgets. So, if you consider software engineering to be knowledge work, then that hierarchy, that bureaucracy, is a counterproductive model. So, all of this is to say we're not trying to fight complexity, we're trying to navigate it.

Very quickly, this is part of an evolution that we're seeing from CI, continuous integration, which limited the number of bugs that we can create by merging code faster so that the bugs become more apparent and don't geometrically lead to larger numbers of bugs. And then we saw that, okay, the engineers can create features faster. That evolved into continuous delivery, CD. Okay, now we can make software faster. We need to get it to production faster and roll forward faster, change our minds. So, there's an important part of reversibility that's enabled by this evolution.

And now, we're seeing the industry start to develop CV, continuous verification, which says, "Okay, if under the hood, we have all of this complex software creation moving very quickly, how do we keep our eyes on the road? How do we make sure that we get the behaviors out of a complex system that the business requires given that we're enabling very fast movement of feature velocity and getting that to production? Basically, how do we move fast and not break things?"

And here's another way of envisioning that. Continuous verification, I'm just reading across the top here, is a proactive experimentation tool that verifies system behaviors, whereas most of what the industry is currently familiar with follows more along the line on the bottom there. It's reactive testing methodologies that validate known properties.

And those things along the bottom are good. They're useful. We're not saying don't do those; we're certainly not advocating that you ditch testing. But those practices and disciplines were born in a time of simpler systems and they're more attuned to simpler systems. In complex systems, you have to move up the scales to ensure the business properties that you want out of your complex systems.

So, tools don't create reliability, humans do, but tools can really help. We've got this book out. And there's a way you can get a free book if you go to this link. I would love to take questions now.

Yoz Grahame: So, can you hear me right, Casey?

Casey Rosenthal: You're good.

Yoz Grahame: I'm going to leave that slide on the screen for a while, just so I can copy the URL. And Kim, our producer, and I have been absolutely squealing with joy in the background while you've been talking, because this talk has hit a whole bunch of buttons, a load of stuff that we have been talking about for a while.

I tweeted your quote about software engineering, separating those who decide what needs to be done from those who decide how to do it. Thank you so much for this. And there's so much to go into here. I mean, if you are watching, please give us your questions, your reactions, because especially because there's ... Actually, we will put the URL. Kim, if you could put the URL there in the comments so that other people can get to it while I switch the view here.

You warned us at the start that there were things that you would be going into that run quite counter to current practice. Thank you, Kim. And hang on a second. Let me get my camera back. So, there we go. Well, I should preface by saying we have been testing in production here with this stream and doing the chaos thing in the truest way, which is to be looking calm on the surface while paddling like hell underneath. For three minutes during your talk, I completely lost connectivity. So, I'm afraid there may be some things I ask about that you already discussed. And please tell me if that happens because I don't want to bore anybody.

But there's so much to go into here. Given some of the things that you talked about in your myths, I feel somewhat attacked. I always thought documenting best practices and runbooks was vital. So, if we could just go into that point a bit more, about the value of personal experience and how you see that playing out in the organizations you've worked with.

Casey Rosenthal: Yeah, sure. So, let's first talk about the definition of resilience. So, outside of software, there's this field of resilience engineering and the most effective I think definition of resilience is having an adaptive capacity to handle incidents. So, that adaptive capacity, it requires a level of improvisation, right? It requires humans who have skills and context to know how to improvise.

Yoz Grahame: Right.

Casey Rosenthal: A runbook simply doesn't have that. It simply can't have that. So, even if you have something very well documented, what you can't do in that runbook is provide enough context or skill or experience around that to let the person following the runbook know that it's even the right runbook that they should be following. Right?

If they have to consult the runbook, then it's basically a guess on their part. And if it's a guess on their part, then you've basically capped the reliability at which your system can operate. So, you can't move the needle on reliability by investing more in runbooks, is the takeaway there.

Now, documentation, if that's a way that you like to communicate, that's great. You should obviously get better. Communication is the hardest thing in most professions. So, you should obviously take advantage of the ways that you can do that well and invest and do that better, but that's not going to improve reliability.

Yoz Grahame: Right. I think it's fascinating because one of the trendiest books in the past 20 years around this is Atul Gawande's Checklist Manifesto. And I thought it was a comment that was the actual title of the book. But something that came from the medical profession is this idea of rigorously adhering to checklists in crisis situations. Is that the same thing as runbooks? Or is there a difference that I'm missing there?

Casey Rosenthal: I think there's a difference. So, using a tool that you have experience with to help you get the job done, fine, great, absolutely. But that's different from a runbook that's trying to supplant, to be sufficient to take the place of, human improvisation or adaptation. Does that make sense?

Yoz Grahame: Yes.

Casey Rosenthal: So, a runbook is a remediation tactic. A checklist should not be used as a remediation tactic. That's a tool to help you basically bring your experience forward.

Yoz Grahame: So, maybe it's about diagnosis so that you can fully establish what assumptions you're making or give you the full set of facts before you start trying to remediate.

Casey Rosenthal: A lot of times, it's that checklists have nothing to do with remediation. So, a pilot going through a checklist before they take off, okay, making sure that you don't forget something? Sure. You're relying on a tool of the trade to help you bring your own experience forward.

Yoz Grahame: Yeah.

Casey Rosenthal: But again, much different from a runbook where-

Yoz Grahame: It's not remediation.

Casey Rosenthal: Yeah.

Yoz Grahame: Right, yeah, yeah, yeah. And that's also fascinating given that it plays into some of the old sayings and terrible jokes that we got from the '60s to the '80s about automation. Like the one about the nuclear reactor that has a dog and a man: the man is there to be at the controls and the dog is there to bite the man in case he tries to touch any of the controls.

It's the idea that the automated system is the reliable part and the humans are the unreliable organic part, but it's the organic nature and the ability to improvise that actually brings us the reliability.

Casey Rosenthal: Brings the resilience, yeah.

Yoz Grahame: Resilience, yes.

Casey Rosenthal: You can have a robust system with a lot of automation. But by definition, you can't make a system more resilient with automation. Because so far, at least we don't have the capability to automate improvisation.

Yoz Grahame: Right, and so, I've seen recently more work towards supposedly self-healing systems. I know that some companies, especially those in the performance or alerting and monitoring space, are doing this self-healing thing, especially those trying to use AI to spot outliers or try and make connections between cause and effect and things like that. Have you seen any success there or anything there that you think is worthwhile?

Casey Rosenthal: Sure. Again, you can raise the floor of robustness by implementing ... And really, self-healing is like calling something AI. What we call self-healing today will tomorrow just be an algorithm. So, if you look at things like the bulkhead pattern, circuit breakers or failure conditions, those were self-healing 10 or 20 years ago. So, having a system that detects something out of range or even alerting is a primitive form of self-healing, right?

Yoz Grahame: Right.

Casey Rosenthal: It just doesn't go all the way. It just brings a human in instead of something else. So, that's a perfectly valid algorithm for making a system more robust. But again, you're lacking that improvisation and context picture. So, putting the newer buzzwordy sound around it of self-healing doesn't make your system more resilient because it's not bringing in more intelligence than an engineer can think of ahead of time.

And what we know about complex systems is that the defining quality of a complex system versus a simple one is that a complex system can't fit in one person's head.

Yoz Grahame: Right.

Casey Rosenthal: So, it's not reasonable to assume that a software engineer can think of all of the things ahead of time that can go wrong with the system.

Yoz Grahame: Right.

Casey Rosenthal: And if that's the case, then your self-healing algorithms can be really clever, really elaborate. But by definition, they will not cover all of the things that they need to cover because they rely on a human to think of those conditions ahead of time. And we've already admitted that that can't happen.
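The circuit breaker mentioned above is a good example of automation that raises the floor of robustness while only covering conditions someone thought of in advance. The sketch below is not from the talk and makes its own assumptions: the class name, the `max_failures` and `reset_after` parameters, and the simplified half-open behavior are all illustrative, not the design of any particular library.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without being tried."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds before a trial call
        self._clock = clock               # injectable so the cool-down can be tested
        self._failures = 0
        self._opened_at = None            # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; call rejected")
            # Cool-down elapsed: allow one trial call (half-open).
            # A single further failure reopens the circuit.
            self._opened_at = None
            self._failures = self.max_failures - 1
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = self._clock()
            raise
        self._failures = 0  # any success resets the failure count
        return result
```

Every condition this breaker reacts to (a failure count, a fixed cool-down) was chosen ahead of time by whoever wrote it. It makes the system more robust against that anticipated failure mode, and it has no way to improvise when something else goes wrong.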

Yoz Grahame: Right.

Casey Rosenthal: So, chaos engineering in contrast helps the humans get the experience that they need to adapt better. And the output of that might lead to better automation. It might lead to self-healing algorithms or failure control algorithms that make the system more robust. But ultimately, it's informing the humans of things about the system that they didn't previously know. And that's what allows them to implicitly change their behavior to make the system more reliable.

Yoz Grahame: That paints the picture there. I'm still somewhat new to some of this, but the impression I get from what you're saying is that effectively robustness and resilience are two different levels. And that resilience is a higher level of decision making that comes into play once the automated systems have failed, which they will do.

Casey Rosenthal: Yeah. I think that's a valid way to look at it. I view them as just different properties, like your system is robust in different ways up to a different point. And there's always some boundary where it's not robust past that point. The simplest thing we could think of is like you load test the system, eventually at some point, it's going to fail, any physical system.

Whereas resilience doesn't know if something's going to fail. There's no proof to say that, well, we could do this instead. There's always the option. You just can't enumerate all the options for doing something else instead.

Yoz Grahame: Right, right. It definitely fits the pattern that we see talking about, that the level of complexity is increasing beyond our ability to manage it. And this is something that we do naturally, that we've been doing for decades: this arms race between complexity and the tools to manage it, and complexity always seems to win.

Casey Rosenthal: Well, I don't know if it's winning. But I mean, I'll plug it for you. One of the reasons I'm a fan of LaunchDarkly is because feature flags are ... It's an explicit design decision that you can make in order to improve your ability to navigate a complex system.

If you think about it, in software, we pretend like complex systems are the enemy. Whereas in the rest of our life, I mean, software is probably the simplest system that we as humans still deal with. If you think about the complexity of human interactions or even driving a car, and mentally modeling what other drivers are doing, I mean, this is why autonomous driving isn't successful yet. They can't mentally model another human and their intentions in the other car. It's not that they can't figure out the physics. That's the easy part.

So, we as people, even as software engineers, we handle complex systems all the time. That's our baseline of just living. But we don't like to see it in our work for some reason. So, really complex systems are not the enemy. That is what allows us to be successful. But we need things like feature flags to give us the comfort with the complex systems that we're building. We need to know I can make this design decision and then change my mind really quickly if I need to. And so, that's why I see a lot of power in what you all are doing.

Yoz Grahame: Yeah, thank you for that. Talking about reversibility in that way, the thing I really love is talking about reversibility on the different timescales. You've got the project timescale, which is like agile versus waterfall and being able to change direction within a single iteration, within a couple of weeks. And then you've got all the way down to reversibility at much smaller timescales to remediate quickly.

Casey Rosenthal: If you hired a consultant to improve your feature velocity, right, they would focus at the process level. And that's great. You could, I don't know, implement agile or scrum or whatever. But really what we want to do is make those architectural decisions that are going to improve feature velocity. And it's more comfortable. It's more powerful. It's more measurable. It's like everything about that is nicer, right?

Yoz Grahame: Right. Right. Yeah, exactly. It's like we've got the tool. The process gains are theoretical. They're much harder to grasp. When we see the tools right in front of us and include them in our daily work, being able to flip a flag and see something change within milliseconds is that wonderful tactile reassurance.

Casey Rosenthal: Yeah.

Yoz Grahame: So, we've already got some great questions coming in. We have one from D Parzych, my colleague Dawn, asking, what tips do you have for junior or new employees on where to start with ... Well, this is actually not about chaos engineering, but incident resolution. So, I know you often get asked about how to start with chaos engineering. But let's take a step back and say, okay, based on what you've learned, especially with the myths, where do you think the most value is in establishing incident resolution practices?

Casey Rosenthal: Okay, so the place I would start is by studying what the resilience engineering community has to say about incident postmortems or learning reviews. I prefer learning review to postmortem. So, a great place to start would be John Allspaw's material on Blameless PostMortems. And that goes back a ways. I think he's got a lot of great material now that kind of moves on to say that just a blameless postmortem is not sufficient.

Really, we want to acknowledge that it's not just that we're avoiding blame, it's that an incident itself is faultless. There are no root causes in complex systems. So, looking for a root cause or even acknowledging that, oh, this was the problem but nobody's to blame for it, like that's also not a great approach.

And so, looking at faultless learning reviews is a much more powerful way to think about incident response management when you think of incidents as teachable moments. And then in terms of just time to remediate, there's plenty of models out there. I don't have a personal favorite. But there are plenty of models that come from things outside of software that work better for different organizations.

So, on the one hand, you have models where there's an incident commander and they coordinate and you develop a group that way. On the other hand, there's research showing that the costs of coordination in an incident response, of having an incident coordinator, can sometimes be higher than just having the people who normally do the work, the knowledge workers, have the tools and capability to make the decisions they need, being empowered to remediate the incident themselves.

So, there's a lot of interesting research going on there. But I don't have a particular recommendation on incident remediation. As for incident response management and how you understand that, I think there's a lot of conclusive evidence around how to properly run, or at least there's a lot of evidence on how to poorly run, a postmortem or a root cause analysis session. And those all point toward better processes that won't be wasting your time. I think that was a roundabout way of answering that question, but hopefully, there's something in there.

Yoz Grahame: No, that's great. That's great. I think definitely the blameless postmortems, fortunately, that's something that I've seen taken up pretty well. And people like Allspaw are really talking about how the traditional types of root cause analysis don't really apply here.

Casey Rosenthal: Yeah, his company, Adaptive Capacity Labs, has a bunch of great research on metrics you can use to kind of understand your own plate. Like if people want to read the briefings after a learning review, that's a good indication that you're doing that well. Yeah.

Yoz Grahame: Yes. Yes. Excellent. We have another question, [inaudible 00:42:25] 88 about feature flags, what are good ways to manage dependencies when you start to get more than a few. And I suppose this ties in to ... So, this is more technical and this is obviously something that we at LaunchDarkly have a whole bunch of opinions about.

And I think I might actually expand this out. If you have specific opinions on that, I'd be very interested to hear them. But I think it also ties in to the bigger question about relationships, as a pillar of complexity in that more specifically dependencies. Right. And this is something I've always wondered about is having been at companies that dealt with massive legacy systems and realizing the ludicrous numbers of dependencies, not just, for example, software dependencies, but organizational and event dependencies, all kinds of bizarre things that felt impossible to track.

So, for these kinds of things, how do you advise people to manage all these kinds of dependencies once they start to expect?

Casey Rosenthal: Yeah, so I am not an expert in dependency management. I've certainly seen people bitten by the bug of having too many unmanaged feature flags or, at Netflix, they call it [crosstalk 00:43:57] properties. I think there's a bunch of interesting approaches to that. But I'm certainly not an expert in dependency management.

I would say that we should expect that as the layers of abstraction increase beneath us, navigating that as a problem of complexity, rather than solving it, is probably more useful.

Yoz Grahame: Right.

Casey Rosenthal: Or it's probably a better use of our time.

Yoz Grahame: And I see what you're saying in terms of, though it may cloud things, it may add to the number of relationships, they're there for a reason.

Casey Rosenthal: Yeah. And sometimes the reason might not even be apparent to you or might not even be valuable to your particular goals in a pursuit or an organization or a business. For example, at Verica, we did a lot of research in 2019. And one of the things that we saw across the enterprise market is, let's say, some frustration with Kubernetes as this additional layer in the deployment process. For many people, that's not a virtue. Just maintaining Kubernetes as a platform, let alone figuring out new ways to integrate with it, is additional work that doesn't advance their particular goals in a smaller context.

So, there's a similarity there to the explosion of dependencies, where if you're focused on the developer experience and dependency management from a language or library or security point of view, of course, you want to limit the number of dependencies. But if you're an engineer who's got different pressures and you're trying to increase the rate at which you're delivering features, then you've got the opposite pressure. Like, let me bring in as many dependencies as possible because I want to take other people's work and make use of that.

So, depending on where you are, you might have a different perspective on the proliferation of dependencies at all. I'm in the camp of inbox zero is a nice idea. But at some point, you reach a rate at which important emails are coming in, where you just have to let it go. And I know right now that approach for dependency management would make a lot of people scream in frustration, and I understand that, but I'm not an expert in that field. So, I should stop.

Yoz Grahame: Well, no, I think it's a fascinating guiding philosophy here, which I think applies to so many people who deal well in complex situations, which is go with the flow. You aren't going to fight it. It's there for a reason. The best thing you can do is get better at navigating it, get better at diagnosing.

This is dear to my heart because I give talks about debugging. And what I find is that too many people don't understand how powerful and easy to use their own debugger is. And also, how valuable it is to build custom tools for your situation. I have rarely been in any situation where a custom tool that was created for a system did not pay for itself within a couple of months.

And so, that kind of evolution of-

Casey Rosenthal: So, that's a mechanism for you to ... Yeah, okay. So, that's a mechanism for you to navigate a complex system better as an engineer. You create a custom tool. Yeah, okay, that makes sense. Yeah.

Yoz Grahame: So, a lot of people are going to be wondering specifically about the chaos engineering aspect. And I think we talked just before this and people want to go, yeah, I want to start blowing stuff up automatically. How can I start blowing stuff up automatically? Or more specifically, how can I start applying chaos engineering in my organization? And how can I persuade management to go along with it? And I suppose this depends on how you portray it. But what's the advice you give in these situations?

Casey Rosenthal: Yeah, I mean, first of all, I would say, I hear there's a really good book out on this that just hit the shelves.

Yoz Grahame: Well, let's bring that slide up again, there we go.

Casey Rosenthal: Yeah, the general pattern that we see for companies adopting, and it's a widely adopted practice now, but the general pattern we see is start with a game day. There's plenty of ways to do that. But basically, what we want to do is we want to bring people with expertise in a room together to start discussing the repercussions or the system behavior outside of the normal happy path operations. So, just the discussion alone can often help uncover knowledge gaps about how the system actually behaves.

So, game day, you get in a room you go, okay, this piece of infrastructure over here, we're going to take it down for a few hours and our assumption is that customers won't be impacted. So, the hypothesis for an availability chaos experiment almost always follows the form of under X conditions, customers still have a good time.

If your hypothesis is under X conditions, they'll have a bad time, then don't run that experiment. That's just a terrible experiment.
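The hypothesis form described here, "under X conditions, customers still have a good time," can be written down as a small check. The following Python sketch is only an illustration under stated assumptions: the `recommendations` service, its cache dependency, and the 1% tolerance are all hypothetical, and a real experiment would measure a live steady-state metric rather than a simulated one.

```python
def recommendations(cache_up: bool) -> list:
    """Hypothetical service that falls back to a static list when its cache is down."""
    if cache_up:
        return ["personalized-1", "personalized-2"]
    # Degraded but still useful: customers get popular items instead.
    return ["popular-1", "popular-2"]


def success_rate(cache_up: bool, requests: int = 1000) -> float:
    """Steady-state metric: fraction of requests that return a non-empty page."""
    ok = sum(1 for _ in range(requests) if recommendations(cache_up))
    return ok / requests


def run_experiment(tolerance: float = 0.01) -> bool:
    """Hypothesis: with the cache down (the X condition), customers still have a good time."""
    control = success_rate(cache_up=True)
    experiment = success_rate(cache_up=False)
    # The hypothesis holds if the metric does not drop by more than the tolerance.
    return control - experiment <= tolerance


hypothesis_held = run_experiment()
```

If the team already expected `hypothesis_held` to come back false, the advice above applies: fix the fallback first rather than run the experiment.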

Yoz Grahame: Right.

Casey Rosenthal: So, some of the impression of chaos engineering is, ah, that's bold, or that's too risky. No, it should not be a risky thing, right?

Yoz Grahame: Mm-hmm (affirmative).

Casey Rosenthal: Like, if you think something's going to break, obviously, don't do that. Fix it before you do anything like that.

Yoz Grahame: Right.

Casey Rosenthal: Or another way of saying that is chaos engineering is not about causing chaos in your system. There's an assumption that the chaos is already in the system and you're trying to surface that so that you can better avoid the pitfalls. So, that's it. That's a great way to get started. Plan a game day or war room scenario where you kind of plan it out, and then dip your toe in the waters that way.

It is the gold standard to do chaos engineering in production. But certainly, we recommend starting in a staging environment if you have one. We see a lot of organizations are moving to not having staging environments and that's understandable. And feature flags actually make that a little bit easier. That's fine. But if you do have staging environments, or hidden features, or different user groups associated with different features, start with experiments there. There's no reason not to.

And hopefully, you can learn from the staging environment, either about your system in ways that help you make it more robust, or you build up your skill set for resilience, or you build more confidence in the system, and you can use that confidence to then move the same tools, the same experiments into production.

Yoz Grahame: Right.

Casey Rosenthal: That's the general evolution we see.

Yoz Grahame: That's a great way of putting it, and that ties very well into the whole topic of this regular stream, Test In Production, which is that ... At LaunchDarkly, we have several staging environments. We have a lot of environments. We have a proper staging environment. We did not just ditch it and do everything in production because, as I say, if you can obviously reduce risk to an experiment, then you should. The reason to test in production is because you already are. You just don't call it that yet.

So, we're going to have to round this off in a couple of minutes, I'm afraid. But this has been absolutely fascinating. Thank you so much. Before you go, I'm interested to know. So, I'm completely silly. So, there is a thing that ties in to the point you made about the problem with the bad apple thing, that actually the people who are causing those problems are not doing it because they're bad. It's because they're the ones who are being sent in to do the-

Casey Rosenthal: Yeah, closest to most problems.

Yoz Grahame: Right, exactly.

Casey Rosenthal: Yeah, we won't say causing because that implies that they are the root cause.

Yoz Grahame: Right, right. Sorry. You're absolutely right. The bystanders, the people who happened to be nearby at the time, the chief suspects maybe.

And so, I used to work at Linden Lab, where I did most of my learning about legacy systems, complex legacy systems. And if an actual mistake that you made blew up production, then you got to wear the [inaudible 00:54:07] for a day. And similarly, I've seen other organizations do this. A colleague of mine, [inaudible 00:54:16], who is now at LinkedIn, did a great writeup saying that it's actually a badge of honor. It's that you went in to try and do something important and fix something important, and you learned. I know at Etsy, they have the three-arms jumper. What other good ones have you seen?

Casey Rosenthal: At Netflix, they give out badges for having taken down production. I know Google had rituals for this. I understand that and I get it. I wouldn't say that I entirely support that because it does still reinforce this notion that one person was a cause. And so, yeah, it's basically coincidental. Sabotage is extremely rare.

And so, if you didn't intend the outcome that was bad, there's no causal relationship in a complex system. So, that's an adjustment in thinking about how an incident happens that's different, like oh, this person published the line of code that brought down production. Well, what you're actually saying is that the least effort to remediate that is to undo that line of code.

But there's alternative narratives to that. It's the fault of the person who wrote the CD tool for not having an automatically stage or it's the fault of their managers for not communicating better or it's the fault of their director for not properly budgeting for another environment or it's the fault of the VP for not allocating and explaining to the board why they need that budget and not choosing a different CD tool, or it's the CTO's fault for not properly explaining the importance and aligning the company right.

So, now, which of those is more likely to lead to a more reliable solution? Putting the three-arm sweater on the person who put that line of code in production or changing the behavior of the CTO, right?

Changing the behavior of the CTO is much more likely to make that system more reliable. And yet all of the RCA-type stuff and the focus on the one person, that's not going to change the reliability of the system at all, right?

Yoz Grahame: No.

Casey Rosenthal: The three-arm sweater makes them more comfortable so that they don't carry around the shame and that's important, but it still locks you into that mentality of like, they're the ones who did it, not the CTO.

Yoz Grahame: So, that's interesting, because this points out that there's a big difference between root cause analysis and ... the way I would have got to the CTO through that is the whole five whys process, saying why did that happen? Why did that happen? Why did that happen, et cetera.

Casey Rosenthal: Yeah, the problem with five whys, Richard Cook likes to say, I've got a six-year-old who's great at asking why, we could just put them in charge of incident remediation, right?

Yoz Grahame: Right.

Casey Rosenthal: So, the five whys is entirely arbitrary. And the problem with that is, depending on who facilitates it, they're going to create a narrative that suits their purpose. So, if I'm the director in that example, I'm going to drive the five whys into ... And it's not that I'm malicious or doing it on purpose, but I'm going to drive it into, oh, down there among my reports or skip levels, there was a miscommunication.

Whereas if you were truly doing it from the bottom up, the five whys might lead up to the C level. So, the five whys as a practice doesn't actually provide better guidance in how you run an investigation. So, I'd prefer we don't use that as a crutch. But I understand that in some organizations in the hands of a well-intentioned and well-informed facilitator, it can make the investigation more palatable to the organization.

Yoz Grahame: So, yeah, it's all about perspective, it sounds like, or rather, where you're looking from. And I can see the dichotomy, the paradox there, in that if you're looking from the bottom up, you get maybe a more accurate view. But that also means that you have less power to act on it.

Casey Rosenthal: It's a different narrative, right? Like who's to say which one's more accurate than the other. But in a situation that's hierarchical, well, let's at least say it's intuitive that you're going to change the holistic system much more efficiently if you affect the people at the top of the hierarchy, not the people at the bottom of the hierarchy.

Yoz Grahame: Right. Well, that makes sense.

Casey Rosenthal: And really when we talk about reliability, yeah, we want to look at the holistic system, not the people at the sharp end.

Yoz Grahame: Yeah, yeah. Because the whole point of the [inaudible 00:59:29] as it were, right. If you're getting paid the big bucks to be at the top, that's where the power actually is to fix things.

We would hope. We would hope, exactly. This has been absolutely fascinating. We have to stop there. I could easily talk for another hour or two.

Casey Rosenthal: Yeah. This was a lot of fun.

Yoz Grahame: Thank you so much for joining us today.

Casey Rosenthal: Thank you for having me on.

Yoz Grahame: Absolute pleasure. Please do come back soon. I am going to be devouring your book.

So, we're going to be back next week. For those of you watching, thank you so much for joining us. We didn't get to all the questions, I'm afraid. There's a good one about knowledge workers and decision making. So, that may apply, and it would be good to talk about that soon.

But thank you. Thank you again for joining us. Your book, Chaos Engineering, which you wrote with ... So, we'll bring the slide back.

Casey Rosenthal: Nora Jones.

Yoz Grahame: Nora Jones. If you're watching, you can get that for free. And we will be back next week. Thank you for the giveaway. And Casey, CEO of Verica.io. Casey Rosenthal, thank you so much. This has been completely fascinating.

Casey Rosenthal: Thank you.

Yoz Grahame: The video will be online. Well, it's going to be on Twitch shortly and then there'll be a blog post in a couple of weeks. And we'll be back next week for Test In Production again. Thank you so much. Thank you, Casey. Talk to you soon.

Thank you so much.

Bye.
