A Key to Success: Failure with Chaos Engineering
Test in Production is back! In May we hosted the Meetup at the Microsoft Reactor in San Francisco. The focus of this event was the culture of failure. Specifically, we wanted to hear how the culture of failure (avoiding failure, recovering from failure, and learning from failure) has an impact on how we test in production.
Ana Medina, Chaos Engineer at Gremlin, spoke about how performing Chaos Engineering experiments and celebrating failure helps engineers build muscle memory, spend more time building features and build more resilient complex systems.
"Very much of trying to build that courage, trying to be okay with the idea that failure's going to happen and just being okay with your teammates failing and celebrating those failures, realizing you can learn a lot from these failures." - Ana Medina, Chaos Engineer at Gremlin
Watch Ana's talk to learn more about how Gremlin's approach to running chaos experiments helps everyone learn and build more resilient systems. If you're interested in joining us at a future Meetup, you can sign up here.
Host: All right, is everybody ready for a little bit of chaos? It's time for our next speaker. I'm really excited about this, I love the idea of failure with chaos engineering and how chaos can be a key to your success. Coming up is a current chaos engineer at Gremlin. Gremlin helps companies avoid outages by running proactive chaos engineering experiments. Coming from Uber, and coming from the SRE and infrastructure teams, focusing on chaos engineering and computing, now at Gremlin, we have Ana Medina.
Ana Medina: Hey y'all, thank you for coming out today, thank you for sticking around for the last talk. Very excited to talk to you all a little bit about failure, and about chaos engineering.
As I mentioned, my name is Ana Medina. I work at a company called Gremlin as a chaos engineer.
Humans are fallible. Fallible means capable of making mistakes or being wrong. With a raise of hands, who here has made a mistake before? Yeah? Yeah, mistakes happen, right?
Well, I want to talk to you about actually changing the way that we see things right now. I want to talk about creating a culture that teaches us that failure is okay. So I believe that inclusive cultures embrace and inject failure, and that's a little bit of what I'll be talking about today.
There's a quote from a philosopher, Friedrich Nietzsche, who says, "Even the most courageous among us only rarely have the courage to face what he already knows." Very much of trying to build that courage, trying to be okay with the idea that failure's going to happen and just being okay with your teammates failing and celebrating those failures, realizing you can learn a lot from these failures.
So, what does that have to do with anything? Well, today I'm going to be talking about chaos engineering. With a raise of hands, how many of you here have heard of chaos engineering before? Cool. Yay. Very glad to hear that chaos engineering is being talked about in testing in production. It goes very much hand in hand.
So to recap a little bit, what is chaos engineering? Chaos engineering is thoughtful, planned experiments designed to reveal the weaknesses in our systems. This is not about just running experiments hoping everything goes well; this is about doing experiments in a very thoughtful and very planned way. So when folks hear chaos engineering, they're like, "Wait, no, I don't want people just breaking my production because they can." And it's like, no, we don't talk about just going in on Monday and saying, "Let's do chaos engineering today in production," and have folks either pull out cables, or just shut down hosts, or do all this other chaos stuff, because that's not very planned nor thoughtful.
So, at Gremlin we've been talking about this analogy that chaos engineering is like a vaccine where you inject something harmful into your system in order to build an immunity. So that means actually preparing your engineers to be ready for those failure scenarios.
What are the principles of chaos engineering? Well, we first start off with planning an experiment. Then we think about containing that blast radius, and then following up, you either want to expand that blast radius and scale it, or you actually want to squash that chaos engineering experiment and go fix the issues that you uncovered, and run that chaos engineering attack again.
So to give it a little bit more of a graphic, this is what we call the blast radius. You first start off with a very, very, very small initial test, then you see that chaos engineering experiment expand, and maybe you see that the chaos engineering experiment has been successful, and you actually go ahead and expand that blast radius. The idea is that if you actually don't know what one chaos engineering experiment is going to do to two of your hosts, why would you run it on 100% of your fleet? So you actually want to start off in a very thoughtful way of like, "I'm going to start off with one host, one container, just one service, and then continuously scale that blast radius up." And that also means hey, maybe I don't feel comfortable enough to run this chaos engineering experiment in production. Well, we can start off doing this in our testing environment, in our QA environment, and then later growing that chaos maturity to actually be able to run chaos engineering experiments in production.
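The staged approach described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `run_staged_experiment`, the stage sizes, and the `attack` callback are hypothetical names, not part of Gremlin's product or any other tool. The callback is assumed to return True when the system stayed healthy while the given hosts were attacked.

```python
from typing import Callable, List, Sequence


def run_staged_experiment(
    hosts: Sequence[str],
    stage_sizes: Sequence[int],
    attack: Callable[[List[str]], bool],
) -> List[str]:
    """Attack a growing number of hosts and stop at the first failing stage.

    `stage_sizes` lists how many hosts to target per stage, in increasing
    order. `attack` is a hypothetical callback that runs the experiment on
    the given hosts and returns True if the system stayed healthy.
    Returns the hosts targeted in the last stage that passed.
    """
    last_ok: List[str] = []
    for size in stage_sizes:
        # Never target more hosts than the fleet actually has.
        targets = list(hosts[: min(size, len(hosts))])
        if not attack(targets):
            # Squash the experiment here: fix what it uncovered, then re-run.
            break
        last_ok = targets
    return last_ok


if __name__ == "__main__":
    fleet = [f"host-{i}" for i in range(100)]
    # Stand-in for a real experiment: pretend the system copes with
    # at most 10 hosts being attacked at the same time.
    survived = run_staged_experiment(fleet, (1, 10, 50, 100), lambda t: len(t) <= 10)
    print(f"largest passing blast radius: {len(survived)} hosts")
```

The same loop could be run first against a testing or QA fleet and only later against production, which matches the idea of growing chaos maturity over time.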
The other thing that we need to consider when we talk about blast radius is thinking about some abort conditions. What would cause me to actually stop this chaos engineering experiment? Some of that actually might include when my error rate goes up: when I see users hitting errors, or my error rate on the API calls is continuously going up, I want to go ahead and actually stop that experiment. Or if my users are having a lot of latency, let's say you're running this on production on an iOS app and you're seeing that a chaos engineering experiment that you had by just shutting down some containers now means that images are taking around five seconds more to load up on the user screens, and they're just used to having that in just a few milliseconds. Maybe that means that you actually want to go ahead and actually stop that chaos engineering experiment.
But one of the things that you need to consider when talking about chaos engineering is that you want to make sure that the platform or tool that you're using has a big red button, and that means that you're able to press that big red button in case you actually see something on the user experience side, on your monitoring or observability that shows that your chaos engineering experiment is not being successful.
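To make the abort conditions concrete, here is a minimal Python sketch of the check that would sit behind a "big red button". The class name, field names, and thresholds are all assumptions chosen for illustration; in practice the numbers would come from your own monitoring and your own tolerance for user impact.

```python
from dataclasses import dataclass


@dataclass
class AbortConditions:
    """Hypothetical thresholds that halt an experiment when crossed."""

    max_error_rate: float        # fraction of failed requests, e.g. 0.05 for 5%
    max_added_latency_ms: float  # tolerated latency on top of the baseline


def should_abort(
    conditions: AbortConditions,
    error_rate: float,
    baseline_latency_ms: float,
    current_latency_ms: float,
) -> bool:
    """Return True if either abort condition is met."""
    added_latency_ms = current_latency_ms - baseline_latency_ms
    return (
        error_rate > conditions.max_error_rate
        or added_latency_ms > conditions.max_added_latency_ms
    )


if __name__ == "__main__":
    conditions = AbortConditions(max_error_rate=0.05, max_added_latency_ms=500)
    # Images that normally load in 200 ms now take about five seconds longer.
    if should_abort(conditions, error_rate=0.01,
                    baseline_latency_ms=200, current_latency_ms=5200):
        print("abort: halt the experiment and roll back")
```

A real setup would evaluate this check continuously while the experiment runs, so that halting does not depend on someone watching a dashboard.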
So the way that we actually go about doing chaos engineering I mentioned was very planned and very thoughtful. The way that we go about doing that is by using a scientific method. So we go ahead and we actually form a hypothesis. We go ahead and say, "If I shut down my primary container for Redis, I believe my Redis replica will get promoted to primary. I will suffer no data loss, I will continue having a good experience, and everything will run smooth because I have a good Redis configuration."
Well, you go ahead and you actually perform a chaos engineering experiment on that hypothesis, and I actually did this for a talk that I gave last November at AWS re:Invent. And I ran that chaos engineering experiment, I shut down my primary container for Redis, and well, I ended up realizing that the configuration out of the box for Redis basically meant the replica container of Redis looked at primary and saw the primary one empty when it got shut down, therefore it itself became empty. And then it got promoted to primary, suffering a complete data loss.
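The hypothesis-then-experiment loop can be written down as a small harness. The sketch below is an assumption-laden illustration in Python, not the setup used in the re:Invent talk: `run_experiment` is a hypothetical helper, and an in-memory dictionary stands in for the datastore so the example runs without Redis.

```python
from typing import Callable, Dict, Set


def run_experiment(
    hypothesis: str,
    snapshot_keys: Callable[[], Set[str]],
    inject_failure: Callable[[], None],
) -> Dict[str, object]:
    """Compare the stored keys before and after an injected failure.

    `snapshot_keys` and `inject_failure` are hypothetical callbacks: the
    first returns the keys currently stored, the second causes the failure
    being tested (for example, shutting down the primary container).
    """
    before = snapshot_keys()
    inject_failure()
    after = snapshot_keys()
    lost = sorted(before - after)
    return {"hypothesis": hypothesis, "holds": not lost, "lost_keys": lost}


if __name__ == "__main__":
    # A dict stands in for the datastore; clearing it mimics the observed
    # outcome where the promoted replica ended up empty.
    store = {"user:1": "ana", "user:2": "kelly"}
    result = run_experiment(
        "Shutting down the primary causes no data loss",
        snapshot_keys=lambda: set(store),
        inject_failure=store.clear,
    )
    print(result)
```

Writing the hypothesis down next to the measured result also gives you something concrete to share afterwards, whether the experiment passed or failed.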
Of course, if you were running this in production, having a data loss will cost your company a lot of money, and you will have very unhappy customers. So you don't necessarily want to get to an extreme case of doing something like that. So you want to think about doing chaos engineering first in a safer space to actually be able to verify that you have configured your things properly, that things are getting promoted and you're not suffering data loss, that this is showing up in your monitoring tools, observability tools, that engineers are getting paged to take action. So you actually don't find out that your customers are really unhappy when they just can't access any of their data on the application anymore.
Then, once you see this chaos engineering experiment be successful or actually fail, well, you actually want to share those results. You want to share those failures. You want to tell folks, "Hey, by the way, we ran some chaos engineering experiments last month and we actually detected that we weren't configuring our applications properly, so these are the steps that we have taken to make our applications maybe more resilient, more reliable." So, sharing those results really plays a big part.
This is a blast radius once again to kind of bring it all back together. First we run that first initial chaos engineering experiment. We analyze those results, we see that everything was successful, therefore we expand the blast radius and continue performing chaos engineering experiments.
Now I wanna talk a little bit about the culture of failure, the way that I have been seeing failure happen and more in the sense of I've been doing chaos engineering for about three years. I actually joined Uber into the chaos engineering team and was able to learn how to build an internal tool for it, was on call for that. It was a lot of fun, learned a lot. But now that also made me realize that I wanted to join a company that was helping other companies perform chaos engineering themselves.
So, I'll start off with talking about game days. At Gremlin, we proactively are embracing failure and we're constantly telling our customers to do this. Of course, we're telling them to use Gremlin to do this stuff. But one of the things that we do is run what we call game days, and that basically means bringing together a team and talking about your architecture diagram and thinking of what could actually go wrong. We actually have a whole bunch of resources that Gremlin has put together on how to do game days: these are templates for agendas, templates on what are some of the chaos engineering experiments you can run, some email templates, and stuff like that. But the idea is that y'all come together and think about architecture diagrams, and think about how you have built stuff and where things can actually fail.
This is also a great time to look back at post-incident reviews and think, "Hey, we actually had an incident happen a few months ago and we actually want to make sure that we've taken all these action items and actually have put it in place." So you actually perform chaos engineering experiments on those same conditions that you covered in your post-incident review.
So, at Gremlin we have actually been doing this. We actually use Gremlin on Gremlin. We definitely do it for the dogfooding purposes, but we also want to make sure that we're being really resilient and showing folks how to be resilient. So about two months ago, we actually put together the roadmap of what our game days look like, and kind of getting folks with the idea of like, "Hey, you want to do chaos engineering, but you don't necessarily know where to start?" The way that we have built the game day roadmap for this year is actually looking at the SRE book, and thinking about the hierarchy of the way that you build stuff with the SRE mentality.
So the first fundamental layer of it is making sure your monitoring and alerting is set up properly. So we already have that blog post out now that says we ran a game day to verify that our alerting, our monitoring was set up properly at Gremlin, and we did that on staging about, I think last month, so we saw, we uncovered a whole bunch of stuff. We were actually able to learn more about our monitoring tool, which is Datadog, and just last Friday we actually got that same game day, and we ran it in production. We were just trying to verify that we have our monitoring for production running properly, and that we're actually getting pages, and even the questions of, "Hey, when folks get this page are they actually getting a link to a run book, or something that makes them actually take action?" And this is really cool to see this interaction.
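One of the questions raised here, whether a page actually links to a run book, is easy to check mechanically before a game day. The Python sketch below assumes alert definitions have been exported as plain dictionaries with hypothetical `name` and `runbook_url` fields; it does not use Datadog's API or any real alert schema.

```python
from typing import Dict, List


def alerts_missing_runbook(alerts: List[Dict[str, str]]) -> List[str]:
    """Return the names of alerts that have no run book link.

    Each alert is a dict with a `name` and an optional `runbook_url`.
    Both field names are assumptions made for this example.
    """
    return [
        alert["name"]
        for alert in alerts
        if not alert.get("runbook_url", "").strip()
    ]


if __name__ == "__main__":
    exported_alerts = [
        {"name": "api-error-rate", "runbook_url": "https://wiki.example.com/runbooks/api"},
        {"name": "redis-replication-lag", "runbook_url": ""},
        {"name": "disk-usage"},
    ]
    for name in alerts_missing_runbook(exported_alerts):
        print(f"no run book linked for alert: {name}")
```

A check like this only proves a link exists. Whether the run book helps someone take action is what the game day itself tests.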
The other thing with the way that game days are run is that you can bring anyone together to these game days. So when I'm telling folks that they want to bring chaos engineering into their companies, we're very much telling them, "Hey, think about bringing some interns and some mid level engineers, and bring in your architects, that way you can also share knowledge and be able to think about let's learn together, and go over the decisions that we have made as a company."
The other way that we're constantly embracing failure and able to do chaos engineering is by doing something called Takedown Thursdays. Fun fact, we used to call them Failure Fridays, but as we've continued growing the company we realized that running chaos engineering experiments on Friday evenings was becoming really hard. We're a remote by default culture, so folks on the east coast were staying up a little too late on Friday, so to accommodate them we moved to Takedown Thursdays. And the way that we do Takedown Thursdays is actually more of we're launching a new feature, we're doing a new rollout of something else, so we all come together, mostly product engineers, and they think about different ways that things can break. So not only are you using the application, we're also running chaos engineering experiments on the back end, and that could also be including some load testing on the side, or just in general preparing for high traffic events.
The other thing that I like talking about with chaos engineering that kind of promotes this culture of failure is by doing the postmortem outage reproduction using chaos engineering. So that also means looking at the post-incident reviews and then thinking, "Hey, I actually want to make sure that the engineers have gone in and fixed what they found to be the quote, unquote root cause." And those conditions end up being formed with the chaos engineering experiment, and you're able to verify that you took those action items.
But with that conversation I also like saying that in part of this culture of failure, we have a lot of folks that have been sharing these postmortems at other, larger companies, and we can still learn from them. You can go over on GitHub and find a whole bunch of post-incident reviews that have been talked about publicly, and you can take this maybe as a fun activity to do with your team of like, oh, go read one and think about how that could actually apply to your infrastructure, your application, and your services, and then form a chaos engineering experiment with those conditions and perform it. That way you also are able to think about, "Hey, these folks are sharing their ideas on failure, let's see how that actually applies on my scenarios."
Then, on-call training using chaos engineering. Well, how many of y'all got onboarded to on-call by just getting thrown a pager and saying, "Hey, you're on, deal with it." Maybe have a run book. That was actually my story. I got put on-call my third week at the company without ever having production knowledge or any systems experience, so when that page came in my second week on-call at 3:00 in the morning and we're having a production outage, looking at that run book was scary. I was primary on-call, 3:00 in the morning, you kind of didn't have ... that company didn't have necessarily a culture that was okay to escalate.
So I freaked out, and I was like, "Okay, I know we have this run book, we have all these dashboards." I went to go look at them, and ran what the run book said to do. But I was so scared. I was very much like, "Oh, maybe I come in tomorrow, I just don't have a job." And that wasn't pleasant at all. It also sucked that that run book was around 180 days old, so having stale run books just made things a little bit more hurtful when you're on-call. And looking back at it is very much of I wish I would've onboarded to on-call by actually having some ... a little bit of training of what to do, psychologically, the way I was going to deal with anxiety, to actually being able to put in commands on more of a fast-paced environment, to communicating with folks what I'm actually executing on my host, to where I'm looking at my dashboards.
So thinking about hey, let's come together and run these chaos engineering experiments to train, whether it's your intern coming into a rotation, to your staff engineer joining a new team and trying to go into a new production service. It's a great way to come together and learn either past failures or think about new failure scenarios that can happen.
To recap, it's okay to fail. I want us to fail together because we can learn together. Inclusive cultures create environments where failure is embraced because we understand humans are fallible.
So, I wanted to leave y'all with the link: if you're interested in learning more about chaos engineering, there's a chaos engineering community with over 2,500 engineers all over the world. These are great ways to connect with other engineers who think about testing in production and who actually have been running chaos engineering experiments for a few years now and are sharing their learnings and stuff like that.
And if you have any questions, this is the best way to contact me, it's on Twitter or email address. I'll also make a pitch, Gremlin is hiring, we're hiring for engineering manager, head of people, front end engineer, and a Windows engineer, so if you want to do chaos engineering on Windows, we're looking to build out our solutions to also cover Windows. Apart from that, we also have a chaos engineering conference that's going to be happening in San Francisco, so feel free to come by and ask me any questions.
There's also a free edition of Gremlin's product that came out a few months ago. So, didn't necessarily put a slide on that, but if you go over to gremlin.com, you're able to sign up and actually start running chaos engineering experiments in just five minutes. We offer two chaos engineering experiments to run on your infrastructure; that includes shutdown host, shutdown container, and maximizing resources which in this case is a CPU. It's a great way to actually verify that your Kubernetes environment is actually pretty reliable and holds true to the container space, or that you actually have monitoring and alerting set up, or autoscaling set up if you actually increase your CPU.
So, thank you, and if you have any questions, there will be a mic.
Kelly: Yeah, absolutely. Ana, thank you so much for the talk about a key to success: failure with chaos engineering. If you raise your hand we'll bring you the mic. Do we have any questions in the room? Over here, we'll start over here?
Audience member: Hey, this is on, all right. So, most operations, people already have so many places where they know that their system is horribly broken, and they already have so many priorities, like, "This needs to be fixed here." How do you split the time between destroying your systems even worse and actually remediating them? How do you justify spending time breaking them when you already know, yeah, they're already horribly broken and I'm already trying to fix it?
Ana Medina: Yeah, I mean it definitely comes up of a case very much of like, "We have enough chaos, we don't want any more chaos in our systems, why would we intentionally do that?" And a lot of it sometimes applies of those priorities and having project managers that are understanding their resiliency is really important to applications or services, so some of the ways that we have actually been tackling that with our customers is by having these game days. We come in when we're on sales parts with companies, and they allocate three hours where they come together and break some stuff. This could be on your dev environment to your production environment depending on what you're up to. But it's very much of uncovering those failures and then still embracing that failure culture and then having those discussions of, "What are the Jira tasks or outcomes that we see that we actually need to promote to be P-zeroes, and we actually allocate for it?"
And the way that Gremlin actually has been tackling a little bit of that is by ... the way that we do on-call rotations, when you're an engineer that's on-call, you actually don't work on your project that week, you actually just work on entirely different on-call Jira tasks that are created, and some of them end up being some resiliency stuff, or high-priority tickets that really affect our customers.
So that's the way that we try to think of resiliency as part of our product, and we always try making those P-zeroes.
Audience member: We have a bit of shared history that's not relevant; I'm an Uber SRE now, so I have this question. What's chaos engineering exactly testing? Is it testing the infrastructure, or is it testing the services and the products?
Ana Medina: So you can do it ... the way that I've seen it of course, like at Uber when I was doing it there, we had uDestroy, and that was only for infrastructure. But it goes up to doing just much more than that. You can do it up onto the application layer. So in terms of Gremlin, we now don't only do infrastructure chaos engineering, we also have application-level chaos engineering. The way that we have it right now is that we have a Java library that you add to your application, and you're able to wrap around calls and make your own targets on that. So you can also think about being just more thoughtful and planned.
This product, which we call ALFI, application-level fault injection, came out of something our CEO had actually built when he did chaos engineering at Netflix. And he had that idea of like, "I want to run chaos engineering experiments, but I want to run it on the PlayStation that's sitting at my house." So how do you actually become so much more specific on your chaos engineering experiment, and with a product like ALFI, now we're actually able to create targets, and not necessarily be more specific onto what EC2 instance we want to run on, what Lambda we want to run on, and be able to call it out that way. And the same way could be applied of if you actually just have an incident that only affects your Android applications, and you want to actually do this chaos engineering experiment, you're actually able to flag it of only look at the devices that are Android and not iOS in that case.
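The idea of wrapping calls and targeting only some traffic can be illustrated with a small decorator. To be clear about the assumptions: ALFI is described above as a Java library, and this Python sketch does not reproduce its API. `inject_fault`, `InjectedFault`, and the `platform` request field are hypothetical names used only to show the targeting pattern.

```python
import functools
from typing import Any, Callable, Dict


class InjectedFault(Exception):
    """Raised in place of the real call when a request matches the target."""


def inject_fault(matches: Callable[[Dict[str, Any]], bool]) -> Callable:
    """Wrap a request handler so that only matching requests fail.

    `matches` is a hypothetical predicate over the request, for example
    "the device platform is Android". Non-matching requests run untouched.
    """
    def decorator(handler: Callable[[Dict[str, Any]], Any]) -> Callable:
        @functools.wraps(handler)
        def wrapper(request: Dict[str, Any]) -> Any:
            if matches(request):
                raise InjectedFault(f"fault injected into {handler.__name__}")
            return handler(request)
        return wrapper
    return decorator


@inject_fault(lambda request: request.get("platform") == "android")
def load_image(request: Dict[str, Any]) -> str:
    return f"image for {request['user']}"


if __name__ == "__main__":
    print(load_image({"user": "ana", "platform": "ios"}))
    try:
        load_image({"user": "ana", "platform": "android"})
    except InjectedFault as fault:
        print(f"android request failed as planned: {fault}")
```

Because the predicate decides who is affected, the blast radius can be narrowed to a single device type, a single user, or a single test account before it is widened.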
Audience member: Thank you to you. I think it's very intriguing that clearly you have a culture built around chaos engineering and especially that you are embracing inclusion as part of the ability to learn from chaos. And I'm just wondering, do you see that that attitude of embracing failure, embracing learning, embracing iterative learning through failure and chaos, is that applicable beyond engineering?
Ana Medina: Yeah, definitely. I mean, we actually ... when we do our game days, we ... it's an open invite to the 55 people we have at the company. So we have folks from sales, folks from marketing sitting in on our game days, and they're also learning. The way that we actually ran our game day last week in production is that we actually had someone on sales sitting down and looking at our dashboards, and actually taking notes. Yeah, the engineers are kind of guiding them on what dashboard to look at, what the spike means, but they're actually trying to immerse themselves into this engineering space. But this is completely applicable to everything in life which is like ... it was very interesting, because I was like ... I could go on and on in this conversation because the only way that I've learned is by failing myself, and from career stuff to education to mental health, I very much am a strong believer that there's no way that I'm going to get past this unless I actually share with folks and have that transparency part of it. But it's very much of telling folks it's okay to fail, and it's okay to share your failures and come together and learn from it.