How Progressive Delivery Enables Us to Do Chaos Engineering

Ana Margarita Medina, Gremlin

Experiments let us observe, discover, learn and create change. Ana will share how you can start using Chaos Engineering for development, testing, and deployments to build more resilient applications.

Ana Margarita Medina

Ana is a Chaos Engineer at Gremlin in San Francisco, CA. Previously, she worked at Uber as an engineer on the SRE and Infrastructure teams, where she specifically focused on chaos engineering and cloud computing. Catch her tweeting at @Ana_M_Medina about traveling, diversity in tech, and mental health.

ANA MARGARITA MEDINA: Hi, all, thank you for coming to my talk. My name is Ana Margarita Medina, I'm very excited to be here. I would like to thank you all for coming to my talk, I would also like to thank the Trajectory Conference team for making sure that we still have a conference amidst this pandemic and making sure that we pivot to a virtual conference.

Today, I'll be talking about how progressive delivery enables us to do chaos engineering. Today's talk is about buzzwords. Buzzwords next to buzzwords. Are we excited yet? Well, actually, no. Today's talk is going to be about experimentation and being customer focused. So let's start defining these so-called buzzwords. We'll start with chaos engineering. Chaos engineering is the science of performing intentional experimentation on a system by injecting precise and measured amounts of harm. This is done to observe how the system responds for the purpose of improving the system's resilience. Chaos engineering can be done in the application layer and on the infrastructure layer. The other buzzword we have to define today is progressive delivery. Progressive delivery is the technique of delivering changes first to small, low-risk audiences, and then expanding to larger, riskier audiences, while validating results as we go. Progressive delivery can be implemented in various ways, such as canary deployments, feature flags, and gradual rollouts.

My name is Ana Margarita Medina, I'm very excited to be here with you all today. Feel free to connect with me on Twitter after today's talk. I would also like to give a shout out that I'm a proud Latina, I carry the flags of Costa Rica and Nicaragua with lots of love and pride. I started writing code in 2007. This got me started writing code for frontend development. Then I transitioned to backend systems and somehow found my way into building mobile applications for Android and iOS. But in 2016, I started working as a site reliability engineer, and this also got me to learn about chaos engineering. And I totally fell in love with this field. So I'm very excited today to be talking about some modern development process technologies. Why should we do progressive delivery? Why should we do chaos engineering? And why should we do both? Well, first, we have to talk about how complex our systems currently are. We have seen that several years ago, as we started transitioning to the cloud, we started adopting DevOps technologies. And for many organizations, this also meant that we started adopting microservice technologies.

Things are only getting more complex, which makes things more difficult to operate. The pressure for faster innovation is driving the adoption of new types of infrastructure, application architectures, and development processes. When we look back at our legacy applications, we see that we had hundreds of servers while we were running on-prem, but we only had one service 'cause we were running a monolith. And the way that we did rollouts followed the waterfall process, which meant that we had only one annual release.

 We lifted and shifted, we rearchitected our applications. And now we're in this cloud native space, where with DevOps technologies, we've been able to adopt daily releases. If your organizations have also adopted microservices, you might be overseeing not just one service, but hundreds of services, maybe even thousands. And then if you've also decided to adopt Kubernetes, now you have hundreds of thousands of resources to look for, whether it's containers, pods, or network layer infrastructure. It is a tough fact to face, but as we know, our software is gonna break. And the world we're building relies more and more on the stability of naturally brittle technology. The challenge we face is how do we continue innovating while delivering products and services to our customers in a way that minimizes the risk of failure as much as possible. We know that our businesses, health, and safety rely on applications and systems that will fail. So what are we gonna do about it?

I come from Silicon Valley, and today I'm recording my talk from San Francisco. And one of the things you always hear about in this space is that we wanna move fast and we wanna break things. But what if I tell you that there's a better way to adopt this mindset? And that is, how about we go ahead and we build things, we break them, and we verify that they're resilient while we're still moving fast? So how do we go about doing that? Well, progressive delivery and chaos engineering are one of your paths. And as I mentioned earlier, this is putting a focus on experimentation and on being customer-focused. At the end of the day, we have to embrace failure as humans, as organizations, as companies. It's okay to break things and it's okay for things to fail. But we have to engineer for them to happen on a small scale. When we let failure affect our customers, whether they're internal or external, that is not okay, that is not acceptable. But it's not only just that, it's also very, very expensive.

How do we engineer for that? Well, we go ahead and we implement experiments. Chaos engineering and progressive delivery are driven by experimentation. Experiments let us observe, discover, learn, and create change. We do this as individuals, we do this as teams. We do this together as a company. In chaos engineering and progressive delivery, we always consider what we are impacting. Progressive delivery and chaos engineering have been created for you to experiment safely. And one of the ways that this actually gets done in organizations is that both practices have a concept called blast radius. In chaos engineering, we talk about how many servers, containers, microservices, and applications are affected in our experiments. In progressive delivery, we talk about blast radius just as often. We define the environments we're gonna implement progressive delivery in. For example, for canary deployments, we identify how much of our traffic we're gonna route to a canary deploy. For feature flags, we do things like go ahead and identify what users we would like to target in our experiment group. But both practices also take this a step further. With chaos engineering, you also have the ability to define something called magnitude.
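
A minimal sketch of blast radius on the progressive delivery side, assuming a hash-based traffic split. The function names (`bucket`, `routes_to_canary`) and the user IDs are hypothetical and not taken from any specific tool. The idea is only that a fixed percentage of traffic lands on the canary, and that the same user always gets the same answer.

```python
import hashlib


def bucket(user_id: str, buckets: int = 100) -> int:
    """Map a user ID to a stable bucket in the range [0, buckets)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets


def routes_to_canary(user_id: str, canary_percent: int) -> bool:
    """Return True if this user falls inside the canary's blast radius."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    return bucket(user_id) < canary_percent


if __name__ == "__main__":
    users = [f"user-{n}" for n in range(1000)]
    in_canary = sum(routes_to_canary(u, 5) for u in users)
    print(f"{in_canary} of {len(users)} users routed to the canary")
```

Because each user's bucket is fixed, widening the canary from 5% to 50% keeps every earlier canary user in the canary, so the blast radius only grows and never reshuffles.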

Magnitude is the impact that your experiment is injecting. For experiments in the resource layer, that could be something like the percentage of CPU increase that you're about to inject into your system. In experiments like latency, this can be how many milliseconds of latency you're actually injecting. And in progressive delivery, in the example of feature flags, you narrow down the scope to just a few users, and this can be defined as internal employees, maybe your users in Colombia, or a company that's about to beta test your new product. Another way they both enable you to experiment safely is by providing you the ability to stop these experiments. Both these practices are about doing this in a controlled, planned, and intentional way.
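
One way to picture magnitude next to blast radius is as two fields on an experiment definition. This is a sketch under assumptions: the `Experiment` class, its field names, and the `max_targets` guardrail are all hypothetical and do not describe any chaos engineering platform's API.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Experiment:
    kind: str                  # "cpu" (percent) or "latency" (milliseconds)
    magnitude: float           # how much harm to inject
    targets: Tuple[str, ...]   # blast radius: which hosts or containers
    max_targets: int = 3       # hypothetical guardrail so the blast radius starts small

    def __post_init__(self) -> None:
        if self.kind not in ("cpu", "latency"):
            raise ValueError(f"unknown experiment kind: {self.kind}")
        if self.kind == "cpu" and not 0 < self.magnitude <= 100:
            raise ValueError("CPU magnitude is a percentage in (0, 100]")
        if self.kind == "latency" and self.magnitude <= 0:
            raise ValueError("latency magnitude must be positive milliseconds")
        if not 0 < len(self.targets) <= self.max_targets:
            raise ValueError("blast radius must cover 1 to max_targets targets")

    def describe(self) -> str:
        unit = "%" if self.kind == "cpu" else "ms"
        return f"{self.kind} {self.magnitude:g}{unit} on {len(self.targets)} target(s)"
```

Rejecting an oversized target list or an out-of-range magnitude when the experiment is defined keeps a typo from turning a small experiment into a large one.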

In feature flags, we have things like kill switches. This turns off a feature flag. In chaos engineering, you go ahead and you define some abort conditions. Abort conditions are the things that are going to cause you to halt an experiment. And when you're doing chaos engineering, you also wanna do this on a platform that has the ability to halt an experiment, or maybe halt all the experiments that you're running for a certain application. Failure is going to happen. "Are you ready to fail?" is what you wanna ask yourself and your team. These two practices put a focus on risk management, preventing outages, and preparing for failure. Does your team know what to do when a failure happens? Do they know where to look, where to find the dashboards, where the runbooks are? How do they escalate to their manager, how do they communicate with their leadership team that there's an outage going on? And how do you go ahead and communicate with your customers that you're currently having an incident?
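
The ability to halt one experiment, or every experiment for an application, can be sketched as simple bookkeeping. The `ExperimentRegistry` class and the experiment and application names below are hypothetical. A real platform would also have to undo the injected fault on the targets, which this sketch does not attempt.

```python
from typing import Dict, List


class ExperimentRegistry:
    """In-memory record of running experiments, keyed by experiment name."""

    def __init__(self) -> None:
        self._running: Dict[str, str] = {}  # experiment name -> application

    def start(self, name: str, application: str) -> None:
        self._running[name] = application

    def halt(self, name: str) -> bool:
        """Stop one experiment. Returns False if it was not running."""
        return self._running.pop(name, None) is not None

    def halt_all(self, application: str) -> List[str]:
        """Stop every experiment running against one application."""
        halted = [n for n, app in self._running.items() if app == application]
        for name in halted:
            del self._running[name]
        return halted

    def running(self) -> List[str]:
        return sorted(self._running)
```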

So let's look at an example. We have an engineering team made up of Maria and Jose. They're working on a large feature that they're ready to launch. This feature mostly applies changes to the front end. And there are some backend things that are also changing, but there were zero changes to the infrastructure. They feel pretty confident about this and they're starting to deploy it. But uh oh, their pager goes off. They broke their staging environment. Thankfully, they had implemented continuous delivery. It wasn't production that was affected. This thankfully only affected staging, but it still sucked. They talked amongst themselves, they looked at their dashboards. They talked about this with the rest of their team, and it didn't make sense. Their test coverage was 99%. It worked locally, but they were unable to understand what happened from the build logs of their CI/CD tooling. Their monitoring dashboards were not showing anything out of the ordinary.

They iterated on it, but they were scared to deploy again. "I still don't feel confident in rolling this out again," Maria said. They knew, because they had implemented CI/CD, that they would be able to catch those failures if it was to happen, but they still weren't sure. That confidence was completely missing. Well, Emilia is one of the site reliability engineers in their organization. And she is also the embedded SRE for Maria and Jose's team. She wants to help her development team feel more confident in their deploys. So she decides to ask them, "Have you considered progressive delivery, specifically feature flags? This would allow you to start building your confidence in your deploys." The reason that Emilia is really interested in enabling Jose and Maria to use feature flags is that she wants additional visibility into the system resources to understand how this feature talks to the rest of the services. She is curious to know how the resources on the application and the servers are being handled. Even though the dashboards are showing everything's perfect, things are not matching up with that staging environment being completely broken. So they go ahead and talk about progressive delivery. They start talking about the blast radius only affecting the service that they're currently implementing this on.

Then they also go ahead and talk about specific feature flag terminology, and that includes things like user targeting. User targeting allows them to control who's seeing this feature. With kill switches, they have a toggle to disable the single feature flag that has a bug in the case that things start going sideways during a release. And they've also implemented some fallback values, so that until the connection to the feature flag service is made, they're gonna use this code. So they're now ready to redeploy. Their feature flag is on, and as they have created this experiment, they have decided for the user targeting to be those internal employees. That scope of the failure is gonna be a lot smaller and they're gonna ask for feedback as they do this.

As they're ready to implement this, another engineer found out that this weird resource issue had broken the staging environment, and their name was Ana. They were very curious to see what was going on. Ana came to the team and asked them, "Have you looked into implementing chaos engineering alongside your progressive delivery? We're gonna set up some game day time, we're gonna set up a working session. Let's go through and actually look at those build logs, let's look at the code of your application and try to understand what happened so we can actually craft some thoughtful chaos engineering experiments around this."

Ana sat down with the team and talked to them about the blast radius, the magnitude, and the abort conditions of this experiment. With the blast radius, they wanted to discuss the number of hosts and containers that were being targeted in this experiment. With the magnitude, they wanted to be thoughtful of the intensity of the attack that they were about to run. And they needed to discuss those abort conditions: what was going to cause them to actually stop this experiment? So when it came to implementing this, they decided that the blast radius was only gonna be scoped out to that new service. Thankfully, they had also just implemented progressive delivery, so they were able to inherit that blast radius from progressive delivery. Then when it came to the experiment, because they were having the most questions around the resource layer, they decided to unleash a CPU experiment.
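
The three feature flag ideas in this scenario (user targeting, a kill switch, and fallback values) can be sketched together. This is an assumption-laden illustration: the `FeatureFlag` class, the `evaluate` helper, the flag name, and the user IDs are all hypothetical, and a missing flag (`None`) stands in for "the flag service has not answered yet".

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class FeatureFlag:
    name: str
    enabled: bool = True  # kill switch: set to False to turn the feature off
    targeted_users: Set[str] = field(default_factory=set)  # user targeting

    def is_on_for(self, user_id: str) -> bool:
        return self.enabled and user_id in self.targeted_users


def evaluate(flag: Optional[FeatureFlag], user_id: str, fallback: bool = False) -> bool:
    """Evaluate a flag, using the fallback until the flag service has answered."""
    if flag is None:  # hypothetical stand-in for "flag service not reachable yet"
        return fallback
    return flag.is_on_for(user_id)


if __name__ == "__main__":
    flag = FeatureFlag("new-feature", targeted_users={"maria@internal", "jose@internal"})
    print(evaluate(flag, "maria@internal"))   # targeted internal employee
    print(evaluate(flag, "customer-42"))      # outside the experiment group
    flag.enabled = False                      # flip the kill switch
    print(evaluate(flag, "maria@internal"))
```

Defaulting the fallback to off means that an unreachable flag service leaves users on the existing code path rather than on the unreleased feature.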

The magnitude of that experiment was that it was going to inject 50% CPU and later scale up to 60%. Because they had to define some abort conditions along with the hypothesis of this experiment, they decided that they would stop this experiment if they saw any 400 or 500 errors within their application, or if they saw their error rate continuously go up along with their system and container resources spiking to a level that was not common. They implemented their chaos engineering experiments and they were able to see that their chaos testing was successful alongside their progressive delivery rollout. They felt way more comfortable with this deploy. At the end of the day, the system and the application handled the failures with no issues. And alongside their working session and their game day, they also realized that their dashboards were not looking at the right metrics for the containers.
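
The staged magnitude (50%, then 60%) and the abort conditions can be sketched as a loop that checks metrics after each stage. Everything here is an assumption for illustration: `Snapshot`, `should_abort`, `run_stages`, the `observe` callback, and the 90% CPU ceiling are hypothetical. The sketch also simplifies the abort conditions to "any 4xx/5xx response" or "CPU above an agreed ceiling" and does not model a rising error rate over time.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Snapshot:
    status_codes: List[int]   # HTTP status codes seen during the stage
    cpu_percent: float        # observed CPU on the targeted containers


def should_abort(snapshot: Snapshot, cpu_ceiling: float) -> bool:
    """Abort on any 4xx/5xx response or CPU above the agreed ceiling."""
    has_errors = any(code >= 400 for code in snapshot.status_codes)
    return has_errors or snapshot.cpu_percent > cpu_ceiling


def run_stages(
    stages: Sequence[int],
    observe: Callable[[int], Snapshot],
    cpu_ceiling: float = 90.0,  # assumed threshold, not from the talk
) -> Tuple[List[int], bool]:
    """Run each magnitude in order and stop at the first abort condition."""
    completed: List[int] = []
    for magnitude in stages:
        snapshot = observe(magnitude)  # hypothetical: inject load, then collect metrics
        if should_abort(snapshot, cpu_ceiling):
            return completed, False
        completed.append(magnitude)
    return completed, True
```

Calling `run_stages([50, 60], observe)` returns the magnitudes that completed cleanly and whether the whole run finished, so a run that aborts at 60% still records that 50% was handled.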

So they were able to roll out some improvements within their dashboards. And because they had enabled their internal team to beta test this feature with a feature flag, the team received great feedback on the new feature. They still had some time to implement that feedback prior to production launch. So Emilia, Maria, Jose, and Ana were able to test this feature with the rest of the organization and successfully roll it out to everyone. The leadership team and their management team were really happy, and this really showed the way that our engineering teams can actually work together. Some of the wins from this scenario include that they were able to deploy reliable software, and that their development team and their SRE team were confident in the way that this was incrementally rolled out to production. They beta tested with a small subset of customers, and even if those were internal customers, their feedback still mattered very much and they had time to implement it. They also went ahead and improved their monitoring. And they realized that they had no runbook for their services, so they also created that. Those are two huge wins that many organizations sometimes overlook. They iterated and continued implementing the feedback that those internal customers gave. So they were able to launch successfully and make their external customers pretty happy about this.

So when we look back at the topic of progressive delivery and chaos engineering, we see that these practices are best implemented across the pipeline. So you wanna do this on your development environment, you wanna do this on your testing environment and your deployments, to gradually roll this out to production. We can run these, as I said, in development. Then we can go ahead and do that in testing and staging. Maybe we wanna do this alongside our QA team until we grow that confidence to run these in production. And the way that we gain that confidence is that we start small and we build on those small successes incrementally, as we gain the confidence that our systems are more reliable and that things are less likely to break in a way that is going to affect the customer. At the end of the day, this is for our customers to have resilient user experiences. But this is also for our developer teams, our ops teams, our site reliability engineering teams, for them to get feedback earlier in the development process and prior to the full launch of their features and their changes. When chaos engineering and progressive delivery are woven together, we have modern software development. Let's go ahead and let our engineering teams spend more time building and creating value and less time firefighting and managing risk. Failure is going to happen. So you also wanna go ahead and run those chaos engineering experiments on everything. That also includes your feature flag platforms, your CI/CD environments, maybe even the people systems part of things. If this talk got you interested in learning more about chaos engineering, come and join the chaos engineering community. Head on over to gremlin.com to join the Slack that has over 5,000 folks adopting these modern DevOps technologies, from folks that are just getting started in these practices to maybe some that have been building resilient systems for five or six years.

I would like to thank you for coming to my talk, I would like to thank Trajectory Conference for putting this together. Feel free to reach out to me via social media or email if you have any questions. And if you're interested in trying Gremlin's free product, head on over to go.gremlin.com/ana. Thank you.