Chaos Engineering at Datadog
In August we hosted our Test in Production Meetup at the Meetup headquarters in NYC. Corey Bertram spoke about how Datadog does chaos engineering. He shared his experiences from when he led the SRE team at Netflix, and how that's influenced the way Datadog put process around chaos engineering experiments.
"This has led us internally to rethink what chaos engineering is at Datadog. We're moving from a world of game days to experiments. We want to be very explicit about what these experiments are. This is really taking the next step and saying, 'What do we want to get out of this? What are we looking to measure? What are we looking to test,' and then how can we do it safely?" – Corey Bertram
Watch to learn more about the processes Datadog has in place around their chaos engineering experiments. If you're interested in joining us at a future Meetup, you can sign up here.
Corey Bertram: Hi, I'm Corey. I'm the VP of infrastructure at Datadog. I've been there about two years. Datadog, if you don't know it, is like a SaaS offering to do infrastructure and application monitoring for engineers. We actually do a lot more than that now. We're APM, tracing, logging, et cetera. Check us out if you don't know who we are. We're pretty good. I like the product. I was a customer for three years before joining. It's pretty aces. Our tools provide a surprisingly deep level of insight into your applications for your engineers. As a result, it makes things like chaos engineering far easier to build and adapt and grow. It's in the DNA of our company to be thinking about performance, and monitoring, and resiliency.
Why should you listen to me? Unfortunately, you can't talk about chaos engineering without talking about Netflix. I spent three years at Netflix. While there, they're very famous for their chaos engineering efforts. It really spun out of the SRE team. I was the soul SRE for a year and a half. They're doing interesting work. You might have heard about Chaos Monkey. Pretty rudimentary looking back on it now. It's like it comes through, kills one of your instances, one of your nodes, et cetera. Very old school, but really had a huge impact at Netflix as they were scaling up and early adopters of the cloud. The cloud environments that they were going into were fun, would be the nice way to put that. EC2 in the early days was not the most stable platform, so it was kind of a mess.
The goal was to simulate a loss of a system so that they could build this adaptive environment, so that they didn't really have to worry about resiliency as much. As they continued to scale up, the limitations of Chaos Monkey really started to rear their head: losing a node is fine, but what happens when you lose a quarter of the nodes, a third of the nodes? The entire region? Then what? That led to my two big contributions there, which were the creation of both Chaos Gorilla and Kong. Chaos Gorilla, which is on the left, simulates an AZ failure. It really is an AZ evacuation tool internally. It's how they shift traffic when Amazon loses one of their AZs, which happens very rarely nowadays but it can happen still.
Eventually, it became clear that that wasn't enough. They had a global ELB outage in 2013, 2012, I can't remember. That really forced us to rethink multi-regional availability and resiliency, which led to Kong, which is the region evacuation script. Unfortunately at this scale, it's always interesting to watch demos. A lot of people don't necessarily have these problems at scale. They just don't operate at this scale. It's something else to start moving a million RPS between regions, in real time, while scaling up and adding resources, when sometimes the cloud provider doesn't have those resources. You can't just ask for ten thousand machines. They're like, "Wait a second. That doesn't exist in the data center." It's a really interesting problem.
One of the things that has become very clear to me after talking about this the last almost decade is that what works for them most likely won't work for you. Netflix has an interesting problem in that a lot of their services are stateless. It makes this really easy to do. You can go blow up stateless applications all day long and you can just load balance across new resources all the time. It turns out when you throw a CDN at a lot of your problems and caching at a lot of your problems that a lot of your problems go away. Unfortunately, not everyone has that.
There are other types of failures that you should be thinking about. It's not just about hardware or resource overloading. It can be network latency, network flapping, buggy software. At the end of the day there is a human problem here. Things break. Input variance is a huge one. Retry storms, also a huge one, especially at scale. What happens when suddenly you have a million customers all suddenly DDoS'ing? It's a nightmare. Race conditions still happen all the time, and dependency failures. Like dependency trees. Some of these distributed applications are mind boggling.
It became very clear that we need to go a step further. There were some internal efforts at Netflix to do this but at this point the community was really catching on and pushing for it, which kind of leads me to Datadog. I joined the company about two years ago. When I joined, as a customer you fall in love with a company and then you show up and you see how the sausage is made and you're like, "Oh shit." They were doing some chaos engineering. They work. They have their own Chaos Monkey of sorts. It wasn't as bad as it could have been. They were growing, going through an enormous amount of growth.
Today, when you look at the ecosystem, this is our service map. At scale, it needs some love. We're working on it. We have a lot. We just have a lot of interconnectedness. Everything talks to everything. Things get a little bit crazy. When you're dealing with hundreds of services at massive global scale, across data centers, across regions, across clouds, things get very, very complicated very fast. The failure scenarios become very complicated, very fast.
I think it's important when you're talking about chaos engineering to also talk about culture. A lot of companies that are like, "We want to do chaos engineering," and then they invest all this time and money and they build this team, and the team is incredibly ineffective because the organization is just not ready to actually do it. That is not the case at Datadog. Datadog is the most chaotic environment I've ever worked at. I love every minute of it. It is fucking fantastic. Some people would truly rip their hair out if they had to work there though. It is not for everyone.
In the last two years since I've been there, we've gone through an immense amount of growth: we went from thousands of machines to tens of thousands of machines. We've completely containerized the environment and gone to Kubernetes. We run at least one of the world's largest production Kubernetes installs as far as I can tell right now. As I mentioned, the infrastructure footprint is orders of magnitude bigger. We've also exploded the breadth of the product, from adding logs as a product, synthetic load testing, real user monitoring. The list is fairly large. The product breadth continues to expand, and everything is talking to everything, and dependencies have become more complicated, and everyone wants to be a platform team, and so everything relies on everything now.
Everything is storing state. That is the other thing here. 85% of our fleet has some type of state on it. That makes all kinds of testing incredibly difficult. You start blowing up machines and you potentially cause outages, and pretty bad ones in some cases. It turns out state at scale is very hard to refresh at times, because there's only so much bandwidth in a data center to dump that many gigs of memory somewhere.
The other thing I think is important to think about is what is the goal and is the organization aligned with the goal. For us, trust is everything. Resiliency and availability are built into who we are, because if we're down when a customer is down it leads to a really painful conversation, it's one we just want to avoid. This is near and dear to us and so we want to make sure that we're building resilient systems. Some of our customers don't bother. They literally don't care if they go down for an hour, and that's totally fine because it really doesn't have an impact on the business. It would matter to us if we were down for an hour.
When we talk about chaos engineering, one of the things I think is important is starting small, which is where Datadog was two years ago. By this I mean schedule game days, go talk to teams, don't try and automate everything, don't try and turn off everything. Go after the easy wins. Go after your stateless applications and just get the win. Start to build that cadence and make people understand, build that internal communication and feedback loop. If you look at chaos engineering, like 0.1 for Datadog, we had done about 160 game days over the course of two years. Our Chaos Monkey was like a Python script in AWS Lambda. It would come through, kill some nodes. About 80 services were scheduled to automatically run this. This still technically runs but it's on its way out. Over the course of those two years we killed like ten thousand instances. Great. Awesome. Cool. What we were finding was it wasn't really giving us a ton of data. It wasn't what we were looking for. It wasn't actually increasing resiliency outside of the lowest barrier.
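To make the "come through, kill some nodes" idea concrete, here is a minimal Python sketch of the victim-selection step of a Chaos Monkey style script. It is an illustration only, not Datadog's actual Lambda: the `Instance` type, the `pick_victims` function, the opt-in set, and the rule of never taking a service's last instance are all assumptions made for this example, and the cloud provider's terminate call is deliberately left as a print.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    instance_id: str
    service: str
    stateful: bool


def pick_victims(instances, opted_in_services, rng, per_service=1):
    """Pick up to `per_service` stateless instances from each opted-in service."""
    victims = []
    for service in sorted(opted_in_services):
        candidates = [
            i for i in instances
            if i.service == service and not i.stateful
        ]
        # Assumed safety rule: never take the last remaining instance of a service.
        if len(candidates) <= 1:
            continue
        count = min(per_service, len(candidates) - 1)
        victims.extend(rng.sample(candidates, count))
    return victims


# Hypothetical fleet used only for this example.
fleet = [
    Instance("i-001", "web", stateful=False),
    Instance("i-002", "web", stateful=False),
    Instance("i-003", "web", stateful=False),
    Instance("i-004", "db", stateful=True),
    Instance("i-005", "cron", stateful=False),
]

victims = pick_victims(fleet, {"web", "db", "cron"}, random.Random(42))
for v in victims:
    # A real script would call the cloud provider's terminate API here.
    print(f"would terminate {v.instance_id} ({v.service})")
```

In this sketch only the "web" service yields a victim: "db" is stateful and "cron" has a single instance, which mirrors the advice to go after stateless applications first.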
The other thing I tell people is to understand your steady state. Fortunately, I work at Datadog, we monitor everything. We know what our steady state is. If you don't have monitoring, if you don't have alerting, just do that. Just get some level of visibility. Don't even bother with this shit. Don't even bother to try and test in production, because you already are because you don't know what the hell is going on. You need to understand the steady state. I really almost wanted to come up here and just rage at the room for a long time when they were like, "What do you want to talk about?" I was like, "Testing in production is fucking stupid. Everyone does it and they don't know what the hell is happening." Monitoring your systems is very, very important, and I think you have to ...
This is where the SLO, SLI initiative from Google has really ... there's a lot of flaws with it I think but generally speaking, it gives you a starting point to just say what is the impact of this service. What does it do and how do users interact with it? I think developers need to understand that because in a lot of these esoteric failure scenarios, engineers don't even know they're happening because they're like, "Well, all my stats look good." It's like, well, that's because you don't monitor the end user. The user still gets a 500, but yeah, you're right. You're still processing data. It doesn't matter.
This has led us internally to rethink what chaos engineering is at Datadog. We're moving from a world of game days to experiments. We want to be very explicit about what these experiments are. This is really taking the next step and saying, "What do we want to get out of this? What are we looking to measure? What are we looking to test," and then how can we do it safely? We have a hypothesis, a method, and a rollback. We've tried to codify what this means so that we can continue to push the envelope of what are we testing.
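The hypothesis, method, and rollback structure can be sketched in a few lines of Python. This is a hypothetical illustration, not Datadog's tooling and not the Chaos Toolkit API: the `Experiment` class, `run_experiment`, and the toy `state` dictionary are invented for the example. The point it shows is the ordering: check the steady state first, inject the failure, check again, and always roll back.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    title: str
    hypothesis: Callable[[], bool]  # True while the system is in steady state
    method: Callable[[], None]      # injects the failure
    rollback: Callable[[], None]    # undoes the failure


def run_experiment(exp: Experiment) -> str:
    # Do not inject failure into a system that is already unhealthy.
    if not exp.hypothesis():
        return "skipped"
    try:
        exp.method()
        result = "passed" if exp.hypothesis() else "failed"
    finally:
        # Rollback always runs, even if the method raises.
        exp.rollback()
    return result


# Toy system standing in for a real service (assumption for the example).
state = {"replicas": 3}


def kill_one_replica() -> None:
    state["replicas"] -= 1


def restore_replicas() -> None:
    state["replicas"] = 3


exp = Experiment(
    title="service stays healthy after losing one replica",
    hypothesis=lambda: state["replicas"] >= 2,
    method=kill_one_replica,
    rollback=restore_replicas,
)
outcome = run_experiment(exp)
print(exp.title, "->", outcome)
```

Making the rollback unconditional is the design choice that speaks to the "how can we do it safely" question: the experiment cleans up after itself whether the hypothesis held or not.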
Then to be able to also get a better feedback loop on whether we are making things worse. We once again looked at this list and we're like, "All right. How do we codify this?" We have these levels. For us, level zero is just can you kill stuff, can you actually run Chaos Monkey, can a node disappear, can a pod disappear, what happens? Surprisingly, as we layered on Kubernetes there's abstractions on top of abstractions on top of abstractions, and everything is way too fucking complicated and no one understands how anything fucking works anymore. Losing a pod can actually cause your cluster to explode because things try to reprovision and then it continues to explode and everything. There's a lot of interesting stuff that comes out of this, so it's still worth doing.
Then taking the next step, and for us this is the biggest right now. A lot of our ... as we expanded as quickly as we did and as the company continues to double in size year over year, not every service has timeouts, not every service has circuit breaking. There's just a lot of lack of diligence, is what I'd call it, in our services. I know that again, we're also dealing with duplicated state across machines, across data stores, especially custom-built data stores. We're trying to look at not only partial, but total loss of the network. What does that do? Network segregation, packet loss, latency. Then DNS, which is the fucking bane of our existence. We are testing the hell out of DNS nowadays. We have a really cool talk out there, "Kubernetes the very hard way." We have like nine slides in there of just all the ways in which DNS completely screws us. It's good. Go find it.
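For readers who have not met circuit breaking, the following is a minimal Python sketch of the pattern. It is a generic illustration rather than anything Datadog runs; the `CircuitBreaker` class, its thresholds, and the injectable `clock` are assumptions for the example. After a run of consecutive failures the breaker "opens" and fails fast instead of waiting on a dependency that keeps timing out, then lets a trial call through once a cool-down has passed.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call an unhealthy dependency."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("dependency marked unhealthy; failing fast")
            # Cool-down elapsed: allow one trial call (half-open state).
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result


# Demo with a fake clock so the example does not need to sleep.
now = [0.0]
breaker = CircuitBreaker(max_failures=2, reset_after=10.0, clock=lambda: now[0])


def broken_dependency():
    raise TimeoutError("upstream timed out")


outcomes = []
for _ in range(3):
    try:
        breaker.call(broken_dependency)
    except CircuitOpenError:
        outcomes.append("fast-fail")
    except TimeoutError:
        outcomes.append("timeout")

now[0] = 11.0  # move past the cool-down
outcomes.append(breaker.call(lambda: "ok"))
print(outcomes)
```

A latency or packet-loss experiment is a natural way to find the services that lack this kind of protection: without a timeout and a breaker, callers pile up behind the slow dependency instead of failing fast.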
Then we're taking the next step and we're looking at CPU and memory and disk and noisy neighbors. These are things that are less common, and so we didn't really prioritize them but we think that this is the next step for us in terms of things we should be testing for. This comes into performance and bin packing and noisy neighbors. We usually were our own worst neighbor, quite frankly. We are of a size where we're like we don't really share a lot of machines with any other customer, and realistically they're usually just all ours. This is the next kind of level for us.
Then finally, we're looking at the distributed systems. We're figuring out, all right, if we lose an AZ, if we lose a fucking region what happens? How do we do this? I'll be very honest with you, level three, AZs we kind of have but it's only because we're lucky. Loss of a region is still something that is actively being developed right now. The tests for it are going to be very complex and very scary. Don't jump into it. Don't even bother. Start small.
How do we do this? Unfortunately, this Python script, Lambda, not really going to cut it in this day and age, especially as we move to Kubernetes. There's actually this really interesting project called Chaos Toolkit. My SREs looked at it and they said, "We can build on top of this." It's a framework for building chaos engineering tools. We've invested very heavily in it, quite frankly. The big goal here is that especially, again, as our footprint continues to grow, as the company continues to expand, more teams, I can't keep up hiring SREs with the great growth of the company. It's just really fucking hard, especially in New York City. Everything we're doing is trying to strive towards self-service. It's no longer, "Hey, let's chase you for game days." It's, "Hey, we ask you do you have experiments on your project?" On a blast day, every Monday, like, "Hey. You need to add experiments. Hey, you should be testing this. Hey, this service should be level zero. You should be level one. What are you doing? Are you moving?" Then we can do reporting and data collection around that across the entire ecosystem.
This is the general workflow. We do use Spinnaker. Ignore that. It's fine. That doesn't matter. We created, with Chaos Toolkit, we had this chaos controller, which is a CRD on top of Kubernetes. We will probably open source this in the coming months. Basically you can trigger a job in Spinnaker that will reach a CRD and spin up a chaos pod. Then that chaos pod will either do network attacks or kill the pod or kill the node. It's fairly clever, honestly. In terms of what the end user needs to do, this is it. Say, "Hey, we're level zero. Here's my health check." They're off to the races. This is as bare bones as it can get. It can get more complicated than this but generally speaking, you can just clone something like this and run.
Then the nice thing is that because we own the CRD, because we own the pods, we can metric the hell out of everything. It took us two years to kill ten thousand nodes. This is just the last month we killed ... we've run four thousand experiments across the ecosystem. We don't run them on weekends. We're able to create some rules around this so that, hey, I'm not getting paged because we turned something off, because we were running some shit. Then the other thing here is we're able to do even more engineering, of hey, if we have an active incident going on stop running this fucking thing. Just feature flag something and stop running experiments. We do that as well. This gives us a chance to ... who are the teams we should be talking to? What's going on? This dashboard is actually much, much longer than this. There's a lot of data here, but it's kind of interesting. We're keeping an eye on what's failing. Obviously, there's a noisy neighbor here and we've got to go clean it up. There's a noisy one here but whatever. It's fine.
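The scheduling rules described here (no weekends, stop during an active incident, a feature flag to halt everything) amount to a small gate in front of every run. The Python sketch below shows one way such a gate could look; the function name and its parameters are hypothetical, and how the incident state and the flag are looked up in a real system is left out.

```python
from datetime import datetime


def may_run_experiment(now: datetime, incident_active: bool, kill_switch_on: bool) -> bool:
    """Return True only when it is safe to start a chaos experiment."""
    if kill_switch_on:
        # Feature flag: someone turned experiments off globally.
        return False
    if incident_active:
        # Never add failure on top of a real incident.
        return False
    if now.weekday() >= 5:
        # 5 = Saturday, 6 = Sunday: nobody should be paged for an experiment.
        return False
    return True


monday = datetime(2019, 8, 12, 10, 0)
saturday = datetime(2019, 8, 10, 10, 0)

print(may_run_experiment(monday, incident_active=False, kill_switch_on=False))
print(may_run_experiment(saturday, incident_active=False, kill_switch_on=False))
print(may_run_experiment(monday, incident_active=True, kill_switch_on=False))
```

Keeping the gate as a pure function of its inputs makes the rules easy to test and easy to extend, for example with business hours or per-team freezes.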
What's next for us? I think this is such an early stage. When we think about where Kubernetes is, where users are, the goal here is that this needs to be easier. Again, you can't rely on your SREs to be the ones to chase people for these. They don't fucking care. The SREs will get overloaded, it is demoralizing. You have to find a way to incentivize users to see the value in these things. Making them on call for their service is a really good start. Go do that. For us, I think we're striving towards that level three, level four. We're thinking, "All right. Once we get there then what?" It is probably not only multi regionally but what happens when you have 12 regions? Can you start to dynamically shift traffic on the fly? I never want to be woken up. That's the goal here, is I'm too fucking old and I'm too busy and I just want to sleep. Realistically, how can we continue to get these things in a state where things just auto heal?
The other thing I will stress here is if you're thinking about doing chaos engineering, if you're not already doing it, don't let your engineers be assholes. It's like really, really important. It's the easiest way for the rest of the organization to turn on you, for you to lose any kind of momentum here. This work is important, it does provide value, it does make things better. Don't go shoot a service in the head that you know is going to cause an outage and take three days to fix. That makes no fucking sense. At Netflix, I can remember our Cassandra team waited years before they were even willing to run Chaos Monkey, never mind Gorilla. There was just a whitelist and it was fine, and we would hound them all the time for it. In reality, we still left them on the whitelist and they weren't actually moving. In the event of an actual failure we probably still would have been down. Everything else would have been great. Yeah.
Then yeah, don't just hope that things will just work themselves out. Don't just hope that the Kubernetes community will save you, because they fucking won't. It's not good software. Don't depend on it. It's getting there, it's getting better, but it's the best we've got right now. Don't just hope that this stuff is not a problem for you. Pay attention to it. You might not be at a scale where you need to be doing it but you'll quickly get there. There's a lot of really cool tools out there that are just easy drop-ins, so keep an eye on it at the very least.
As always, we're hiring. If you want to work on this shit, come on here. It's fine. We have big, hard fucking problems. That's all I've got.
Audience: What's up, Corey?
Corey: What's up, dude?
Audience: Could you speak a little to some of the tools that are currently available, like kube-monkey, PowerfulSeal, what's your opinion of these?
Corey: A lot of the older tools, especially Chaos Monkey and some of the stuff coming out of Netflix, are deeply, deeply tied to Spinnaker, and so it's kind of hard to pull those apart, which is why we ended up building a lot of our own. I think the Chaos Toolkit is the one project out there that I've seen that is like aces and going in the right direction.
Audience: Is that tied to ...
Corey: It's not tied to Spinnaker. I think the work we're going to open source and some of the stuff we've done upstream will likely continue to make it fairly abstract. We're probably going to move away from Spinnaker in the very near future. We're making sure that the tools we are building and investing in are fairly lift and shift for us, because we don't want to be tied to this stuff. We're always keeping an eye on the community. If there's some really great work that just shows up out of the blue, we want to be able to use it right away. Yeah, Chaos Toolkit right now is super ace.
Audience: Oh, cool.
Audience: Hi, I actually listened to your previous talk about the Python script, to terminate using Lambda. I think in the headquarter, the New York Times. My first question is in terms of this chaos engineering culture, how does this play between your team with all the teams? I know that you mentioned that there are game days. How does the product team prioritize the stuff that you created, or the problems you created?
Corey: Sure. Again, we went through a fairly significant shift when I joined. I moved us to a you build it, you own it model. Engineers are on call for their services. When their services fucking break they are miserable, which is nice, because I don't like being woken up again. That helped quite a bit. People started being like, "Why is my service failing? What is going on? What happens when we inject latency?" Then they were more eager to come to us and schedule a game day so that we could at least start down that path. Now that we've made a lot of this tooling self-serve, they're already doing it. We are running reports internally, we keep an eye on who's doing what. We'll maybe proactively go out once a quarter and talk to a team and be like, "Hey, you launched a new service. We realize it's pretty brutal. Maybe we should do something."
It's always about the error budget, the SLO, the SLI. We use that a lot as well to dictate no more product work. Your service continuously is down and we can't get three nines or you can't even get two, in which case full stop, here's what we're going to do. Let's throw some testing at this and see what happens. Then the automation of the chaos engineering or the chaos tooling outside of that holds them true. It does regression testing by default, basically. It's pretty nice. Again, I work at an engineering company building an engineering product for engineers. I don't have to pull that many teeth here. People tend to be pretty active about these things and pretty proactive. Maybe my job is a little bit easier than everyone else's right now.
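The arithmetic behind "three nines" versus "two nines" is worth spelling out. The Python sketch below computes the error budget, in minutes of allowed downtime, for a given availability SLO over a window; the 30-day window and both function names are assumptions for illustration, and the freeze rule is a simplified reading of the policy described above.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of downtime allowed by an availability SLO over the window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo)


def product_work_allowed(downtime_minutes: float, slo: float, window_days: int = 30) -> bool:
    """Simplified policy: feature work continues only while budget remains."""
    return downtime_minutes < error_budget_minutes(slo, window_days)


three_nines = error_budget_minutes(0.999)
two_nines = error_budget_minutes(0.99)

print(f"99.9% over 30 days allows {three_nines:.1f} minutes of downtime")
print(f"99%   over 30 days allows {two_nines:.1f} minutes of downtime")
print(product_work_allowed(downtime_minutes=60, slo=0.999))
```

Under these assumptions a service that was down for an hour in the month has already spent its three-nines budget of about 43 minutes, which is the point at which the conversation shifts from product work to testing and resiliency.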
Audience: Okay. Actually, my second thing is about the stack. I was chatting with Gremlin about their take on the service movement like you know, service is getting really big. You're not like list service manager, so this technology Lambda stuff, right?
Corey: Mm-hmm (affirmative).
Audience: They don't actually have very concrete commercial solution for that kind of problem, scope, right now. In terms of your Chaos Toolkit, how does it play with the service stuff?
Corey: We don't use service mesh right now. We run dozens of Kubernetes clusters in production. We're using Cilium as a foundation to do cluster-to-cluster communication and transport encryption. We're looking at Istio pretty actively but we've also been talking to the HashiCorp folks. I think what they're doing with Consul right now is interesting. It's a little more simple and more easily usable by the engineers. It'll probably be Istio just because there's some stuff we would like in terms of visibility, and functionality, and circuit breaking, and whatnot. We're looking at them and we're evaluating them. We run a flat network across all of our clusters right now. It's fairly complicated and it's clusters of clusters. We have this meta cluster concept. Happy to connect you with the teams if you want to dive deeper there. It gets fairly in the weeds fairly quickly. Our network stack is a little insane. Generally, by running a flat network right now, we don't usually have to worry about this at that level just yet. When we do introduce the service mesh things are going to get more complicated. Like significantly more complicated.
Audience: Yeah, cool. Thank you.
Audience: Yeah. We using actually very recently we started onboarding Datadog. Yeah, amazing product really. I've been a big fan of Datadog for a number of years, for my work. It's like a financial firm. It's been quite a task to get Datadog onboarded. One of the questions I have, if you don't mind me asking, is how big is the SRE team?
Corey: At Datadog? 15 right now globally. We usually don't embed. In certain use cases we will. We'll go sit with a team and work with them, but I'm not building a NOC.
Audience: You guys are mostly on consulting basis for the other teams?
Corey: Yeah. My SRE team looks more like a software development team than it looks like a sysadmin team. Some of them have that background but in reality most of them are software developers. The goal here is to build tools that empower everyone else to do their job better, not to do it for them, not to be a crutch, certainly not to be a NOC.
Audience: Sure, sure, sure.
Corey: It's 2019. If I have to wake up and run a run book, we're already fucked. We've already blown through our error budget, we're already at two nines. It's like a nightmare. In reality, I just think that approach is not sustainable. I have this incredible monitoring tool and alerting tool that should let me auto correct and auto react in seconds versus, "Let's page you," and then I'm going to sit there and hit sleep for four minutes until I actually get up, then go get my laptop, then login, and figure out what's going on. It's fucking insane. We'll never hit the availability numbers. Or you staff around the world, follow the sun, and then someone is just sitting there staring at a graph 24/7. That's insane. We have a monitoring system. We have alerts. Let it do its thing. Script the hell out of everything. Again, it's 2019. It's not 2003 anymore. I'm a little opinionated about this. Sorry.
Audience: Thank you.
Corey: Yeah. Cool. All right.