I Don't Always Test My Code, But When I Do It's In Production
How using automated canary analysis can help increase development productivity, reduce risk and keep developers happy
At our March Meetup Isaac Mosquera, CTO at Armory.io, joined us to talk about canary deployments. He shared lessons learned, best practices for doing good canaries, and how to choose the right metrics for success/failure.
"The reason it's pretty hard is because it's a pretty stressful situation when you're deploying into production and you're looking at your dashboard, just like Homer is here. And he's trying to figure out what to do next. Does this metric look similar to this metric? Should I roll back? Should I continue moving forward? Is this okay? And more importantly, what's actually happening is a human being, when they're looking at these metrics in their Canary, they're biased, right?"
Watch his talk below to learn more about automated canaries. If you're interested in joining us at a future Meetup, you can sign up here.
Alright, so a little about me, besides being super nerdy. I have been doing start-ups for the last 10, maybe 15 years, and in doing so, I've been always taking care of the infrastructure, and all the other engineers get to have fun writing all the good application code. So a big part of just my background is infrastructure as code, writing deployment systems, all the things that you need to do in order to make code get into production, be in production, and be stable. That's a little about me.
I'm currently the CTO of Armory. So this is the latest start-up, and what we do is we deliver a software delivery platform, so that you can ship better software, faster. The important point here is it's done through safety and automation. It's not all about just velocity. It's about having seat belts on while you're driving really, really fast. And the core of this platform is called Spinnaker, which was open-sourced by Netflix in 2015. So by a show of hands, who's heard of Spinnaker? Alright, so roughly half. Alright, are you guys using it in production at all? Oh, zero. Okay. Well, hopefully I can get you interested in using it in production.
Spinnaker was open-sourced in 2015 by Netflix. It's by far the most popular, dominant open-source project by Netflix. Netflix has many open-source projects, but this by far is the one that's gained the most steam. It's got support from Google, Oracle, Microsoft, the big Cloud providers. It's also got support from little start-ups like Armory. And so we're all working to make a really hardened deployment software delivery system. It's being used in production by very, very large companies, medium-sized companies, and even some small ones. So you have an idea of the scope and reach and all the people who are involved in this project. It's pretty amazing to see where it came from 2015 to where it is today.
But what I'm going to talk about today is one small component of Armory, which is Canaries. It's all around testing in production. Who is familiar with Canarying? Alright, so roughly half aren't. The idea of Canarying comes from the canary in the coal mine, when coal miners would go into the mine, and there was a poisonous gas, they would die because they wouldn't know the poisonous gas was there. So instead of them dying, they decided to bring a canary down into the coal mine, because the canary would die first, which sucks for the canary but great if you're a human coal miner.
And so the same ideas apply to software delivery. So instead of deploying a hundred nodes into production to replace the old version that you have there before, instead what you're going to do is deploy one node into production, assess the situation, make sure that it doesn't fall over or report bad metrics, and if it's doing well, continue to scale out. And if not, kill the canary and roll back. And so what most people are doing today is that very process. They deploy one node into production. They scramble over to Datadog or New Relic, and they have some human being looking at an array of dashboards, and then the human being has to make an assessment to actually keep going forward with the deployment or roll back.
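The deploy, assess, then scale out or roll back loop described above can be sketched in a few lines of Python. This is only an illustration, not how Spinnaker or Armory implement it: the `Deployment` class is an in-memory stand-in, and `is_healthy`, `canary_rollout`, and the step sizes are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence


@dataclass
class Deployment:
    """Hypothetical in-memory stand-in for a real deployment target."""
    total_nodes: int
    new_version_nodes: int = 0
    log: List[str] = field(default_factory=list)

    def scale_new_version(self, count: int) -> None:
        # Never run the new version on more nodes than the cluster has.
        self.new_version_nodes = min(count, self.total_nodes)
        self.log.append(f"new version on {self.new_version_nodes} node(s)")

    def roll_back(self) -> None:
        self.new_version_nodes = 0
        self.log.append("rolled back")


def canary_rollout(
    deployment: Deployment,
    is_healthy: Callable[[Deployment], bool],
    steps: Sequence[int] = (1, 10, 50, 100),
) -> bool:
    """Scale out step by step; roll back on the first unhealthy check."""
    for count in steps:
        deployment.scale_new_version(count)
        if not is_healthy(deployment):
            deployment.roll_back()
            return False
    return True


if __name__ == "__main__":
    # Assumed health check for the demo: the new version breaks at 50 nodes.
    demo = Deployment(total_nodes=100)
    succeeded = canary_rollout(demo, lambda d: d.new_version_nodes < 50)
    print(succeeded, demo.log)
```

In the manual process the `is_healthy` step is a person staring at dashboards; the rest of the talk is about replacing that step with an automated judgment.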
But it's actually pretty hard to do really, really good Canaries. And you actually see it as we talk to more and more customers, Canaries, for them, is this pinnacle of a place that they want to get to, and it's pretty hard. The reason it's pretty hard is because it's a pretty stressful situation when you're deploying into production and you're looking at your dashboard, just like Homer is here. And he's trying to figure out what to do next. Does this metric look similar to this metric? Should I roll back? Should I continue moving forward? Is this okay? And more importantly, what's actually happening is a human being, when they're looking at these metrics in their Canary, they're biased, right? You got that product manager breathing down your back to get that feature into production. It's a month late. So even if they kind of look a little bit off, you're still going to push that button to go forward. You're still going to make mistakes like old Homer here.
So who here has kids? Yeah, so you guys are all familiar with this kind of crazy scene here, which is just kind of an out of control situation. And this is what a lot of deployments are like in general. So in order to do a nice, good, clean deployment, you need to have control of the situation and the environment. And when this is happening, there's no way to do a clean Canary, because when somebody deploys something and it's a mess and you're trying to compare metrics, and you're just unsure, where did that metric come from? Did we deploy it correctly? Did somebody change the config on the way out and it's not reproducible? You end up in this crazy situation. And most people, or most customers or most companies, their deployments, in fact, are just unstable. So the idea of doing a Canary on top of that, doing some sort of analysis, it becomes very, very untenable.
So why Canaries in the first place? Why test in production? Why not just have a bunch of unit tests and integration tests? And the reason why is that integration tests are actually incredibly hard to maintain, to create. They fail all the time, and they're very brittle. And engineers don't like writing tests in the first place anyway, so it's unlikely that you have that many of those. But these are easy, right? The unit tests provide you a lot of coverage. They're easy, and as you go up the stack here, the tests will become harder and harder and harder to do, and that's why you don't want to write as many of them. And you want to rely on other systems, like Feature Flagging and Canarying to help you out at this top part of the stack.
The other thing about an integration test, each individual integration test that you write only provides you incrementally more coverage. You're not going to get so much more coverage that an integration test is ripping through the code base all the way in and all the way back out. You can only do that so many times, right? So they're hard to create. They're hard to maintain. They're brittle. And like I said, I mean, we all know engineers don't like writing tests in the first place. So it's a constant battle between devops and engineers on the integration tests.
So then why not only just do Canaries? Well, the problem with only just doing Canaries and getting rid of the unit tests and integration tests and all the other tests, is that you start finding out that there's huge problems right before you get all the way into production, at this line right here. And that's a very costly place to find out that you have problems. So while Canarying will be helpful and useful, it's more of a safety net. It doesn't replace integration tests. It doesn't replace unit tests. You still have to figure out how to get those engineers to write integration tests, which is difficult to do. So it's a combination of all these tools that help you deploy safely into production.
So what makes a good Canary? So the first one is that it's fully automated. And again, it's got to be fully automated. The more manual process that you introduce into your Canary process, it doesn't add any safety. You just have more human beings, and we all know that human beings are error-prone. Literally, as engineers, we build software to automate other human beings out of work, but we won't do that to ourselves. So we constantly continue to put our human assessment into the situation, into the release process, and that's pretty painful.
And in fact, actually, a story of a customer that I had, where they had a manual Canary release process, they put out a node, they started putting out more nodes, and this engineer got hungry, so she went to lunch. And she went to lunch for about an hour and a half, and in that time, she took down a very, very large Fortune 100 company site, losing millions of dollars. Again, there's no rollback mechanism. It was a fully manual system, and so there's no point in having the Canary if there's no way to actually automate the process.
The third through the fifth part are about the deployment system itself. It needs to be reproducible, meaning if you run the same Canary twice, you should have the same result, roughly. Obviously, because you're testing in production and it's live traffic, it won't exactly be the same. But you generally want it to be roughly about the same. It needs to be transparent. You need to know what's happening inside of the Canary, like why is the Canary process deciding to move forward or roll back? Because if the Canary system makes a mistake, the engineer needs to know why it made that mistake so that it can be corrected. And the last thing is reliable, right? There's a lot of homegrown systems out there in terms of deployments and software delivery, and what ends up happening, if it's not reliable, is you have that one software engineer that's gone rogue and is going to start building their own Canary system on the side. And now you end up with three broken Canary systems instead of just one. So this is what makes a good Canary system.
So what are some of the lessons we've learned doing Canaries? The first approach is like, let's apply machine learning to everything, because that's just the answer nowadays I see. But actually, simple statistics just works. And this actually comes back to the previous statement that I was making about it being transparent. The moment you apply machine learning, it turns a bit into a black box, and that black box can vary by degrees depending on who made it, how it was created. And it actually doesn't really get you much. So what we instead learned was that using simple statistics, we apply now a Mann-Whitney U test to the time series data that we get from the servers, from the metrics, so that we can actually make a decision as to whether to move forward or back. And this is much easier for someone to understand and comprehend when the Canary fails. If it's a black box and I don't know what to do and I don't know why it broke or what metric was actually failing, it's of no use to me as an engineer who wants to just get my application code out.
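To make the Mann-Whitney U idea concrete, here is a minimal standard-library sketch that compares one metric's samples from a baseline against the same metric from a canary. The talk does not describe Armory's actual implementation, so everything here is an assumption for illustration: the function names, the normal approximation with tie correction for the p-value, and the 0.05 cutoff. In practice a library routine such as SciPy's `scipy.stats.mannwhitneyu` would be the more usual choice.

```python
import math
from typing import Sequence, Tuple


def mann_whitney_u(
    baseline: Sequence[float], canary: Sequence[float]
) -> Tuple[float, float]:
    """Return (U statistic for baseline, two-sided p-value).

    The p-value uses the normal approximation with a tie correction,
    which is reasonable for the sample sizes a metrics window produces.
    """
    n1, n2 = len(baseline), len(canary)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    n = n1 + n2
    combined = sorted(
        [(value, 0) for value in baseline] + [(value, 1) for value in canary],
        key=lambda pair: pair[0],
    )
    rank_sum_baseline = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and combined[j][0] == combined[i][0]:
            j += 1
        # Positions i..j-1 hold equal values; they share the average rank.
        average_rank = (i + 1 + j) / 2
        group_size = j - i
        tie_term += group_size**3 - group_size
        in_baseline = sum(1 for k in range(i, j) if combined[k][1] == 0)
        rank_sum_baseline += average_rank * in_baseline
        i = j
    u = rank_sum_baseline - n1 * (n1 + 1) / 2
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance <= 0:
        return u, 1.0  # every value is identical, so no evidence of a shift
    z = (u - n1 * n2 / 2) / math.sqrt(variance)
    return u, math.erfc(abs(z) / math.sqrt(2))


def metric_differs(
    baseline: Sequence[float], canary: Sequence[float], alpha: float = 0.05
) -> bool:
    """True when the two samples look like they come from different distributions."""
    _, p_value = mann_whitney_u(baseline, canary)
    return p_value < alpha
```

The appeal for transparency is that the result is a single statistic and p-value per metric, so when a canary fails an engineer can see which metric shifted rather than trusting an opaque model.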
So another lesson we learned is that this is what we see normally when people do manual Canaries. And I think it makes sense when you do a manual Canary, because it's simple to orchestrate, you have your existing production cluster, you roll out your one node in the Canary cluster, and then, again, you scramble to Datadog or New Relic or whatever metric stat store you're using to just start comparing some of these metrics. But this is like comparing apples to apples, except one of them is actually a green apple. And it's like slightly different, and you don't really know why. And it just doesn't really make sense. But they look kind of the same, so screw it, let's just go to production, right? And you wonder why the Canary failed. But again, you can't have a human making decisions about numbers and expect it to be rational.
So instead, what we do is we grab a node from the ... Well, actually we don't grab the node. We grab the image from the existing production cluster. We replicate it into its own cluster called the baseline cluster, and then we deploy the Canary. Because now we can compare one node to one node. They are running different code, which is the whole point, but we're comparing the same size and metrics. Like if this side is over here ripping through 100,000 requests per second, there's no way that this is going to be able to be compared to that with simple statistics. So this is what makes also Canarying harder, is that you have to be able to orchestrate an event like this, and that isn't trivial to do if you don't have sophisticated software delivery.
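The baseline-plus-canary layout described above can be sketched as a small planning function. This is an illustrative assumption, not Spinnaker's API: `Cluster`, `plan_canary_clusters`, the cluster naming scheme, and the image identifiers in the example are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Cluster:
    name: str
    image: str
    size: int


def plan_canary_clusters(
    production: Cluster, candidate_image: str, size: int = 1
) -> Dict[str, Cluster]:
    """Plan a baseline and a canary cluster of equal size.

    The baseline reuses the image production is already running, so the
    only intended difference between the two clusters is the code.
    """
    baseline = Cluster(f"{production.name}-baseline", production.image, size)
    canary = Cluster(f"{production.name}-canary", candidate_image, size)
    return {"baseline": baseline, "canary": canary}


if __name__ == "__main__":
    # Hypothetical image identifiers, for illustration only.
    production = Cluster("checkout", image="image-v41", size=100)
    for role, cluster in plan_canary_clusters(production, "image-v42").items():
        print(role, cluster)
```

Because both clusters have the same size and receive comparable traffic, their metrics can be compared directly instead of comparing one new node against a hundred-node production cluster.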
So other lessons we learned is that a blue green deployment is actually good enough for smaller services. You don't need to Canary everything. The blue green is good enough. You release it. If it fails, and alarms start going off, you roll it back. The impact is actually low. And the thing I didn't mention is all that we're doing here is trying to figure out how to reduce overall risk to our customers and to the company, right? And sometimes blue green is enough to just actually reduce enough risk if you have that out of the box.
The other thing that we've found is people like to do Canaries without enough data coming out of their application. If there is no data coming out of an application into Datadog or whatever metric store you're using, you can't use that information in order to have a good Canary. It's just impossible. I can't make data up. And then the last one is choosing the right metrics is important. Each application that our company has has a very different profile. What it means to succeed or fail is very, very different whether it's a front-end check-out application versus a back-end photo processing application. You know, you might want to look at revenue. Revenue might be a great metric to look at for the front-end customer check-out application, but it has no meaning to the image processing application. So making sure you understand what it means to succeed or fail inside of your application is really important, and it's very surprising to see how much people don't realize what it means, and they only learn through failed Canaries to understand what success or failure looks like.
You can do this with Armory. So this is what a pipeline looks like with Armory. I think these might ... Yeah. So this is what a pipeline looks like with Armory and Spinnaker. Is everyone familiar with the term baking? No? Okay, so you guys are familiar with the term AMI, an image? So the idea of an image is just an immutable box in which you can put files and all your code, and you take a snapshot of your computer at that one time, and it becomes an image that you use to push into production. That term has been coined, I think by Netflix, bake. So it's called baking an image. I have no idea why it's called that actually, now that I'm thinking about it.
And then the next step, when we run a Canary, and what the Canary step will do, we'll actually send out ... We'll grab that node from production, take the image from production, create a new cluster, and then push out the Canary as well, the new change set, and then run the statistical analysis against it for as long as you configure it. And I'll show you what that looks like next. You can also see that we also create a dashboard in whatever metric store that you're using, so that again, back to the transparency, if this thing were to fail, you're going to need to know what metrics were failing. So we automatically create a dashboard, so you have that transparency.
The reliability here actually comes from open source, the fact that there's 60, maybe even 70 developers now just working on this. You've got the world's best Cloud engineers building this software, so the reliability comes in the fact that a ton of engineers are working on it. They're not just going to leave to go to another company in another six months for higher pay. We're working on this, and it's going to be around for a very long time.
And so those are the properties that you see right there. This is what it looks like to have a failed deployment. So again, back to each application, we give it a score. An 83 might be a good score for a lot of applications. For this one though, for whomever configured it, decided that an 83 is a failing score. And so what a failing score will do will actually destroy the Canary infrastructure. So it will destroy the baseline and the Canary, and then that should return production back into the state that it was at automatically for you. So you don't have to do anything. If you get hungry and you go to lunch, you can trust that this thing will take care of it for you.
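A small sketch of the score-and-teardown behaviour described above. The talk does not say how the score is computed, so the formula here (percentage of metrics judged healthy) is an assumption, as are the names `canary_score` and `conclude_canary`, the metric names, and the threshold values.

```python
from typing import Callable, Dict, Sequence


def canary_score(metric_results: Dict[str, bool]) -> float:
    """Score as the percentage of metrics judged healthy (assumed formula)."""
    if not metric_results:
        raise ValueError("at least one metric is required")
    return 100.0 * sum(metric_results.values()) / len(metric_results)


def conclude_canary(
    score: float,
    pass_threshold: float,
    destroy: Callable[[str], None],
    clusters: Sequence[str] = ("baseline", "canary"),
) -> bool:
    """Return True on a pass; on a failure destroy the temporary clusters."""
    if score >= pass_threshold:
        return True
    for name in clusters:
        destroy(name)  # production itself is left untouched
    return False


if __name__ == "__main__":
    # Hypothetical per-metric verdicts: five healthy, one not.
    results = {
        "latency": True, "errors": True, "cpu": True,
        "memory": True, "throughput": True, "gc_pause": False,
    }
    score = canary_score(results)
    passed = conclude_canary(score, pass_threshold=90.0, destroy=print)
    print(round(score), passed)
```

With five of six metrics healthy the score rounds to 83, which fails against a threshold of 90 but would pass against a threshold of 80, matching the point that the same score can be good for one application and failing for another.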
So this is what it looks like to config a Canary. So you can choose the data provider. We allow you to config where you look for the metrics and alarms, what's going to set this Canary to be good or bad, how you want to do the metrics comparison, how long you want to do an analysis period. For instance, there's some applications that you only want to do a one hour or two hour Canary, and that might be okay to reduce risk, to get an understanding of how the application is going to behave. There's other applications that are mission critical, and you may want it to run for 48 hours, right? So every application will have a different property and will have to be configured differently.
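The configuration choices listed above could be captured in a structure like the following. This is a hypothetical shape for illustration only; the class name, field names, provider string, and metric names are assumptions and are not Armory's or Spinnaker's actual configuration schema.

```python
from dataclasses import dataclass, field
from datetime import timedelta
from typing import List


@dataclass
class CanaryConfig:
    """Hypothetical canary configuration; not a real product schema."""
    data_provider: str            # which metric store to query
    metrics: List[str]            # metrics to compare between baseline and canary
    analysis_period: timedelta    # how long to run the analysis
    alarms: List[str] = field(default_factory=list)  # alarms that fail the canary outright

    def __post_init__(self) -> None:
        if not self.metrics:
            raise ValueError("a canary needs at least one metric to compare")
        if self.analysis_period <= timedelta(0):
            raise ValueError("analysis_period must be positive")


# A lower-risk service might only need a short analysis window...
photo_processing = CanaryConfig(
    data_provider="datadog",
    metrics=["jobs_failed", "job_duration"],
    analysis_period=timedelta(hours=1),
)

# ...while a mission-critical one might run for two days.
checkout = CanaryConfig(
    data_provider="datadog",
    metrics=["error_rate", "latency", "revenue"],
    analysis_period=timedelta(hours=48),
    alarms=["checkout-5xx"],
)
```

Validating in `__post_init__` reflects the earlier point that a canary without metrics has nothing to judge.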
And then the metrics that are associated with the Canary, like what do you want us to be looking at. As you start querying, as you're running in production, we'll start looking at these metrics. Interestingly enough, we used to have a deviation. That actually no longer exists. We automatically can detect now if you're falling out of bounds. You don't need to tell us what a good score is anymore. We just apply more statistics to the time series we get. And you don't have to apply deviation or any thresholds. We kind of automatically figure that out for you. So the engineer doesn't really want to be doing this type of work. They want to get back to writing application code.
So the thing that we get a lot is why Canaries vs Feature Flags, and it's not a mutually exclusive thing. They're actually used together. And in fact, we actually use Feature Flags all the time with our Canary product, and the way that we use it is whenever we build a feature, we put it behind a Feature Flag immediately, and we're constantly deploying to production behind this Feature Flag. Everything that we do is under continuous delivery. We're a continuous delivery company, so we obviously should practice what we preach. So we use a Feature Flag, and then we start iterating. The Canary helps as other commits are getting pushed into production, that you're not just affecting anybody blindly. You're not actually making mistakes. And then especially on this last part. It's really great for code changes that affect large, large portions of the code base, like a database driver change. I don't know how many times I've seen deployments fail because of somebody just switching a Flag, and next thing you know, production is out.
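As a minimal illustration of shipping code behind a Feature Flag, the sketch below keeps the new code path deployed but dark until the flag is switched on. The flag name and both render functions are hypothetical, and a real system would read flags from a flag service rather than a dictionary.

```python
from typing import Dict, Optional

# Hypothetical flag store; off by default so deploys do not change behavior.
FLAGS: Dict[str, bool] = {"new_checkout": False}


def render_checkout_v1() -> str:
    return "checkout-v1"


def render_checkout_v2() -> str:
    # New, still-iterating code path that ships with every deployment.
    return "checkout-v2"


def render_checkout(flags: Optional[Dict[str, bool]] = None) -> str:
    """Serve the new path only when its flag is on."""
    active = FLAGS if flags is None else flags
    if active.get("new_checkout", False):
        return render_checkout_v2()
    return render_checkout_v1()
```

Each deployment still carries the v2 code into production, so the canary checks that the release as a whole behaves, while the flag controls who actually sees the feature.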
So provide info for every release. If you're doing a Canary, it's giving you a good or a bad for every single release. Spans typically a single deployment. You're not typically going for multiple deployments, although you can do that with Canaries. And then Feature Flags are great for user visible features. It kind of lends itself to the product team, where a Canary is more of an engineering activity. There's not a product manager that's going to do a Canary, at least not that I've seen. And then Feature Flags are great for testing against cohorts and more sophisticated analysis. Canaries aren't always that sophisticated to be able to test against users. A Canary doesn't know necessarily what a user is in the first place. Alright, so that's it.