Doing Deployments at Scale

Michael McKay, IBM

This session will talk about how the IBM Kubernetes Service builds and deploys microservices to Kubernetes clusters across the globe. Using standard tools and services such as GitHub, Travis CI, Kubernetes, and LaunchDarkly, IBM is able to deploy hundreds of code changes daily to thousands of clusters worldwide. By changing our development culture, using feature flags to control our deployments, and bringing our environmental configuration under control, see how we transformed our deployment pipeline from a slow monolith to a fast and agile group of microservices.

Michael McKay

Michael is an Executive Director at NCR. His job is to drive modernization and culture transformation as the SRE lead for the Commerce Platform. Prior to that, he was with IBM for 25 years. Michael is married with 4 kids and 3 cats. Needless to say, he comes to work for peace and relaxation.

Heidi Waterhouse: Michael McKay has come to us from IBM and he's here to talk about how IBM Cloud is doing their transformation and how they're going to make it work for them. And I think it's really interesting to listen to people who are dealing with a large organization and a small group, and how they get that to work together. So Michael is here from Raleigh ... yeah, from Raleigh, and his plus-one was his daughter, which I think was really cool. Bring your kids to more tech conferences; they're super bored, but that's okay, it's good for them. So thank you for joining us, Michael. Give him a round of applause.

Michael McKay: Thank you. All right, so it could take me a second to get this technical stuff set up here, which means plugging in my laptop. Getting this thing up. Is this supposed to be in here?

Heidi: Yeah.

Michael McKay: Yep. I am ready. All right. My name is Mike McKay. As Heidi said, I'm from Raleigh, North Carolina. And the only recommendation I have for this conference next year is, let's have it a couple months earlier so I can escape the North Carolina weather, because we had some really nasty weather a few weeks ago. So what we're going to talk about today is how we build and deploy our IBM Kubernetes Service and how we transformed from our old practices of building this big enterprise monolith to a new microservice-based architecture where we are rolling out continuous changes daily. And I'm going to try to summarize this in six easy steps.

Sorry, I'm trying to learn the clicker here as well. All right, so a little bit about myself. The only bullet points you need to know are, one, I've been with IBM for a really, really, really long time, it seems. Honestly, I didn't realize how long it was until I looked on LinkedIn and it told me I had been here for 22 years last year, and today or next month is my 23rd anniversary at IBM. The last bullet point is I've got four kids, one dog and two cats. Because of that, any problems we have at work, no matter how big the challenges or what kind of issues we run into, it's never going to be as bad as a 12-hour trip to Disney with four kids in the car. So that's one of the reasons I wanted to make that point.

So what is the IBM Kubernetes Service? Obviously it's ... we're hosted Kubernetes. I'm sure most folks in this room are familiar with what Kubernetes is. We started off about five years ago now; back then we were doing mostly containers, and then that technology transformed, we adopted Kubernetes, and we've been running with that since. We're part of a much larger organization called IBM Cloud. IBM Cloud brings in all the other aspects of IBM: you've all heard about Watson, about big data and analytics, many different services for IBM Cloud, and we're happy to be a part of that.

The internal name of our project, the IBM Kubernetes Service, is Armada. So one of the things you'll understand about Kubernetes is everything has this nautical-based naming structure. And so when we come up with new names for components, we have this wiki page we go to that literally has like 1,300 nautical terms, and we just go through that list until we find something that sounds right. And that's how we ended up ... I don't know if it's how we ended up with Armada, but that's definitely how we ended up with our internal project called Razee. And I'm actually pronouncing it wrong; this is probably the American way of saying it, and "Razee" I guess is the actual pronunciation. But Razee just sounds cooler to us and it rolls off the tongue more easily.

In addition to just hosting Kubernetes, we also do private Docker repositories, and we also do image vulnerability scanning and a few other services to help build and maintain your Kubernetes environments. And the last thing I wanted to mention is that Razee itself is my squad within the organization, and we're responsible for the transformation of our tribe overall, the whole tribe that delivers the Kubernetes service. And so we build tooling, and we help set standards and processes for how we build and deploy all of our services in the IBM Kubernetes Service.

Growing Armada. When I first started in the group, probably about two years ago, we had one data center in Dallas. Because we had one data center in Dallas, things were nice and easy. We pushed the code out to development, we tested it in development, we pushed it out to stage, everything worked well on stage, and we pushed it out to one production environment, and that was great. So we had built what other folks build: we had a bunch of Jenkins jobs, and we were really proud of the fact that we had built this really big product, almost a product unto itself, just to build and deploy our code and roll it out into production. So again, we started off with one data center in Dallas, and we have since grown; now I think we've got 35 data centers we're running in. So we have 35 production data centers that we run all of our microservices in, and that's spread across six regions in the world.

The architecture of Armada. Each one of these regions is built on kind of a hub-and-spoke model. We first deploy a hub; a hub contains all of our microservices, and there are probably around 25 or so microservices per hub. And each hub can support up to a certain number of managed Kubernetes clusters. Once we've reached capacity on that hub, we'll add additional spokes, and that's how we scale out each of our regions. So we support two different types of clusters. We support what we call cruisers, and cruisers are... if anyone goes to the IBM Cloud and says hey, here's my credit card, give me a cluster, we'll spin you up a cruiser. With cruisers you configure them however you want: you can have bare metal machines behind it, you can have virtual machines, you can have lots of machines, not so many machines, big, small, you get it. We also have patrols. Patrols are like, hey, I want to try this thing out, give me a free cluster. And so we actually support running our free clusters in a separate environment we call patrols.

You'd think pushing a button would be much easier than this. All right, so some of the challenges we have. We have 60 clusters just on our control plane alone. So we have not only our production environments, but we also have development environments, we have some staging environments, and we have some pre-staging environments. Altogether we have around 60 clusters, and across those 60 control plane clusters we have a little over 1,000 or so different software deployments. These are different pieces of software we need to track: we need to understand which versions are running there, we need to know when they're updated, and we need to know when there are issues around those particular deployments.

In addition, we've got tens of thousands of managed clusters. So every time any one of you goes to IBM and says hey, give me a cluster, we give you a cluster, but on that cluster is a small bit of IBM code that runs just to help us manage it, collect some information about it, and do things like ingress and backups. So these are the kinds of challenges that we have, and on top of that, just like every other project at IBM it seems, we've got our team spread across the world. We've got some folks in Raleigh, we've got folks in Rochester, in Austin, in Bangalore, and honestly I'm probably leaving out a few places, but you get the picture. So we have this really big challenge of managing lots and lots of deployments and clusters and environments spread across multiple teams.

So the first step of our six-step process is, first of all, admitting we have a problem. Luckily for us this wasn't too bad, because our deployments were taking about three hours and we were only deploying once or twice a week, if those were successful. In addition, we typically spent more time fixing the deployments than we did actually doing deployments. What would usually occur is that on Tuesday, Jake or Jeff would do the deployments, and then the whole group would probably spend the next three or four days fixing all the problems that happened with those deployments. So to come up with step one we had to admit we had a problem, but like I said, this was easy because everyone realized we had a problem, and everyone also realized that we would not be able to scale to where we're at today until we fixed these issues.

So admitting we had a problem was step one, and that was an easy fix. The next thing we had to do was fix the culture, and I know Adrian actually mentioned some of these things in his talk, but this was definitely a big problem that we had. A lot of our developers, actually most of the team, had spent their entire careers building packaged software, literally building stuff that we would build and test over a three-to-six-month or even year-long period. And those habits and mentalities that we'd built up around product development really had to be broken. So this is the first change that we had to make in the culture.

The second thing is that we're all engineers, and we love to build stuff. The bigger the better; the more complex, the cooler it was. So that was another part of the culture we had to fix. So as part of that, when we started our transformation, we identified what we wanted the culture to be like. The first thing is we want visibility and transparency for all. What this meant is, you know, no more private GitHub repositories, and that's not just our own tribe, that's an IBM-wide change. So we have GitHub Enterprise now in IBM, and we really preach social coding and openness for everyone to see each other's progress. Also on visibility, one thing we realized is that we didn't have enough information on what's going on in our environment to make rational decisions. And I'll get to how we fixed the visibility issue in a bit.

The next one here is minimizing friction. And actually I got this from my old boss, who has actually since moved on, but it was a case where we tend to put more pieces in place to help slow things down rather than to speed things up. Our previous pipeline was based on Jenkins, and again we were very proud of this. We had all these checks in place; the build would take three hours, a deployment would take another three hours, and we were really proud of the fact that we could catch all these errors and all these issues before we actually deployed. But what happened is that even though we deployed and all these things had hopefully passed their tests, we'd still have outages, we'd still see breakages. So we moved away from a model of trying to slow things down to a model where we actually want to speed things up.

So we really focused on how we can deliver quickly. We want to deliver in minutes: not hours, not weeks, not months, but minutes. And so we made a bunch of changes to how we build our products and how we think about what the IBM Kubernetes Service is. Instead of thinking of it as one monolith that we deploy all at once, we now think of it as a service composed of 25 to 30 or so separate microservices. And part of that, and I'm actually skipping around in order here, is decentralization of the code itself, and not only the code but the teams. When I first joined, we treated Armada just like one big thing. We actually even versioned it; we'd say hey, here's Armada version 25.1, and that's how we'd build it. We'd roll it into development, people would test in development, we'd roll it up to stage, and eventually it would make it to production.

So now we've got production running at 25.1 and along comes the UI team. They say, hey, I want to make this button blue, not green. We've got to go through the whole deployment process of building and deploying everything just to make that one small change, so that was not very efficient. In addition, each one of the teams' sole responsibility was coding their piece: coding the UI, coding the API, coding the billing code. And once they were done, they basically handed it over the fence, and we had a separate team who was in charge of building and deploying that thing. And the problem with that is that the teams who were building the code had no idea what their stuff looked like running in production.

So one of the big cultural shifts we made was to redistribute the roles and responsibilities of the team. Now every squad is responsible for not only building the code and testing it, they're also responsible for deploying it and running it. So again, as I think Adrian pointed out, when developers learn they're going to get called in the middle of the night, they tend to write better code. They also tend to be much more wary about when they decide to push code and what kind of code they push out. The past model of having all these checks in place led to teams getting a little bit lax about the type of code they'd check in. They would check in stuff thinking, okay, I'm going to get this little piece of code in, it will run in stage for a while, and before it goes to production I'll have time to make the additional fixes. Now, since each team is rolling out to production directly, on their own and much more rapidly, we tend to have much less of that development mentality. Now they realize that anytime they make a change, anytime they commit to master, that code most likely will go to production.

An additional thing is standardization. So even though we're saying we want the teams to be more autonomous and do things on their own, we do throw some standards in there. For example, we require all of our images to be built and tagged with a GitHub commit id. We also require our teams to use Travis CI for building. So now it's not the case that one team is using Jenkins, another team is using Travis, and another team is using CircleCI, or whatever it may be. We really standardized on some of the tooling and some of the best practices that we've learned. And finally, simplicity: let's just keep this as simple as possible. In our previous processes we thought the more automation the better, which is good up to a point, but sometimes it's better for a developer to just go and actually click a button to do a deployment rather than having the thing done automatically whenever it feels the need to do it. So we've tried to keep simplicity definitely part of our equation here. And by the way, simplicity is an ongoing battle. It's amazing how developers will just continuously drive more complexity without being reminded to try to keep it simple.
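
To make that commit-id standard concrete, here is a hedged sketch of what such a Travis CI build step could look like. The image name, registry URL, and helper function are assumptions for illustration, not IBM's actual setup.

```python
# Hypothetical CI step: tag and push the service image with the Git commit id,
# so every running container can be traced back to the exact source revision.
import os
import subprocess

def build_and_push_image() -> str:
    # Travis CI exposes the commit SHA of the build in TRAVIS_COMMIT
    # (assumption: fall back to "dev" for local runs).
    commit = os.environ.get("TRAVIS_COMMIT", "dev")[:12]
    image = f"registry.example.com/armada/example-service:{commit}"

    subprocess.run(["docker", "build", "-t", image, "."], check=True)
    subprocess.run(["docker", "push", image], check=True)
    return image

if __name__ == "__main__":
    print("pushed", build_and_push_image())
```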

So step three of our process was actually understanding what was running. Keep in mind, this is back when we only had one data center in Dallas, and we thought we knew what was running, because Jenkins said hey, we deployed version 25, so we figured we had version 25 of Armada API running in Dallas. Lo and behold, that wasn't the case, because there's a big difference between what your CD process says it deployed and what's actually running in the environment. The Jenkins job may have failed, so you may not be running the version you thought was just deployed, or a developer may have gone and monkey-patched a new version directly into the environment.

This was three or four years ago, and the net result is we really had no idea what was running in the environment. So to fix this issue we developed a cool dashboard we call RazeeDash. What this is, is an agent that we run in every single one of our Kubernetes clusters, and its job is to report on what's running on the cluster. So now we know exactly what's running on every single cluster in our environment, and I think, per some of the screenshots here, we've got about 700,000 or so Kubernetes resources that we track. Not only do we track what's currently running, but we can also track history. So we can have a detailed list, for every microservice, of when it was updated and the full history of updates. We can look and see, hey, our Armada API was updated six days ago, and before that it was updated eight days before that, two days before that, five hours before that.
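
As an illustration of the kind of agent McKay describes, here is a minimal sketch of reporting a cluster's Deployments back to a central dashboard, using the official Kubernetes Python client. The dashboard URL and payload shape are assumptions, not RazeeDash's actual API.

```python
# Hypothetical inventory agent: list the Deployments on this cluster and report
# their images to a dashboard so operators and auditors can see what is running.
import requests
from kubernetes import client, config

def report_whats_running(cluster_name: str, dash_url: str) -> None:
    config.load_incluster_config()   # use config.load_kube_config() outside a cluster
    apps = client.AppsV1Api()

    resources = []
    for dep in apps.list_deployment_for_all_namespaces().items:
        resources.append({
            "namespace": dep.metadata.namespace,
            "name": dep.metadata.name,
            "images": [c.image for c in dep.spec.template.spec.containers],
        })

    # Ship the inventory to the dashboard (endpoint shape is an assumption).
    requests.post(f"{dash_url}/clusters/{cluster_name}/resources",
                  json=resources, timeout=10)
```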

And it's really cool not only to understand how teams are deploying and how often they deploy, but also from an audit perspective it's been awesome, because an auditor comes to us and says, hey, show me all the deployments you had from January 15th to February 15th, and in our tool we can quickly pull up exactly what was changed on every single one of our production environments within a few seconds. So needless to say, our SRE team has actually been very happy with RazeeDash. In addition, we can use RazeeDash to help developers understand how their code is being rolled out. We'll get to this in a second, because now, instead of going to one production cluster in Dallas, we're going to, I've lost count here, overall we've got about 60 different clusters that we have to deploy just our production code out to.

In addition, we do have pieces of code that run across all of our customers' clusters as well. So now we need to roll out code to tens of thousands of clusters and track which versions are running where. One of the metrics that we track is drift: how many different versions of a particular microservice do you have? And we really want to try to minimize that. Early on, when we had a dozen or so different clusters that we had our microservices running on, we would have three or four different versions. Now, with RazeeDash and the visibility that it puts in place, we can help make sure that does not become an issue, and we typically run the same version at any one time across all of our microservices.
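
As a small illustration of the drift metric, here is a sketch that counts how many distinct versions of each microservice are running across clusters. The input shape mirrors the kind of inventory reports described above, but it is an assumption.

```python
# Drift: for each microservice, how many distinct image tags are running
# across all clusters? Anything greater than 1 is drift worth investigating.
from collections import defaultdict

def version_drift(cluster_reports: dict) -> dict:
    # cluster_reports: {cluster_name: {service_name: image_tag}}
    versions = defaultdict(set)
    for services in cluster_reports.values():
        for service, tag in services.items():
            versions[service].add(tag)
    return {service: len(tags) for service, tags in versions.items()}

reports = {
    "prod-us-south-hub1": {"armada-api": "a1b2c3", "armada-ui": "f00d42"},
    "prod-eu-de-hub1":    {"armada-api": "a1b2c3", "armada-ui": "9e8d7c"},
}
print(version_drift(reports))   # {'armada-api': 1, 'armada-ui': 2} -> the UI has drift
```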

So the next thing is how do we actually build and deploy?

So like I said, when we first started this thing we were doing our build and deployment once, maybe twice a week, and again we'd spend most of the time fixing our deployments. So we really wanted to re-engineer and rethink how we do builds and deploys. The first thing we answered is that CI and CD, for us, are two completely different things. Our old process really tried to combine them: we had these big fancy pipelines that said here's the build, here's the test, here's the deploy to dev, here's the deploy to stage, and here's the deploy to production. We realize now that when you have thousands of production environments, you can't have a spreadsheet with 10,000 columns across it denoting which versions are running and the state of all your different production environments.

So that's the first thing we realized: CI and CD are two completely different things. We've also moved away from a push model to a pull model. What I mean by that is our old build process was kind of like what most people come up with: they have a set of Jenkins jobs, and the Jenkins job will take the code, connect to the target environment, and deploy the code to that target environment. Now the problem with that is it doesn't scale very well. We had a hard time even handling the small number of environments we had when we only had one data center, let alone the 40 or so data centers that we have now. Moving to a pull model means that now every cluster operates independently and can pull down and apply its own updates.

So the cool thing about this is that now we can do deployments in minutes rather than hours. Theoretically we can deploy code to 10,000 clusters within 60 seconds. I say theoretically because we've actually done that before, and could versus should are two very different things: we tend to get overwhelmed with other things when we push 10,000 different deployments out to all of our clusters at once. But the moral of the story is that we've really had to switch from a push-based to a pull-based model for doing our deployments.

And then finally, we've really focused on doing rapid, incremental builds. This is again something Adrian touched on in his talk as well. No longer can we roll out changes where we're changing tens of thousands of lines of code at once; we really want to focus on small incremental code changes to minimize the impact on the environment, to help understand what broke, and to roll those changes back if needed. That said, if we do break something, our first focus is typically to roll forward with a fix rather than rolling back code.

So this is what our CI process looks like, and it really doesn't look any different from the usual, other than the fact that we've got our friends from LaunchDarkly in the picture here. For CI we've got code in GitHub, and we do the standard PR-based model with approvals. Once code is merged into master, Travis CI will build the code. It will then upload our images to our image repositories, we upload our Kubernetes deployment artifacts to cloud storage, and finally we update a feature flag in LaunchDarkly saying that, hey, this new variant is now available to be deployed in the environment.
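
To make that last CI step concrete, here is a hedged sketch of registering a new build as a variation on a microservice's flag through LaunchDarkly's REST API. The project key, flag key convention, and exact JSON Patch path are assumptions, not IBM's actual configuration.

```python
# Hypothetical final CI step: add a variation carrying the commit hash, the
# Travis build number, and the PR message, so deployers can pick it by name.
import os
import requests

def register_build_variation(flag_key: str, commit: str, build_num: str, pr_msg: str) -> None:
    token = os.environ["LD_API_TOKEN"]   # a LaunchDarkly API access token (assumption)
    url = f"https://app.launchdarkly.com/api/v2/flags/example-project/{flag_key}"
    patch = [{
        "op": "add",
        "path": "/variations/-",
        "value": {
            "value": commit,                            # what the updater will deploy
            "name": f"build {build_num} ({commit[:7]})",
            "description": pr_msg,                      # PR message, for humans
        },
    }]
    resp = requests.patch(url, json=patch,
                          headers={"Authorization": token,
                                   "Content-Type": "application/json"},
                          timeout=10)
    resp.raise_for_status()
```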

Our CD process is where we've spent most of our time building and creating new things. You can see right in the very center we've got this thing called cluster updater. It's a small agent that we run; this is the same agent that sends information back to RazeeDash to tell us what's running. It also talks to LaunchDarkly to understand what should be running, and if what should be running is not what's running, cluster updater will make the appropriate changes on that system. The cool thing about our CD process, and the coolest thing about Kubernetes, is that Kubernetes does all the hard work. We just say, Kubernetes, run this version, run this new resource, and it's done. So all we do is pull this particular version from cloud object storage and apply it based off of whatever rules are set in LaunchDarkly. So what does a deployment look like today? A deployment today basically means that we go to LaunchDarkly, and for every single microservice that we have, we have a feature flag. And by the way, I'm going to go out on a limb here and assume that everyone understands LaunchDarkly and the concepts behind it, multivariate feature flags and things like that.
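
To make the pull model concrete, here is a minimal sketch of what an agent like cluster updater could do, under stated assumptions: flag keys of the form `<service>-version`, an object-storage URL per build, and the older user-dict targeting style of the LaunchDarkly Python SDK. None of these details are confirmed by the talk.

```python
# Hypothetical pull-based updater loop: ask LaunchDarkly which build this
# cluster should run, and if it differs from what we last applied, pull the
# rendered manifests from object storage and let Kubernetes roll them out.
import subprocess
import time
import requests
import ldclient
from ldclient.config import Config

ldclient.set_config(Config("sdk-key-for-this-environment"))
ld = ldclient.get()

cluster = {   # the cluster identifies itself; LaunchDarkly rules match on these attributes
    "key": "prod-us-south-hub1",
    "custom": {"region": "us-south", "type": "hub"},
}

def reconcile(service: str, running_version: str) -> str:
    desired = ld.variation(f"{service}-version", cluster, running_version)
    if desired != running_version:
        manifest = requests.get(
            f"https://objectstore.example.com/{service}/{desired}.yaml",
            timeout=30).text
        subprocess.run(["kubectl", "apply", "-f", "-"],
                       input=manifest.encode(), check=True)
    return desired

running = {}   # service -> version we last applied
while True:
    running["armada-api"] = reconcile("armada-api", running.get("armada-api", "unknown"))
    time.sleep(60)   # matches the "running within 60 seconds" cadence from the talk
```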

So every microservice has a multivariate feature flag associated with it, and that flag has a variation for each of our builds. We learned really quickly that we actually need to keep that number of variations low, because when we weren't cleaning up we would have thousands of variations for every microservice, and it turns out that LaunchDarkly sends every variation down to every client, so I think we were pushing a few megs of data down to every client, or every cluster, just to manage our rules. So to do a deployment, our developers go to LaunchDarkly, they select the feature flag for their service, they select a rule, they select which variation they want to apply to that rule, and they save the changes. As soon as they make that change and hit apply, within 60 seconds any targets matching that rule will automatically be running that particular version of that microservice.

We've actually added some things during our build process: when we create that variation, we push some data in from Travis, like the PR message, the Git commit hash, and the Travis build number. So when users are selecting which variation to deploy, they have an easy choice and they understand what's going on. And again, whether a rule matches one cluster, 10 clusters, or 10,000 clusters, our deployments are all the same. The really cool thing about this is that when we bring up a new cluster, we don't have to change our CD system, because when we bring up the cluster it identifies itself: hey, I am in this region, here's my name, here's the type of cluster I am. And the rules in LaunchDarkly will automatically pick that up and apply all the appropriate software that needs to be on the system.

So in typical IBM fashion, we rewrote the LaunchDarkly UI in our RazeeDash tool. And there's actually a good reason for this. One, we did not want to just give all of our developers carte blanche access to deploy any version of code to any microservice, and at the time we were just using the team version of LaunchDarkly, which meant that anyone who had access to LaunchDarkly had either read, write, or admin access to everything. Our development process focuses around GitHub, so our assumption is that if you have access to the GitHub repo that your code belongs to, you also have access to deploy that code. And so what we've done here is, in RazeeDash you're actually signed in with your GitHub credentials, so before you can make a change to the rules through RazeeDash, we verify that your user id has write access to the appropriate repo associated with that particular microservice.
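
Here is a sketch of that authorization idea: check the user's GitHub permission on the repo that owns the microservice before applying a rule change. The org name, repo mapping, and GitHub Enterprise API host are assumptions for illustration.

```python
# Hypothetical check: only users who could merge code into the service's repo
# (write or admin permission) are allowed to change its deployment rule.
import requests

GITHUB_API = "https://github.example.com/api/v3"   # a GitHub Enterprise-style base URL

def can_deploy(username: str, repo: str, gh_token: str) -> bool:
    resp = requests.get(
        f"{GITHUB_API}/repos/armada/{repo}/collaborators/{username}/permission",
        headers={"Authorization": f"token {gh_token}"}, timeout=10)
    if resp.status_code != 200:
        return False
    return resp.json().get("permission") in ("admin", "write")

if can_deploy("mmckay", "armada-api", gh_token="..."):
    print("apply the rule change in LaunchDarkly")
else:
    print("reject: no write access to the service's repo")
```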

And again, it looks almost identical; we made some small changes compared to the LaunchDarkly UI, but this is actually how our developers make their changes. In addition, our change management process is also built into this. Once a user selects the new version and hits apply, it goes through the change management process: we push all the information we know about the deployment into our ServiceNow change management process, and once that's approved, and most of these are approved automatically, the request is applied to LaunchDarkly. And again, the only thing we're doing is applying the rule changes to LaunchDarkly. So that's how we do our deployments. The next thing is, how do we deal with secrets?

And this is actually a pretty big issue we had early on. At the time we had, I think, about 13 different environments, and every environment had a different set of passwords, a different way to connect to etcd, different tokens for OWASP, different endpoints, all this environmental configuration that we had to track. What we did was create a whole bunch of private GitHub repos, and we stored it all in plain text. We thought that was secure, but it turned out like 89 people had access to all these GitHub repos, so we really weren't that secure. So this is the next issue we had to fix: how we dealt with all of our secrets and environmental configuration. We came up with something called Armada Secure, and it solves several problems that we had. One is that we had to find a way to deal with secret data rather than just throwing plain text in a private GitHub repo.

Two is that we had lots and lots of duplicate information. Even though we had these dozens or so different GitHub repos, a lot of the data was the same, especially for clusters in the same region: they all have the same etcd credentials, they all have the same BSS IDs, and things like that. So we came up with a new project called Armada Secure, and Armada Secure builds this configuration hierarchy. Keep in mind, Armada Secure is all in a single GitHub repo. It's a public GitHub repo, public in the IBM sense that you have to belong to IBM's internal GitHub Enterprise, but anyone in IBM can see this repo. And we've built this hierarchy: anything that's at the very top goes everywhere, that configuration goes to all of our clusters.

The next level down is data for a specific region. Anything at the lower levels can override stuff at the upper levels, and the next step down from that is cluster type. So, for example, you may have a cluster of type spoke in US South, and you can define that at the third level. And we can actually go one more level down than that: we can target specific clusters with our configurations as well. So this is how we deal with the data duplication: by creating this hierarchy. The next thing is that all the data stored in the Armada Secure repo we try to keep as open as possible; it goes back to the transparency culture shift that we had. So if anyone were to open up the repo, they would see: here are the config maps, here's the data in those config maps, here's the structure of the secrets, here are some service accounts, here are all these different Kubernetes resources that are defined and are going to go to the Kubernetes clusters.
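
As an illustration of that override hierarchy, here is a minimal sketch of merging layers from most general to most specific. The layer names come from the talk; the data shapes and values are assumptions.

```python
# Hypothetical hierarchy merge: global config applies everywhere, and each
# lower level (region, cluster type, specific cluster) overrides what is above it.
def merge_config(*layers: dict) -> dict:
    merged = {}
    for layer in layers:          # later (more specific) layers win
        merged.update(layer)
    return merged

global_cfg  = {"log_level": "info", "registry": "registry.example.com"}
region_cfg  = {"etcd_endpoint": "etcd.us-south.example.com"}
type_cfg    = {"replicas": 3}             # e.g. clusters of type "spoke" in us-south
cluster_cfg = {"log_level": "debug"}      # one specific cluster overrides the default

print(merge_config(global_cfg, region_cfg, type_cfg, cluster_cfg))
# {'log_level': 'debug', 'registry': 'registry.example.com',
#  'etcd_endpoint': 'etcd.us-south.example.com', 'replicas': 3}
```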

The one thing people can't see is the sensitive information. So if you have, say, an etcd password or a particular BSS token, what we do with those is inject them when we build the Armada Secure project itself. We use GPG encryption, so we have an asymmetric key pair, and we check the public key into the repo itself. Because of this, any one of our developers that has write access to this repo can contribute to the configuration for these environments ...

... but what they can't do is see into any of this secret configuration. They can see the resources, they can see the stuff that's not private, but they cannot see passwords and all the other sensitive stuff. This is one of the big changes that we had to make to, again, help our service scale. This type of scale means not having duplicate data everywhere and making it easier for developers to contribute to this configuration.
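
A minimal sketch of that write-but-not-read property, assuming the python-gnupg package and GPG key handling roughly as described: anyone can encrypt a new secret with the public key checked into the repo, but only the holder of the private key (the build that injects secrets) can decrypt.

```python
# Hypothetical helper: encrypt a secret value with the repo's public GPG key so
# the ciphertext is safe to commit alongside the rest of the configuration.
import gnupg

gpg = gnupg.GPG()

def encrypt_secret(plaintext: str, public_key_armored: str) -> str:
    import_result = gpg.import_keys(public_key_armored)
    fingerprint = import_result.fingerprints[0]
    # always_trust avoids having to locally sign the key in this sketch
    encrypted = gpg.encrypt(plaintext, fingerprint, always_trust=True)
    if not encrypted.ok:
        raise RuntimeError(encrypted.status)
    return str(encrypted)   # ASCII-armored ciphertext

# A developer would paste the ciphertext into the config repo; at deploy time the
# pipeline decrypts it with the private key and injects it into a Kubernetes Secret.
```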

Finally, step six. Ironically, feature flags were one of the last things we did. We actually came to LaunchDarkly a little over two years ago wanting to use feature flags, but as you can see, we've been using it to manage our deployments, not actually to manage features. Recently, probably in the last year, we really started realizing we need to start using feature flags; we need to stop rolling out new features as a code deployment.

We are actually in the middle of our transformation of rolling out features, or enabling features, using feature flags. One of the best quotes I like to use, and this comes from LaunchDarkly, is "testing in production." I love that quote. In fact, when I talk to other folks at IBM about what we're doing with feature flagging, I specifically talk about testing in production. A lot of people think I'm nuts when I say this, but once you go through the pros and cons of doing dark launches, of rolling out new code and enabling features with feature flags rather than just rolling them out as code updates, it's a pretty easy argument to make.

We're also using this to enable alpha and beta testers in our environment. Again, all of our code, whether it's new versions that are ready or new features that are in beta testing, is all running in our production environments. This has definitely been a godsend for us to avoid that big deployment of a new version or a new feature where everything breaks because we've never tested it in production before.
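
For illustration, here is a hedged sketch of gating a new code path behind a boolean flag so alpha and beta testers exercise it in production while everyone else keeps the existing behavior. The flag key, the user-dict targeting style, and the two stub functions are assumptions.

```python
# Hypothetical feature gate: beta testers (targeted by LaunchDarkly rules) get
# the new code path; everyone else gets the existing one.
import ldclient
from ldclient.config import Config

ldclient.set_config(Config("sdk-key-for-production"))
ld = ldclient.get()

def new_cluster_list():
    return ["..."]     # hypothetical new implementation

def legacy_cluster_list():
    return ["..."]     # existing implementation

def list_clusters(user_id: str, account_plan: str):
    user = {"key": user_id, "custom": {"plan": account_plan}}
    if ld.variation("new-cluster-list-api", user, False):
        return new_cluster_list()
    return legacy_cluster_list()
```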

Finally, as part of using feature flags, we have a lot of new features that span several components: API changes that we make, command lines, UI changes, billing changes. A lot of the time we use feature flags to coordinate these changes across all these different microservices. The one thing we're working on right now is with our docs team. I think I was talking with Heidi last night about how we're trying to coordinate our docs team, since they're architected and organized a bit differently, to get them into the fold using feature flags as well.

Our nirvana point is that some day we're going to be able to turn a feature on or off and have the user's command line, APIs, UIs, even the documentation automatically adjust according to which features they have access to.

That's it. Any questions? I know I've gone over a little bit here; I apologize, because I know it's break time very soon, but I will be around this evening if anyone has questions. I will also make my slides available.

Audience Member: Yeah, I was ... Sorry. You listed off a huge number of services that your team [used] to facilitate all of this. Can I ask you a bit of an incendiary question? Why didn't you just build the LaunchDarkly piece too? I was surprised you pulled in a third-party service to solve that problem after everything else is-

Mike: One of my personal mottos is, "If someone else runs it and they can host it for us, I'd much rather do it that way." We've actually looked ... we've talked about, at one point, building a feature flag service, but when you look at the cost of us doing it, even if you have one full-time employee doing this, it's way cheaper just to have LaunchDarkly host it for us. Not only that, but LaunchDarkly is going to provide us a better service and more features. They're the ones actually keeping up with the industry to provide the new features and things like that.

In fact, it's funny you say that, because we used to even host our own databases until we switched over to using Compose. Now Compose has been rolled into IBM Cloud, but then again, that was another learning experience for us. It's like, we need to stop trying to run our own stuff.

Any other questions? All right. Well thank you very much. Enjoy the conference.