In April we hosted our Test in Production Meetup back at the Heavybit Clubhouse in San Francisco. Michael McKay, Senior Software Designer at IBM, kicked off the evening with a talk about managing deployments at scale with Kubernetes and LaunchDarkly.
“The cool thing about this is that since everything is decentralized and we don’t have a single central deployment server pushing code out, we could theoretically push out new code to every single one of our clusters within 60 seconds, because each of these Cluster Updaters is running independently.”
Watch his talk below to hear how his team runs the IBM Cloud Container Service. If you're interested in joining us at a future Meetup, you can sign up here.
TRANSCRIPT
What I'm here to talk about today is deployment at scale, specifically around the IBM Cloud Container Service. A small plug about myself: I've been with IBM for 22 years. I started off doing SAP deployments within IBM before virtualization; we were still running Token Ring at the time, I think. That's how long ago it was. I then moved from services to doing product development, so we were actually building products, putting them on a CD, and mailing them out to customers.
For about the past three years I've been doing cloud development, basically learning what cloud native applications are and what running a service means. We've been on that journey within IBM overall, learning how to do this stuff. And what we're going to talk about tonight is where we're at today with the IBM Cloud Container Service.
One of the reasons I put this little arrow in here is that when I did all my initial work doing SAP deployments, I found a lot of parallels between building, running, and managing an SAP service and running a cloud service today. Some of the lessons we learned 20 years ago are still applicable today. So that's the kind of stuff we'll get into as well.
And now the plug for the IBM Cloud Container Service: essentially, it's hosted Kubernetes. You go to ibmcloud.com and say, “I want a new Kubernetes cluster,” you sign up, say, “Create me a cluster,” and give some specifications. We do standard Kubernetes clusters using virtual machines, or we can even do bare metal boxes. We do all of it for you: we set the clusters up, we manage them, make sure they're secure, all that kind of good stuff.
I'm going to talk about something called Razee, and I'm pronouncing that wrong on purpose because I don't know how to say it the right way. If you're familiar at all with Kubernetes, you'll realize that everything has this kind of nautical naming to it. There's actually a Wikipedia page that's just a list of nautical terms, and once you get into Kubernetes, you'll go to this wiki page and see that it kind of matches what you're doing. And that's what we do with Razee.
A razee is actually a ship that has been cut down: they've taken a couple of decks off the ship to make it smaller and nimbler. So there's a little bit of meaning behind the name Razee, and that's what we came up with. Also, the internal code name for the IBM Cloud Container Service is Armada. The reason I bring this up is that I'm going to be talking about Armada and Razee throughout the presentation, so I want to make sure you know the history around what they mean.
So, the scaling of Armada. I started this job about a year ago. At that time we had one data center in Dallas that we ran the entire IBM Cloud Container Service out of. Now we're running in 29 data centers across 15 regions. In Dallas alone we have about four different data centers, and we have them spread across the world. I think the only places we're not in yet are Africa and Antarctica, hopefully coming soon. I don't know about Antarctica, but. This gives you an idea of the scale of the application itself.
The architecture of Armada is that for every one of those 15 regions, we have what we call hubs and spokes. One of the things I want to mention is that we basically manage Kubernetes with Kubernetes: our whole control plane is Kubernetes, and everything we deploy is a Kubernetes cluster. So we have hubs, and once we run out of capacity to manage a certain number of clusters, we spin up another spoke, and that spoke can then be used to manage additional clusters in that particular region.
Our customers' clusters we call cruisers. And we have something called patrols: when you go to the IBM Cloud, you can say, “Hey, I would like to try this out,” and you get a free cluster, and we actually manage those in a slightly different data center than we do the rest of them.
Some of the challenges that we have: we have 60-plus clusters just to manage the control plane. Each one of these clusters has the Kubernetes master and a set of workers, networking, storage, all kinds of stuff around that. So how do we keep those clusters themselves up to date and managed properly? And how do we manage the roughly 1,500 deployments of our software across those 60 clusters? Armada itself is made up of several different services: we have services for the UI, the API, the deployment process, the cluster process. So we have all these various microservices, and it totals around 1,500 separate deployments just for our control plane.
And on top of that, we manage upwards of 10,000 customer clusters, and each one of those customer clusters also has a little bit of IBM code running on it. So now we have to manage code across tens of thousands of clusters. Not only that, but our teams themselves are spread across the world. IBM loves to do this; we've got so many people we just can't do it all in one spot. So we've got folks from our team in Raleigh and Austin, the UK, Germany, and I know I'm missing some places. The point is we have this global team spread across the world, and that adds to the challenges: how do we make sure that all the teams can operate efficiently together?
So when I started, the group had carried over a lot of the old practices. They were just deploying to a single data center, so they had set up all these Jenkins jobs, this pipeline to push from dev to stage, to pre-stage, to production. And that deployment process took around three or four hours.
We also tended to treat Armada itself as this big monolith. We'd say, “Okay, here's Armada 1.5.3. Let's push this whole thing from stage into production.” And that was not scalable, because we couldn't centrally build and push this thing out to six different places. If we'd kept the same model, we'd probably do a release maybe once a month. And on top of that, since we were doing these big monolithic releases, the releases themselves got very big: you would have literally hundreds and hundreds of changes going into every single deployment.
So we came up with some principles: what do we want to focus on, and what do we make better for the container service? The first thing we focused on is visibility. This is one of the things I learned 20 years ago doing services: the most important things to know are what's actually running where, when it got updated, and whether there are issues in the environment. So visibility is the first thing we tackled. Now, not only do we know when we deployed something, but we know exactly what's running in every single cluster at any one time. In the past we knew what we deployed, but we didn't know what was actually running. So that was really helpful.
Next, transparency. This is kind of an IBM-wide thing. In the past we used to have RTC or CMVC for these code repositories, and it was kind of a need-to-know basis to see the source code. So we've adopted the traditional GitHub model where everything is open. If the team that works on the UI has a question about the API, they can just look at the code for the API; they can submit a PR for a code change. This is all stuff that we in the room know and love, but it's something we had to learn over time.
The next thing is decentralizing and decoupling our services. As I mentioned before, we kind of treated Armada as just one big monolithic thing, and we've learned now to decouple these services. Because we treated it as a monolith, there were like two guys in an organization of over 200 people who could actually push out a release of Armada. And they hated their jobs, because they were basically either doing deployments or freaking out about how to fix a broken deployment.
So part of the decentralization was, “Well, let's take those deployment steps, the building and processing of the actual microservice, and push them back to the squads that own the code.” What's interesting is that when we first talked about this, there was a lot of pushback from the two guys who actually had to do the releases. Their excuse was, “These guys don't know how to do deployments.” And our response was, “Well, they need to learn, because you can't really support your code in an environment if you don't know how to deploy it into that environment.”
So that was one big step that we took. And also decoupling: we had interdependencies between our services, so they weren't really true microservices. We're about halfway through; we still have some dependencies, but we've managed to break a lot of them. This was one reason people wanted to push the monolith: they would say, “Well, we've got these 25 things. We know they work in this particular configuration, so we can only push in that configuration.” And we broke that mindset. It was really just live and learn: we decided one day, “Okay, everyone's going to do their own deployments.” And guess what? The sky didn't fall and the sun still rose the next morning. So we've actually been moving much faster and much more efficiently because of that.
Standardization is a big thing for us, which sounds like an oxymoron because I just said we're trying to decentralize and decouple. But we still want some standardization; we didn't want the teams to go off and do whatever they wanted. So we set some standards: you have to name your deployments the same as your GitHub repo, you have to use Travis CI to do builds, things like that. Best-of-breed processes and tooling are what we focus on, and it doesn't mean that just one team comes up with these processes. If another squad says, “Hey, we found a better way to build this thing,” we'll investigate that, and if it honestly works better for everyone, then everyone else will pick it up.
And finally, simplicity. I don't know if this is an industry-wide thing or an IBM thing, but we love to build complex, over-engineered architectures. So a lot of times when we get PRs for changes to our processes and pipelines and things like that, we will actually go back and say, “No, make it simpler.” And sometimes simplicity means no automation. This is one of the things we learned, because a lot of times you hear, “Automate your pipelines as much as possible.”
Well, what we found out is that certain things, like automatically promoting code from dev to stage to pre-prod to production, didn't really buy us anything, because it's actually easier for the developers who wrote the code to go through and click a couple of buttons to do the deployment. I'll show you in a bit in the demonstration, but we've made deployment so easy that we didn't really need automation to do the promotion through the different stages.
So the Razee CI/CD process looks like this at a very high level. You'll notice it's mostly standard stuff: Travis CI, Kubernetes, GitHub. One thing you'll notice in here is LaunchDarkly, and this is, I think, one of the new and novel ways we've been using it: we actually use LaunchDarkly to control our deployments. We feature flag each component in our environment. That feature flag is a multivariate flag, and we can say, “Hey, I want to roll this particular component out to five percent of the customer clusters in Dallas,” or “I want to roll this out to everything in the AP North region.” We'll get to that in a little bit.
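To make that concrete, here's a minimal sketch of the kind of flag lookup he's describing: each cluster is evaluated against a per-component multivariate flag whose variations are build versions. It assumes LaunchDarkly's server-side Python SDK (older user-dict style); the flag key, SDK key, and cluster attributes are made up for illustration, not IBM's actual configuration.

```python
import ldclient
from ldclient.config import Config

# Initialize the server-side SDK once (SDK key is a placeholder).
ldclient.set_config(Config("sdk-key-placeholder"))
ld = ldclient.get()

# Treat each cluster as the "user" being targeted; rules can match on region,
# cluster type, name prefix, or a percentage rollout.
cluster = {"key": "prod-carrier-5",
           "custom": {"region": "us-south", "clusterType": "carrier"}}

# Variations are image versions (git commit hashes); the default means "no change".
desired = ld.variation("armada-api", cluster, "no-change")
print("armada-api should be running:", desired)
```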
So one of the things we came to realize is that CI and CD are not the same thing for us. In the past we tried to solve the problem of how to get CI and CD to work together seamlessly, and we just gave up and said we're not going to do that. We're going to commit code, Travis is going to build it into an image, and that's our CI process. It's just building an image, and it stops once the image is in the repositories.
So we have our own GitHub Enterprise service, which coincidentally is the largest GitHub installation besides github.com itself; that's github.ibm.com. We use Travis CI for all of our builds, so we do all of our Docker builds, linting, code scans, and uploading of images to the Docker registries from Travis. We have our own Docker repositories that we offer as an IBM Cloud service, so we upload to there. We actually have some services that we upload to Docker Hub itself, but that's still kind of frowned upon at IBM, so we try not to do that very often.
We also use IBM Cloud Object Storage to store the non-image stuff; all of the configuration for our clusters is stored in Cloud Object Storage. And finally, once Travis is done building and uploading the images, we tell LaunchDarkly, “Hey, we've got a new variation. We've got a new version of this image available.”
So then what happens is our CD process. On all those clusters, all the tens of thousands, and specifically the 60 control plane clusters, we have something called the Cluster Updater. The Cluster Updater's job is to check which particular services are running on a cluster, and then for every microservice, compare the running version to what LaunchDarkly tells us should be running on that cluster. So for example, Armada API may be at version 111, and it may ask LaunchDarkly, “Hey, which version should Armada API be at?” and it comes back 112. The Cluster Updater will then set that new image on the Kubernetes cluster, and Kubernetes will do the deployment from 111 to 112.
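Here's a hypothetical sketch of what one pass of such an updater could look like, using the Kubernetes Python client; it is not IBM's actual Razee code. It compares each Deployment's image tag with a target version (for example, the LaunchDarkly lookup above, passed in as `desired_version`) and patches the Deployment so Kubernetes performs the rollout.

```python
from kubernetes import client as k8s, config as k8s_config

k8s_config.load_incluster_config()   # the updater runs inside the cluster it manages
apps = k8s.AppsV1Api()

def reconcile(namespace, desired_version):
    """desired_version(service_name) -> target image tag (e.g. a LaunchDarkly flag lookup)."""
    for dep in apps.list_namespaced_deployment(namespace).items:
        container = dep.spec.template.spec.containers[0]
        repo, _, running = container.image.rpartition(":")    # e.g. registry/armada-api:<sha>
        target = desired_version(dep.metadata.name)            # flag key assumed == deployment name
        if target and target not in ("no-change", running):
            patch = {"spec": {"template": {"spec": {"containers": [
                {"name": container.name, "image": f"{repo}:{target}"}]}}}}
            apps.patch_namespaced_deployment(dep.metadata.name, namespace, patch)
```

Run on a 60-second loop, that is essentially the whole CD side: there is no central deployment server, just every cluster converging on whatever the flags say.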
It's a little bit more involved than that; we actually use git commit hashes for everything, not version numbers, but you get the point: LaunchDarkly is our source of truth telling all of our clusters what should be running everywhere. On top of that, any configuration we have, everything that's not an image, is also versioned, and that version comes from LaunchDarkly. So LaunchDarkly will say, “Hey, armada-secure, you should be running version 120.” If 120 is not applied on that particular cluster, the Cluster Updater will pull it down from Cloud Object Storage and apply it to Kubernetes.
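The config half could work the same way. Here's a sketch, again hypothetical rather than the real Razee tooling: the flag value names a version of the armada-secure bundle, which gets fetched from Cloud Object Storage (S3-compatible, so boto3 works against it) and applied to the cluster. The bucket name, key layout, endpoint, and credentials are all placeholders.

```python
import subprocess
import boto3

# Cloud Object Storage is S3-compatible; endpoint and HMAC credentials are placeholders.
cos = boto3.client("s3",
                   endpoint_url="https://s3.us-south.cloud-object-storage.appdomain.cloud",
                   aws_access_key_id="...",
                   aws_secret_access_key="...")

def apply_config(version, cluster_name):
    # e.g. LaunchDarkly said armada-secure should be at "120" (really a commit hash).
    obj = cos.get_object(Bucket="armada-secure", Key=f"{cluster_name}/{version}.yaml")
    manifest = obj["Body"].read()
    # Hand the YAML to kubectl and let Kubernetes reconcile the resources.
    subprocess.run(["kubectl", "apply", "-f", "-"], input=manifest, check=True)
```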
If you're not familiar with Kubernetes, basically everything is a resource, and we've chosen to use the YAML representation of those resources. Kubernetes actually does most of the hard work for us: rolling updates, managing any changes, health checks, liveness checks, Kubernetes does all of that. We basically say, “Hey, Kubernetes, go run this version of Armada API. Here's the URL to the image. Just let me know when you're done.” From that perspective it works really well.
And finally, the last thing the Cluster Updater does is post that information, basically the current state of the cluster, back to a service we created called RazeeDash. That's where the visibility comes into play.
So that's how we deploy stuff. One of the other problems we had to solve was configuration: how do we manage the configuration of, you know, 10,000 and 60 clusters? We started specifically with the 60 clusters that make up our control plane, and what we found is that most of the data was the same across all these different regions. Only a few things changed: the etcd user ID, password, and URL were different, tokens for OAuth, and things like that.
So what we did is we built a repository called armada-secure, and at the very top of the tree is all of our general configuration: anything that gets applied to every cluster goes at the top. The next level below that is per region, so if we have something specific for, say, AP North or US East, we put that configuration in that directory. And then finally, down to the type: we have carriers, spokes, cruisers, and patrols, and we also have another one for razee to run some of our infrastructure tooling. Again, that level will be applied specifically to that particular type of cluster in that particular region.
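As a rough illustration of how that layering could resolve for one cluster, here's a sketch where later layers (region, then cluster type) override the general defaults. The directory names and file layout are hypothetical; the talk only describes the structure at a high level.

```python
from pathlib import Path
import yaml  # PyYAML

def effective_config(repo, region, cluster_type):
    """Merge general -> region -> cluster-type config, with later layers winning."""
    merged = {}
    for layer in [Path(repo), Path(repo) / region, Path(repo) / region / cluster_type]:
        for f in sorted(layer.glob("*.yaml")):
            merged.update(yaml.safe_load(f.read_text()) or {})
    return merged

# e.g. effective_config("armada-secure", "ap-north", "cruiser")
```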
So what this allows us to do is have a single project which manages the configuration for our whole control plane in one place. And the cool thing about it is that it offers a kind of single direction of encryption. We make the public key easily accessible to any member of the organization, so anyone can take data, encrypt it using that GPG key, and check it into the project. Then during our build process we decrypt it and re-encrypt it for each cluster. Every single control plane cluster has its own public-private key pair, and when data is pushed or pulled to these clusters, only that cluster can decrypt the data for its own particular use.
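A minimal sketch of that one-way flow, using python-gnupg: anyone can encrypt with the shared public key, while the build decrypts once and re-encrypts per cluster. The key names, recipients, and the secret itself are invented for illustration.

```python
import gnupg

gpg = gnupg.GPG()

# 1) A developer encrypts a secret with the org-wide public key and checks it in.
checked_in = gpg.encrypt("etcd-password=s3cret", ["armada-secure@example.com"])

# 2) The build process, which holds the matching private key, decrypts it once...
plaintext = gpg.decrypt(str(checked_in))

# 3) ...and re-encrypts it for each cluster's own public key, so only that
#    cluster can read the copy that gets shipped to it.
for cluster_key in ["prod-carrier-1@clusters.example", "prod-carrier-2@clusters.example"]:
    per_cluster = gpg.encrypt(plaintext.data, [cluster_key])
    # str(per_cluster) would then be uploaded to Cloud Object Storage for that cluster.
```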
So that's how we manage the configuration of all the environments. It's been really cool to see this grow, because again, we started where everyone had their own configuration: Armada API had their own, Armada UI had their own, and like 85% of the stuff was the same. Getting everything meshed together into a single project was, yes, a bit of a hassle, and we had a lot of confusion about what was happening. But at the end of the day everyone's been super happy with it, and they've found that managing the configuration itself has been much, much easier.
So that's my spiel, my pitch. I'm going to jump over real quick and show you what the tooling actually looks like. Let me make sure I can get this to work here. Let me end the slideshow and... okay.
So let's start with RazeeDash. This is an application we wrote, and it's what solved our visibility problem. Again, every one of the clusters, all 10,000 and 60 of them, reports back information every 60 seconds about what's running in the environment. So from that we can see here that there's... of course, I lied. It's not 10,000, because we're actually in the process of rolling out to those 10,000 clusters. We have 2,700 clusters right now that are reporting, and across those there are 37,000 deployments.
We can also track how deployments change over time. Since we collect all this information every 60 seconds, we can not only see what's currently running, we can tell when things change; we record when things change as well. We can see recent deployments: which cluster, the name of the deployment, and when it was updated. One of the things we can also do is go to the clusters and do a search, so we can search for, say, prod-carrier.
So these are all our production carriers that we have running in the environment, and if I pick any one of these I can see that this particular cluster is running all of these services. In the armada namespace itself we have 20 or so different services, and we have things running in the kube-system namespace. We talked about standardization: since we standardized on the GitHub commit hash for all of our image tagging, we're now able to quickly see the version of the microservice itself, the version it's running, the version LaunchDarkly says it should be running, and when it got changed.
So if I click here on when it got changed, we can see that they went from this version to this version nine days ago, and two months ago they'd updated. This is obviously one of the microservices that doesn't change very often. But if we go to something like Armada UI, they should be changing more frequently; you can see 2 days, 12 days, 14 days ago.
My excuse for the limited number of deployments: for a while Armada UI was pushing out code almost once a day, but we're kind of in the middle of this GDPR compliance shindig right now, so that's slowed us down a bit. But we do have the history here, and since it all revolves around GitHub, I can click on this link right here and it'll basically bring me to the change set that went into that particular version of the code that's running.
So this provides us visibility into what's running on a particular cluster. I also have different views: I can look across deployments. In this case, here's the RazeeDash application, and you can see that it's actually deployed to two different clusters here. If I click on the deployment name itself, we can see a little bit more information: where it's deployed, in this case it's only running in two clusters, and also the deployment rules themselves. Now, these deployment rules just mirror what's already in LaunchDarkly; we're using LaunchDarkly as the source of truth. The cool thing about LaunchDarkly is that you can set up these really elaborate rules that say, “If it's Wednesday and you're in Dallas and the sun is shining, then roll out this version.” Obviously we're not going to go that crazy with it, but what we usually say is, “If your region is AP North, you're going to get this version. If your cluster name starts with dev-, then you get this other version.”
What we've done here is expose just the rules and the versions associated with those rules through RazeeDash. One reason we did this is that we wanted to tie the access control for who can actually set these deployments to the people who have write access to the GitHub repos. When I signed in to the RazeeDash application itself, I signed in using the standard OAuth mechanism built into GitHub, and because of that, RazeeDash now knows which organizations I belong to and which repos I have access to. So it knows that this is my GitHub repo URL, it knows that I have write access to it, and it will allow me to change rules. If I only had read access, or if I didn't have access to it at all, it wouldn't allow me to change these rules.
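The permission check he's describing could be as simple as asking GitHub whether the signed-in user's OAuth token has push access to the repo behind a deployment. This sketch uses the standard GitHub REST API; the Enterprise API base URL and repo name are examples, and RazeeDash's real implementation may differ.

```python
import requests

def can_edit_rules(oauth_token, repo_full_name,
                   api_base="https://github.ibm.com/api/v3"):
    """Allow rule changes only if the token has write (push) access to the backing repo."""
    resp = requests.get(f"{api_base}/repos/{repo_full_name}",
                        headers={"Authorization": f"token {oauth_token}"})
    if resp.status_code != 200:
        return False                      # can't even read the repo
    perms = resp.json().get("permissions", {})
    return bool(perms.get("push"))        # write access => allowed to change deployment rules

# e.g. can_edit_rules(session_token, "armada/armada-api")
```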
So that's how we typically do deployments. We also audit all of our changes through here. If I look at our profile... actually, if I go to the Audit Log, we can see everything that's changed through RazeeDash: things like when the rules changed, or when we archived clusters, stuff like that. It's all audited through here. So that's how we view what's running, and we're actually coming up with new ways we can mine this data to see how we can make our development process better and more efficient.
And finally, this is LaunchDarkly. LaunchDarkly is what does the magic for us, allowing us to manage these thousands of deployments across many, many clusters. For every service that we have, we have a feature flag. In this case I'm going to pick on armada-billing, because they actually do things the right way. armada-billing has a set of rules, and we'll scroll in here. For every rule you'll see, for example, a rule that says if the cluster name starts with dev, roll out this version, or if the cluster name is stage-south, roll out this version.
And the cool thing about this is that since everything is decentralized and we don't have a single central deployment server pushing code out, we could theoretically push out new code to every single one of our clusters within 60 seconds, because each of these Cluster Updaters is running independently, and every 60 seconds they're checking for updates. So we could update 1,500 deployments all at once. Should we? No. Could we? Yes. But what we've settled on instead is that we roll out by region.
We actually pick on the Australians first: we roll out stuff to Sydney and make sure it doesn't break there first. That's kind of our canary test. This is after we've already gone through our existing stage environment. But what we find is that, you know, there's no place like production to test stuff in. Because we'll push to stage, it works great; we'll put it into pre-prod, it works great; we'll push it out to production, and bam: “Oh, we forgot about this.” So what this allows us to do is easily test stuff in Sydney first. And the cool thing about the process here is that we can go either way: we can roll releases forward or backward.
So in this case we're here, and, for example, on dev they're running version 717. If I want to go back to 712, I'll just click save here, and within 60 seconds this new version will be running for armada-billing, right? So I can look at armada-billing, and that was on dev. So we should see here... here's our armada-billing running on dev-south, I think it's carrier 5. And hopefully what we should see here shortly is that one of these gets updated to “a few seconds ago.” So that's actually how we do our deploys: it's just changing a feature flag and having that roll out.
And then it's up to the squads themselves. We talked about squad autonomy and letting them do their own thing; it's really up to the squads how they want to roll the code out. Again, we're setting best practices: “Let's do it by region. Let's automatically push out to dev when the stuff is built.” So we have that built into the feature flags themselves, I mean into the rule sets themselves.
And the cool thing, I don't know if you noticed when I dropped this down, is that the variations themselves are generated from Travis. When Travis is done building the code, it will actually create the variation, and the variation contains not only the GitHub commit hash but also the Travis build number, so you have some sort of sequence in here. It also has the GitHub pull request message, so we have some human-readable English text saying what actually went into the change.
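A post-build step like that might look something like the following, calling LaunchDarkly's REST API to append a variation to the component's flag. This is a hypothetical sketch, not IBM's actual Travis configuration: the project and flag keys are assumptions, and the JSON-Patch shape reflects the public flag-update API as best I recall it.

```python
import os
import requests

commit = os.environ["TRAVIS_COMMIT"]                 # git commit hash, doubles as the image tag
build = os.environ["TRAVIS_BUILD_NUMBER"]            # gives the variations a rough ordering
message = os.environ.get("TRAVIS_COMMIT_MESSAGE", "")

resp = requests.patch(
    "https://app.launchdarkly.com/api/v2/flags/default/armada-api",  # project/flag keys assumed
    headers={"Authorization": os.environ["LD_API_TOKEN"],
             "Content-Type": "application/json"},
    json=[{"op": "add", "path": "/variations/-",
           "value": {"value": commit,
                     "name": f"build {build}: {commit[:7]}",
                     "description": message}}],
)
resp.raise_for_status()
```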
So that is how we build and that's how we deploy. And with that, thanks for listening.