Adrian Cockcroft has had a long career working at the leading edge of technology, and is fascinated by what happens next. In his role at AWS, Cockcroft is focused on the needs of cloud native and "all-in" customers, and leads the AWS open source community development team. Prior to AWS, Cockcroft worked at organizations including Sun Microsystems, Netflix, and eBay. He has been recognized as a Distinguished Engineer, was one of the founding members of eBay Research Labs, and while at Netflix began speaking regularly at conferences promoting ideas around DevOps, microservices, cloud, and containers. Cockcroft holds a degree in Applied Physics from The City University, London, and is a published author of four books, notably Sun Performance and Tuning (Prentice Hall, 1998).
And if the code you built fails, it actually deletes the code from your editor, and you have to type it from scratch again. This is to force you to type ... It's amazing, because you start doing one-line-at-a-time commits, right? And the system has to pass tests, pass your test-driven development environment, all the time. So it's called TCR: test && commit || revert. He wrote a paper on it last year. Really interesting concept, but what you end up with is a system that is incrementally always working. And it forces people into an incredibly lightweight update model. So you want to get to where you can turn features on for tests once they work for everybody. One of the things — even when I joined Netflix in 2007, way before all of this cloud and microservices stuff, every single thing they did was an A/B test or behind a feature flag. Every single thing. You couldn't get a feature out unless you proved, with two test cells with non-overlapping confidence intervals, statistically valid, that this is better than that. And they would do it over and over again for just about everything. We were tuning personalization algorithms with this and changing the shape and layout of the home page, and all of those things were done in that way. So it's an incredibly powerful way, and they got very good at it. The product manager's job was to figure out a new idea, something worth testing. And the engineering manager's job was to figure out the fastest way you could actually test that, to see if it made the world better. And the other thing is, there's a bunch of research around this. If you come up with 100 things that you want to change, it turns out a third of them will make the product better, a third of them will make no difference, and a third will make it worse. And you can't guess which third is which in advance.
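The TCR loop described above can be sketched as a tiny wrapper around your test runner and version control. This is a minimal illustration of the idea, not any real tool; the `pytest` and `git` commands in the comment are just examples of how you might wire it up.

```python
import subprocess

def sh(cmd):
    """Run a command, return True if it exited with status 0."""
    return subprocess.run(cmd).returncode == 0

def tcr_step(tests_pass, commit, revert):
    """One iteration of test && commit || revert: keep the change only
    if the whole suite is green, otherwise throw the edit away."""
    if tests_pass():
        commit()
        return "committed"
    revert()
    return "reverted"

# Wiring it up to a real repo might look like this (commands are examples):
#   tcr_step(lambda: sh(["pytest", "-q"]),
#            lambda: sh(["git", "commit", "-am", "tcr"]),
#            lambda: sh(["git", "reset", "--hard"]))
```

Because the revert is `git reset --hard`, a failing test really does delete your edit, which is what forces the one-line-at-a-time habit.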
Some of the biggest improvements came from things that we almost didn't try, because almost everyone was convinced they weren't going to work. One person thought it was worth trying, and it turned out to be a huge improvement. And the one that everyone thought was going to work turned out to make things worse. So you have to run all 100 tests. You have to try things. If more than half of the things you're trying aren't really speculative, you will miss the things that really improve the product. So it's a really interesting idea, and obviously there's massive use for A/B test flags and feature flags in driving that forward. So small changes give you less risk, faster problem detection, faster repair, less work in progress because you're always rebasing back to the mainline, and you don't have work sitting out on a branch waiting to be merged back in. Much less time merging changes, because the system hasn't changed very much while you did your work. Much happier developers, because they can change the world tomorrow: every time I change something, the world is better tomorrow, and I can go home and see this thing be better. And just much faster flow. So how do we get there? The key thing, I think, is to measure time to value for every team, everywhere in your system. Maybe you have Jira or something where you can measure the tickets from commit to deploy. Whatever you have, try to figure out, for every team, what are you actually doing and how long does it take? And then you can incent people, maybe with a little bit of social gamification: you put up a leaderboard and try to make everyone get slightly faster over time. Learn to do lots of small things quickly. Don't get bogged down speeding up absolutely everything that's going on. What we need is a fast path for doing small things, and maybe a more complex process for doing bigger, more critical things.
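As a minimal sketch of that commit-to-deploy measurement — assuming you can export pairs of ISO-8601 timestamps from whatever tracker you use (Jira tickets, deploy logs, and so on) — the per-team leaderboard number might be computed like this:

```python
from datetime import datetime
from statistics import median

def lead_times_hours(events):
    """events: iterable of (commit_time, deploy_time) ISO-8601 strings.
    Returns the commit-to-deploy lead times in hours, sorted."""
    hours = []
    for commit, deploy in events:
        delta = datetime.fromisoformat(deploy) - datetime.fromisoformat(commit)
        hours.append(delta.total_seconds() / 3600)
    return sorted(hours)

def leaderboard_number(events):
    """One number per team for the leaderboard: median hours to value."""
    return median(lead_times_hours(events))
```

Tracking the median rather than the mean keeps one slow outlier release from hiding steady improvement across the rest of the team's deploys.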
Don't try and use one process to do everything, or one architecture to do everything, because you'll get locked into a heavyweight process that isn't appropriate for everything. The other thing is to measure cost per deploy and drive that down. So you want to figure out: what does it actually cost you to deploy, in terms of dollars and people and compute time or whatever? How many meetings, how many tickets were filed per deploy? You should have at least one ticket to track that the deploy happened, with lots of automatic updates appended to it, but you don't need a ticket for lots of permissions and things all the way through. And then zero meetings per deploy, because if one developer is deploying a hundred times a day, they don't have time to have meetings with people about all of those things they're doing. That's part of that process. So, a few good books here on hypothesis-driven development. There's lots of good discussion of this in the Lean Enterprise book. Unlearn is actually a new book by one of the authors there, Barry O'Reilly, about how to unlearn all your old habits; he brought it out just a few months ago. The theoretical basis for doing all of this comes from Don Reinertsen, who says a lot in there about queueing theory and the idea that you should make your units of work roughly the same size, because you'll get better flow. If you have a big project mixed in with small ones, it will crowd everything out; those are the kinds of issues he talks about. Then there's Accelerate by Nicole Forsgren and Jez Humble — Jez is on most of these books, lots of the same names coming up. It's survey data across lots of organizations showing that low-latency time to value works: you get better outcomes, you have better growth in the company, all kinds of things get better. So let's dig into this book. In some of the data in this book, the fast companies are 440 times faster than the slow companies in that time to value.
What that means is that some people are doing things in months and other people are taking hours. And this isn't big companies versus small companies. The fast companies include large, small, public sector, commercial — all types of organizations are fast. And in those same market segments, there are organizations that are slow. So it's much more about the way you think about organizing yourself than the market you're in or whether you're a startup or not. However, since this book came out, they did an update. This book was based on the 2017 data. The 2018 data came out last year, and now there's another group of companies doing stuff in minutes instead of hours. This is basically cluster analysis, so there is a significant number of elite organizations doing stuff in minutes that other people are doing in months: seven times lower chance of failing in production, and 2,555 times faster lead time from commit to deploy. So what we want to do is learn to do simple things quickly — this unblocks innovation across the whole organization — and avoid the complex, one-size-fits-all processes, because by the time you finish defining what that process is, you could have done a bunch of stuff, and it's usually obsolete by the time you've over-defined it. So one of the traps I see is central architecture teams over-defining what the architecture and the process should be, when you should just get out of the way and concentrate on a few things like security, availability, and scalability, and not try to over-define the way the system should be built. So the best IT architecture today is minimalist, messy, and actually inconsistent, because otherwise it wouldn't be learning, it wouldn't be evolving. It has guardrails for security, scalability, and availability, it's designed to evolve really rapidly and explore new technologies, and it supports low-latency continuous delivery.
So that's the best IT architecture, and some people have gotten really good at this; other people are working towards it. All right, so that's time to value. I spent quite a bit of time on that, so I'll get through the rest a little bit quicker, I think. Let's talk about the next step. Once you've learned to do simple things quickly and repeatably, the next thing is the typical cloud native architecture for greenfield, large-scale deployed systems. So let's say you want to build a Netflix-style backend, or a mobile backend, or a distributed network system that never ran in your data center. Typically, this is the cloud-first, cloud native architecture. The cloud gives you the ability to distribute globally, so you can put machines wherever they need to be. You get cost-optimized, high utilization. You autoscale everything so that you basically turn things off when there's nothing to do, and use a cloud native architecture. So, some principles for cloud native. You pay as you go, and you pay a month after you've used the resources, rather than paying upfront as in a data center. It's self-service: there's no waiting, it's API-driven. It's globally distributed by default: you're using cross-zone and cross-region availability models just as the way you naturally set the thing up. Extremely high utilization: your utilization should be many times higher, not just a little bit higher. In your data center, it's pretty hard to get much more than 10% busy on average across the systems, because you need the capacity for peak, and maybe at the weekends or at midnight there's nothing going on in them. These architectures just scale everything down — scale to zero if there's nothing happening — and use immutable code deployments. So you're deploying alongside the previous system; you're doing blue-green pushes, basically. Let's look at how we should do this.
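The blue-green, immutable-deployment idea just mentioned — stand the new version up alongside the old one, health-check it, and only then flip traffic — can be sketched as a toy model. The class and method names here are illustrative, not any particular tool's API.

```python
class BlueGreenRouter:
    """Traffic points at exactly one environment. A new version is
    deployed alongside the live one and only takes traffic after its
    health check passes; the old environment stays around for instant
    rollback."""

    def __init__(self, live_version, health_check):
        self.live = live_version
        self.standby = None
        self.health_check = health_check  # callable: version -> bool

    def deploy(self, new_version):
        # Immutable deployment: never patch the live environment in
        # place -- bring up a fresh copy next to it.
        self.standby = new_version
        if self.health_check(new_version):
            # Flip: new version goes live, old one becomes standby.
            self.live, self.standby = new_version, self.live
            return "switched"
        # Health check failed: live traffic was never interrupted.
        return "rolled-back"
```

The point of the pattern is in the last branch: a bad push never touches the environment that is serving traffic, so "rollback" is simply not flipping.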
And there's a choice now between containers and serverless, or both, to get this done. So I'm going to start with a little analogy here. By the way, I'd like to say I do not do my own slides, as you can probably tell. I failed art in high school, so I reached out to this company and got a nice relationship going where I send them money and they send me slides that look like this. The company is called Silver Fox Productions — happy to give referrals to them. I said what I wanted was a picture of a Lego spaceship, and they did this whole deck for me. So the user need is: we want to build a toy spaceship — sort of a Millennium Falcon or something, if you squint at it; it's not too close to be infringing anything. The problem we're trying to solve is to make a model spaceship quickly and cheaply. The way we would do it traditionally is: we design the prototype; we carve it from modeling clay to get the shape we wanted; we make molds from the carved model — this is the way most plastic toys are made; we produce these injection-molded parts; we assemble all those parts with some robot-looking thing; and then we sell the finished toy. The problem is that this takes potentially months to do. Maybe with a bit of 3D printing in the middle — that's the way you'd do it nowadays — but it's still a step-by-step approach, and this is like the way we do traditional development. The other approach we have is very rapid development: you have a big bag of Lego bricks, you have some instructions, and then in a few hours — and this is my favorite animation, I don't even know how you do that in PowerPoint — you have this finished toy, which looks vaguely like the thing that you wanted, but it's made out of Lego bricks. And you just put it together. Now if you look at it, it lacks fine detail; it's a bit lumpy.
You can see what it is, but it wasn't exactly what you wanted. It's really easy to modify and extend, though, so you can add little pieces to it. And then you can optimize it. So let's say I want a point — Lego bricks aren't very pointy, and I want a pointy bit here — so I can just build a custom Lego brick. If you look at what Lego does, they keep throwing in custom bricks that look like nose cones and stuff, or wheels, because it's really hard to make a wheel out of Lego bricks; they're not round enough. So you form these new things. You get the analogy here: you're now building a more specialized component that you can use and reuse in other places. All right, so what are we talking about here? Traditional versus rapid. Instead of full custom design, we go to building bricks. Instead of months of work, you can do this in hours. Custom components you have to debug — they're hard to build, and it takes a while to get them working — whereas standard components are reliable and scalable; everything basically just works. You get too many detailed choices if you're custom building, but if you're working with Lego bricks, your choices are fixed. You just go faster, because there's much less argument and discussion — there's only one way to build it. That cuts your decision cycles down, and the constraints in an architecture are one of the ways you move much faster. So, finally getting to the point here: containers and serverless. If you're building something with containers — custom code and services — you can spend a long time just trying to figure out which container framework you want. With serverless, you just get on with it and start building; the choices are much more standardized. What I'm seeing people do is build really rapidly with serverless, and then optimize. So you rapidly throw something together, then you say, "You know what?
This bit here, I need better startup latency, or it needs to be a bit more efficient, or I'm going to do some long-running jobs, so we'll do that bit with containers." That's the combination. So I'd say serverless is great for rapid prototyping, and then you go build your thing later on — optimize it for large-scale production if you need to. So that's the at-scale piece. We're going to wrap up talking about the most strategic, critical workloads as we get into data center replacement. We're starting to see cloud migrations for the most business-critical, safety-critical workloads. There are multiple airlines going all in on AWS — the entire airline doesn't want to have any data centers. There are several banks doing this, and a bunch of industrial automation companies doing this. So we're working on things that are safety- and business-critical workloads, and we're figuring out the patterns and solutions across all these different industries. So here's a question: do you have a backup data center? Hopefully you are all still awake out there. Who has a backup data center? Yeah, a few of you. Okay. How often do you fail over to it? People start looking worried at this point and say, "Not often enough." And then, how often do you fail over the whole data center at once? It turns out, if you're in finance, the answer had better be once a year, because the auditors come and make you do it. And everyone gets bent out of shape for a few weeks preparing for that. So that's normal. And availability theater: you invested in a backup data center, but you didn't really get any real value out of it. So this is the nice fairy tale we have: once upon a time, in theory, if everything works perfectly, we have a plan to survive the disasters we thought of in advance. It'll keep you up at night, basically; it's not a good fairy tale for a bedtime story. So here's a few things.
There was the SaaS vendor that forgot to renew their domain name. They were down for a couple of days. The only thing they had left was Twitter, and the CEO spent about two days apologizing on Twitter, because all the email was down — they used one domain for all their internal email, their product, and everything. Think about how disastrous that would be: how would that apply to wherever you work, if whatever-your-work's-name.com disappeared? There are a few things you can do about that, like making sure your internal email isn't on the same domain as your product, or having multiple DNS suppliers. I would practice flipping around with DNS, or make really, really sure you remember to renew your domains. So this one, everyone's seen. Only occasionally do people forget to renew the domain name, but if everyone in this room hasn't suffered from a security certificate expiring and taking something out, I'd be really surprised, because it's happened — this is universal. We have really detailed systems at AWS for registering certificates and tracking them and making sure they all get renewed at the right time, and all this stuff. We built something like that at Netflix as well. It's a lot of work, and it's really annoying when they go wrong. Recently, I think Ericsson shipped something to the UK cell phone system with a security certificate that timed out, and I think one of the cell phone operators in the UK went offline for multiple hours — totally offline, a countrywide outage. And there's several hundred million dollars' worth of lawsuit going on right now from that certificate timeout. That's probably the worst case I can think of right now. It turns out computers don't work underwater, and when the water goes away and they're full of mud and bits of seaweed, they also don't work very well.
A friend of mine spent several months rebuilding a data center that was in a basement in Jersey City, and it was not a happy place. So these things can happen — hopefully not to you tomorrow, but they could happen. How do we deal with this? Really, the key thing is to have fast detection and response, because you can't really figure out in advance what will happen. I mean, you can work through the obvious things, but once you've worked out how to fix all of those, the type of outage you get will be something you've never seen before. Chris Pinkham was the engineering manager for the first version of EC2, and I was on a panel with him years ago when he came up with this phrase. I just wrote it down, and I've been quoting him ever since. Here's the problem.
There's a really good paper by Peter Bailis and Kyle Kingsbury called "The Network is Reliable" — which of course it isn't; that's the first fallacy of distributed computing. It has a long, long list of the ways that things don't work. This book's really interesting. It says that even if you do everything right and nobody makes any bad decisions at any point in the chain of events, you can still have a catastrophic outage. And there's a bunch of examples in it of planes crashing and people dying in hospitals. I tell people not to read the book on a plane. Chapter two of it is about a plane crash, and I did a talk at GOTO Chicago last year — you can go find that talk if you want to see me doing a complete end-to-end discussion of a plane crashing. It's actually fairly traumatic. But there's a lot you can learn from the airline industry about how they share: every time there's a failure on a plane, it's shared with every other plane, from every manufacturer and every operator — there's a central registry where all failures are reported and shared. That's a really powerful concept, and we need to start thinking about doing that kind of thing in IT as well. Release It!: the first edition of this book was really critical; it was one of the books we read at Netflix a decade ago. We dived in and built a whole bunch of things from it — if you have a circuit breaker, that's a design pattern that came out of the first edition of this book. Michael upgraded the book a couple of years ago, and there are lots more good ideas in there. Resilience: in the past we had disaster recovery, now we have chaos engineering, and in the future we're just going to have some kind of resilient critical systems. That's what we want to get to. So, chaos engineering. The history of this probably goes back to Jesse Robbins, who was later one of the founders of Chef. While he was at Amazon, he was called the Master of Disaster.
He used to go around pulling plugs in data centers and making sure that the Amazon website didn't go down — this was just before AWS. The other thing is, he's a part-time fireman — that's his spare-time job — so he deals with this emergency-services idea. And he once simulated what would happen if a data center caught fire. He didn't just pull the plug; he said, "I'm going to shut it down in a sequence that simulates a fire in this rack here: how would it spread, and in what order would various things fail?" He simulated that once. And then, if you think about a data center flooding, you discover whether you have cabling under the floor or above the rack. Let's just bring the water up: okay, which bits of the data center are going to fail first? If I have my power distribution under the floor, that's probably not going to work out so well. But if it's above the rack, things are actually going to work for a little while before it fries itself. Then 2010 — Greg was our ... I think the first implementation of Chaos Monkey was the first thing we ever shipped to the cloud, before any applications went in there. The Chaos Monkey got there first. And if you're trying to do this sort of chaos engineering, always deliver the chaos processes before you put the applications in, because otherwise no one will ever let you do it afterwards — they'll say, "You'll just break everything." If you have to survive the monkeys to get into production, that's the right way around to do it. We open sourced it in 2012. Gremlin, who are out here — I think Kolton's on later — were founded in 2016. The chaos engineering book came out in 2017, and I wrote the first version of this slide at the chaos conference last year, where I was starting to see more startups. This is becoming an interesting area. So the way I think about chaos architecture is four layers, two teams, and an attitude. This is the book that describes all the different things. The layers start with the people at the top.
So you want to make sure that they don't mess things up or break things that are working. The application has to be able to survive seeing something it wasn't expecting. You have to be able to switch between things, and that switching is probably the least well tested part of your system. If you think about a disaster recovery failover to a backup data center, that is a switch which you didn't test; every time you try to do it in anger, it will probably break horribly. The least well tested part of your system is the thing you're relying on to make you highly available — that just sounds wrong to me. On infrastructure, you've got to have more than one way to do everything. So a chaos engineering team will use a bunch of tools to work through these layers. Game days are the way you exercise people. If you're trying to get into this and you do nothing else — nothing on the tooling side — at least have a game day, so that people know what to do if something breaks. That's the first place I'd start. Then you can start seeing, "Okay, here's the tooling that we can make slightly better; here's the way we can improve it." On the security side, it's actually very similar. The red team idea for security is that there's a team trying to break into your systems and a team trying to defend, and that approach means that you're testing everything as you go forward. The problem with both of these is that you're only as strong as your weakest link. You've got to have a dedicated team trying to find that weak link, in order to make sure that you're really going to survive. The other thing about failures is that they're a systems problem: a lack of safety margin is the problem. If you see people saying the root cause was a component error or human error, that means the system wasn't designed to make that individual error go away or to cope with it.
If you have enough safety margin, human error won't take out the system, and a component failure won't take out the system. This is the core idea in the new world of thinking about safety and resilience. What we want is experienced staff who know how to be on call, know how to fix things, know how the tooling works, and know how to get into all the different dashboards they need to log into. You've got robust applications that are at least battle-tested enough that they're not going to keel over completely at the first sign of something strange; a dependable switching fabric that you exercise regularly, maybe every few weeks; and a redundant service foundation with whatever multiple ways to get things done you need — networks, storage, regions, and so on. Cloud gives you the automation that leads to chaos engineering. That was why we did it, why it came up at Netflix: now we could actually automate our disaster recovery plans and build the whole thing using APIs. The thing is, if you're trying to fail over between two data centers, the problem is they aren't the same — they drift. And when you fail over, you find that there's an out-of-date version here, or the configuration wasn't exactly the same, or a limit there wasn't the same, and it doesn't work. Whereas when you're API-driven, you can run scripts that make sure everything is the same in both places, and the services are the same in both places. So what I'm seeing is that failure mitigation is going to move from this scary annual experience to automated, continuously tested resilience, and we'll be running that all the time. As we do that, we're going to be automating what we do, we're going to need a lot more ability to control things in production, and there's just a lot more versioning going on.
So that is, I think, why things like LaunchDarkly and feature flags are really important for how you tie all this together and how you get on top of it. Because there are just a lot more features, a lot more stuff to do, and the state of the system is much more complex, but the outcome is much better. That's what we're all going to get to. All right, so I'm done. I've got a minute or so for questions, if anyone wants to ask one. Audience Member: [inaudible] Adrian C: Yeah, there are versions. I usually put my slides on GitHub — adrianco/slides on GitHub. I have this idea that maybe one day Microsoft, since they own GitHub now, will make me do pull requests on my PowerPoint decks, but they haven't built that yet. Yeah, I'll put them up there. There are also similar versions of things around. Any other questions? Yeah. Audience Member: [inaudible] Adrian C: Yeah. So the question is about building real-time or critical systems, like Boeing, for example. Probably the best example right now is Tesla. They do updates roughly once a month, something like that. They have a fleet of test cars they're deploying to all the time, and they have feature flags in their code to turn these things on and off in updates. So I think you still want to do lots of small incremental changes, because each change is going to be much more separable, and your ability to debug it is much faster. I went to visit SpaceX one time — I got a little tour — and they had this huge bench with the guts of a rocket laid out on it, all of the electronics for a rocket, and they have all the different versions of the rocket on different benches. And then they boot the software on that and test on that. So they're doing continuous delivery into a hardware simulator. I think that's the approach I'd take for something like an aircraft-critical system.
But you're not delivering once every six months; you're building the test-driven development framework so that you're running continuously with very rapid updates, because that's just going to let you go faster at building things. I think that's one of the interesting things. If you look at the way Tesla is constructed as a company versus other car manufacturers, they're on something like a four-year release cycle for new product, and Tesla's on sort of a monthly one, where they continuously modify the cars and make them slightly better. And that's because they go direct to consumer — they don't go through resellers — and actually the whole structure of the company is quite interesting. The way they're able to operate is completely different from the way the other car manufacturers operate. It's a good example of an industrial continuous delivery operation, although there's a whole bunch of other issues. But it's fascinating to watch. Yeah. Audience Member: As a developer, [inaudible] Adrian C: So the question was, how do you push this up through management? Hopefully management is trying to push down to make you do something more strategic and get things to go faster. I'm seeing a lot of enterprises say, "We need to go faster." The business needs stuff delivered, and they've been blocked by IT — blocked by the speed of innovation in the organization. So there's a lot of top-down incentive to move faster. And then usually the engineers are coming up from below saying, "We'd like the various ways we could go fast." And there's a layer of management in the middle that's getting in the way, and it's quite often project managers. If you think about moving from project to product, when you do that, you don't need project managers anymore, so they see their jobs going away.
So hopefully there are not too many project managers in the audience here, but if you're a project manager, you should probably be trying to find a different job, because we're going to decimate them — after you've gone through the transition, you need maybe 10% of what you needed before. Because if you're not running projects that take a year, and you're doing continuous delivery instead, you need a whole different skill set around it. The Project to Product book talks about that in quite a lot of detail, and there are some examples of companies that have gone through that transition. But there's a corollary here: the people who see their jobs going away will resist, so you have to manage that movement carefully. By the way, if anyone wants me to come by and chat — I spend all my time talking to customers and talking to management, trying to persuade people to speed things up — I'm happy to come and give this deck to groups internally. I did a pre-run of this deck at a bank in London last week; that's the kind of thing I do. I'm happy to be here, and I'll be hanging around the rest of the day to carry on the conversation. So thank you.