Adrian Cockcroft, VP of Cloud Architecture Strategy, AWS

Keynote: Measuring Your Trajectory

As the pace of technology increases there is a strategic pressure to be more innovative more quickly. However, many established enterprises find it hard to "get out of their own way" and unlock the innovation that is already present within their organizations. What are the most fundamental metrics that support innovation? We’ll look at how a short time-to-value drives many other desirable outcomes, making change safer, more incremental and less wasteful. The sheer number of changes requires new tooling for automation and management of features and hypothesis tests at scale.

Adrian Cockcroft

Adrian Cockcroft has had a long career working at the leading edge of technology, and is fascinated by what happens next. In his role at AWS, Cockcroft is focused on the needs of cloud native and "all-in" customers, and leads the AWS open source community development team. Prior to AWS, Cockcroft worked with organizations like Sun Microsystems, Netflix and eBay. He has been recognized as a Distinguished Engineer and was one of the founding members of eBay’s research labs; while at Netflix, he began speaking regularly at conferences, promoting ideas around DevOps, microservices, cloud and containers. Cockcroft holds a degree in Applied Physics from The City University, London and is a published author of four books, notably Sun Performance and Tuning (Prentice Hall, 1998).

Adrian Cockcroft: Hi there. Good to see everybody. It's awesome to see this event and see everybody coming up for it. I think I first met the LaunchDarkly team when there were three or four of them at Heavybit, when I went over to record a little video there. So it's been great to see the growth of the company over the years, and I'm very happy that I was asked to give a talk here. So, talking about trajectory: the idea of measuring a trajectory is to be able to figure out, what are we going from and what are we going to? And what's stopping us from getting there? And all the different pieces we have in between. So that's going to be my talk. I'm not going to talk any more about Formula One motor racing, but Amazon started sponsoring Formula One and we've got a whole bunch of things there. So this idea of innovating at speed and trying to do everything under really high pressure is something that I've been working on recently, working with a lot of customers around the world. But let's look at where we're coming from. If we look at old world IT, say a decade or two ago, the scope of what IT was doing was tracking employees at work, giving them PCs on their desks maybe, factories and supply chain, a little bit of automation there. Sales channels were physical. You sent something to a shop or you went to a bank branch or you went to a travel agency to get something done. And marketing analytics were primarily TV and print based. You put something on TV and then you asked Nielsen if anyone watched it, then you could figure out how much you were going to get charged for that advert. So that's the old world. And what happened was the internet happened, and we have this phrase digital disruption or digital transformation, but this is what happened. Employees at work: now you build mobile applications to make them productive, and they're using them all the time. People are much more connected. The factory and supply chain is now much more integrated.
It's continuous. Everybody is tracking things in fine detail. You've got just-in-time production running. Sales and delivery is now online. You're directly connected to it, and you have IoT connected things. So let's just pick a simple example. Let's say you are a company that made door locks for the last few decades. They were lumps of metal. You put them in a box. You sent them to Home Depot, and if that box ever came back, it's because something went wrong. It was send and forget. You just wanted to get paid for it. Now, if you're a manufacturer of door locks, your door locks are calling you every five minutes. And if you don't hear from them every five minutes, it's a problem, because it means the battery went flat or the customer's ripped the door lock off and put on one of your competitor's door locks. So we've gone from a world where it's just send and forget, to continuously tracking everything you've ever made for the end of time, until they go away. And then marketing's now online as well. So if somebody says something bad about you on Twitter, you've got to respond immediately. You can have an out of control social media news thing blow up in minutes, and you've got to be tracking this continuously. So the scope of IT used to be these relatively small things: number of employees, number of factories, number of sales outlets. Now it's every customer you've ever touched, everybody that's on Twitter or on Facebook, and everything you've ever made. That's why I think there's this transformation, and the trajectory everybody's on is between these two worlds as they're trying to deal with it. So what that means is you now get to do personalization for everything, and that was the reason I joined Netflix in 2007, because I knew they were doing personalization and I wanted to learn more about it. The first job I had at Netflix was working on the personalization algorithms, managing one of those teams before we moved to cloud.
And in this customer analytics, you're gathering all of this information about your customers that you didn't use to be able to get because you were indirect. Think of the Netflix example again. It used to be you made a piece of content, you sold it to a cable TV company, and they put it on at 9:00 PM and nobody knew who actually watched it. Now Netflix knows exactly who was watching everything. They've got a full feed of everything and they have new channels that go directly to the customer. So this is pervasive across all industries. We're seeing it in automotive now with Tesla being directly connected, doing continuous updates to their cars. Other car manufacturers trying to figure out how to do that too. What we're seeing is many more things, much more scale, and really rapid change. So it used to be you could issue a new version of your software maybe every six months to a year and that was fine. Now you've got to be in continuous delivery. You've got to be delivering new value to your customers every day to make their life slightly better because one of your competitors is doing that, or maybe there's a start up in your field that's doing that and you can't keep up anymore. So what I've been doing with AWS is helping unblock this innovation. Lots of enterprise customers going through this. So I'll talk a bit about the things that I see blocking a lot of enterprise customers and then the stages they go through. So here's the blockers. We've got culture, skills, organization, and finance. So let's just go through these quickly. On the culture side, it's really about leadership systems and feedback. If you've got centralized slow decision making, that really slows everything down. The job of the central management should be to hire the best people they can find to get out of their way, and then check that they're making good decisions. So it's a trust but verify model. 
And if you don't have trust, you have inflexible policies and processes, and you're going to be slow at innovating on anything. There's a couple of good books here. If you're trying to migrate to cloud, Stephen Orban wrote this whole series of blog posts that got rolled up into a book about how to migrate to cloud. And then Mark Schwartz, A Seat at the Table: how to be an agile CIO. He's got a new book coming out called War and Peace and IT. I'm one of those people that gets access to books before they're published because they want me to say something nice to put on the back. So I have read that book and it's really interesting. Lots of Napoleonic analogies. And that book is written for the people who aren't in IT, about how to make IT work for them. It's the counterpoint. You're looking at it from the point of view of an agile CIO, then looking at it from the point of view of an agile organization. What things should you be doing to leverage IT and leverage technology? So Mark Schwartz, if you don't know him, is a really entertaining speaker. Go find his videos on YouTube or whatever. He was Department of Homeland Security, Immigration Naturalization Service CIO, which sounds like the most fuddy duddy boring thing you could possibly have. When I first saw him at a conference, everybody was just amazed at the story he had. He turned an entire year long waterfall process into something where they were releasing code continuously in an extremely agile way, running chaos monkeys, doing the whole thing. So it's an amazing story. The system he was working on wasn't needed after the Obama administration moved on, so he was looking for a job and we were lucky enough to hire him at AWS. So I was happy about that. So that's leadership systems. Let's look at skills. How do you deal with training people up to deal with this new world? To go into cloud? To get into all these things? Basically, you can go try and hire people, but mostly you end up training the people you've got.
You fund pathfinder teams. But if you don't figure out how to incent and retain people, then you'll train people up and then they'll leave, because they now have higher market value than people who aren't trained in the latest technology. So quite often I see this issue, people just blindly training everybody and wondering why they keep leaving. So have some way to incent them up front. If you're a startup, this is a great book. If you're an enterprise, it will just make your heads explode or make your HR teams' heads explode. This is the story of the Netflix culture from the inside. Patty McCord was the chief talent officer. She basically ran HR as a recruiting oriented organization rather than a human resources one. So it was very much organized around recruiting and finding people and getting out of their way and making really powerful employees. So really useful book. I've got lots of book recommendations in my deck here. Next thing: organization. Move from project teams to product teams. That gives you long term product ownership. It means you're iterating. It means you're making the product slightly better every single day, rather than forming a team, having them work on something for six months, delivering it to operations, and then running away from the project and working on something different. So there's a very different feeling when you are continuously owning something and you're on the hook to make it slightly better all the time. This also goes back to the DevOps idea of run what you wrote. Someone did an ACM article in about 2006 where he talked about this. We all read it at Netflix. We figured out that meant we put all the developers on call, and we taught them to run what used to be a JAR file that they would deliver to QA to be integrated into the monolith. We told them to make that into a service and put an API on it, and that became what got called microservices.
It turns out that if you put developers on call, they write really reliable code and automate everything, and you can teach developers to operate. The ones that didn't figure it out were made to go away or put somewhere where they couldn't cause any damage. But that's approaching DevOps from the developer end of the discussion, rather than approaching it from the operations end, where operations people learned to code and build much more automation. So DevOps met in the middle with these two approaches. And you still see some bifurcation in the DevOps movement between these two approaches. Did you start with development or did you start with operations? The other thing about this is it removes tech debt and lock in, because if you have a continuous improvement team that owns something and you want to switch out a database vendor, it's just another thing in the burn-down list. It's something else you're going to do in the next sprint or whatever you want to call it. And it's a series of micro dependency management changes. Rather than being, well, there's no development team on this thing because it got delivered to operations and they're running the results of this project, and they were running ... You have to form a new project to change anything. That's when you're locked in. So think of handing over from development to operations on a project basis as the point where the key is turned in the lock. That is where lock in happens. What you want to do is have continuous delivery, continuous improvement, and obviously there's a lot of need for managing versions and managing features and the whole feature flag thing. I'll get to that a little bit later. Couple of good books here. Project to Product by Mik Kersten came out about six months ago. It's a good book on the subject. And then The DevOps Handbook has lots of great details on how to do this. Both are from IT Revolution, which has published a lot of good books recently.
Then finance, which surprises some people. When you get into a really deep transformation, and I'm talking about large enterprises, they get surprised when they start shifting a large amount of their capacity to cloud and all of a sudden the CFO gets involved and goes, "Hang on a minute. What happened to all my CapEx? What's all this OpEx happening?" And it also happens when you go to a DevOps model, because you typically capitalize development work and expense operations, and when you blend them into one organization as a DevOps org, that messes things up. So anyway, this is something that enterprises typically run into when they're quite a long way into their cloud transformation, and it can put the brakes on the transition. We've seen this with some of our larger customers going through this transition. Something to watch for. So let's talk about the pathway. How do we get there? Those are the blockers. If we can get through those, what do we want to do? So the first thing is we're going to try and go fast, concentrating on time to value. Then we're going to look at scale: cloud native, distributed, optimized capacity. And then finally, we're getting more people doing strategic migrations, where they are just exiting data centers and moving mainframe workloads and things like that to the cloud. Data center replacement. You don't have time to rewrite it because the building's being shut down in a year and you're on a deadline, that kind of thing. Let's get into ... This is actually a bunch of brand new slides I haven't presented before. Most of the slides I've used up to now are things I've done elsewhere, but this is the new bit I did just for Trajectory. So what are the fundamental metrics for innovation? How do you know you're going quickly? And one of the things is that you don't add innovation to an organization. You get out of its way. If you see someone hiring a VP of innovation, they're probably doing it wrong.
Innovation theater was a phrase I saw somebody use yesterday. You're going through all the trappings and movements, but you're just doing theater. You're not really doing it. So what really matters is time to value. You do some work, and how long does it take until that touches a customer? Let's just take that as the flow that everything goes through. And for a lot of organizations, months is the answer. That's pretty typical. If you've got a monolith or you've got an old style organization, every few months there's a release, and the work from hundreds of developers and their changes gets bundled into that release, and you have a little problem making it work, and then you stick that out. And then as people got more agile, they started doing things in days, which is pretty good. That can be reasonably competitive, although we're now seeing a whole lot of people just running continuous delivery, where you change something and minutes later it can be touching the customer. You can fix things in minutes. And that's state of the art now, and there's a considerable number of organizations running in that way. And the key thing here is that there's no economy of scale in software. Normally, when you bundle things together, you get some economies of scale, you're batching. That actually makes it better. The problem with software changes is that if you bundle them together it actually makes it worse, because then the interactions become harder to untangle. It's harder to debug. It's harder to figure out and you're more likely to break things. So it turns out smaller changes are better, but that leaves us in a world where we want lots of small changes, and we want to do that very quickly. So we need an automated continuous delivery pipeline. We need tagging, feature flags, and feature tests, and I need to change that central logo to be a LaunchDarkly logo. But we didn't have time this morning. But next time I edit this, I'll put that in there. Make Edith happy.
And then you need rapid, cheap builds. And what's the problem? So if you're running a Java monolith and you're trying to go to microservices, it might take hours just to do a build. I was talking to someone recently who said, "It takes 11 hours to build our monolith." It's Java, and the system grinds away for 11 hours and then the thing spits out and then you get to test it. Another company I was working with said, "We have a pretty large complex system written in Go, and it still builds in less than a second." You do, "go build," and it just returns. It's done. And it's ridiculously quick. Each microservice takes seconds to build and then you can deploy it. So some of this is the choice of the language and some of it is the choice of the size of the thing. So if you build small enough Java objects, then probably you can get your build time to be fairly quick. And if you built a really huge system, maybe even in Go it would take a little while. But Go is really optimized for rapid build times. And there are other languages that are ... Obviously if you're writing stuff in Python or JavaScript or Node or whatever, you're not ... There isn't so much of a build process in there. Maybe there's some more packaging. But this is the thing. You want to get out of the old slow way of doing things and build something that's fast, because if you change one small thing at a time, it's easier to tell if it broke and what broke, and it's easier to roll back to the previous version because you made one small change. Oops, roll it back. And if you're running side by side, you just stop sending traffic to it because the previous version is still running. And it's much easier to measure time to value, to have a precise measure, because here's the small change: I do one check in, and I have one deploy. So if there's a one to one mapping, you've got a very precise measurement of time to value.
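As an illustration of the one-check-in, one-deploy measurement described above, here is a minimal Python sketch. The record layout, field names and timestamps are assumptions made up for this example; nothing in the talk specifies how the data is stored or which tool produces it.

```python
from datetime import datetime, timedelta
from statistics import median

# Hypothetical records: each deploy ships exactly one commit.
# Field names and timestamps are invented for illustration.
deploys = [
    {"commit_at": datetime(2019, 4, 1, 9, 0), "deployed_at": datetime(2019, 4, 1, 9, 12)},
    {"commit_at": datetime(2019, 4, 1, 10, 30), "deployed_at": datetime(2019, 4, 1, 10, 38)},
    {"commit_at": datetime(2019, 4, 1, 14, 5), "deployed_at": datetime(2019, 4, 1, 14, 25)},
]


def time_to_value(records):
    """Return the commit-to-deploy duration for each record."""
    return [r["deployed_at"] - r["commit_at"] for r in records]


durations = time_to_value(deploys)
median_minutes = median(d.total_seconds() for d in durations) / 60
print(f"median time to value: {median_minutes:.0f} minutes")
```

Because every deploy carries a single commit, each duration is unambiguous. With a batched release there would be many commit times for one deploy time, and no single number would describe the release.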
Whereas if you look at, say, a six month release with hundreds of check-ins in it ... I was talking to another customer, and they said they had 600 tickets per release every six months. So for each of those tickets, there's the time I did the work versus when the actual value arrived, which was when that final release came out, right? So it's a much blurrier idea of what your time to value is. So what we're trying to do here is decouple new code from new features, to incrementally change the system with many small, safe updates. And if you were to really go to the extreme here, Kent Beck has been playing around with an idea where you type a line of code, you hit save, and it runs tests on it. And if that code fails, it actually deletes the code from your editor, and you have to type it from scratch again. This is to force you to type ... It's amazing, because you start doing one line at a time commits, right? And the system has to pass tests, pass your test driven development environment, all the time. So it's called TCR, I think, Test, Commit, Revert or something like that. He wrote a paper on it last year. Really interesting concept, but what you end up with is a system that is incrementally always going to be working. And it forces people into an incredibly lightweight update model. So you want to be able to turn on features for tests and then, when it works, for everybody. One of the things, even when I joined Netflix in 2007, way before all of this cloud and microservices stuff, every single thing they did was an A/B test or behind a feature flag, every single thing. It was like you couldn't get a feature out unless you proved that you had two test cells, with non-overlapping confidence intervals, statistically valid, this is better than that. And they would do it over and over again for just about everything. We were sort of tuning personalization algorithms with this and changing the shape and layout of the home page, and all of those things were done in that way.
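To make the "two test cells with non-overlapping confidence intervals" check concrete, here is a small Python sketch. It assumes a conversion-style metric and uses the textbook normal-approximation interval for a proportion; the talk does not say which statistic or interval Netflix actually used, and the cell sizes and counts below are invented.

```python
import math


def proportion_ci(successes, trials, z=1.96):
    """Normal-approximation confidence interval for a proportion (z=1.96 is about 95%)."""
    p = successes / trials
    half_width = z * math.sqrt(p * (1 - p) / trials)
    return (p - half_width, p + half_width)


def intervals_overlap(a, b):
    """True if two (low, high) intervals share any point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical test cells: members who converted out of members in the cell.
control = proportion_ci(successes=4_000, trials=10_000)
variant = proportion_ci(successes=4_300, trials=10_000)

if intervals_overlap(control, variant):
    print("inconclusive: the cells cannot be told apart yet")
else:
    print("intervals do not overlap: the cells differ")
```

Requiring non-overlapping intervals is a deliberately conservative rule: two cells can differ significantly under a direct two-sample test while their individual intervals still overlap slightly, so this check errs on the side of not shipping.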
So it's an incredibly powerful way, and they've got very good at ... The product manager's job was to figure out a new idea, something worth testing. And the engineering manager's job was to figure out what's the fastest way you can actually test that to see if it makes the world better. And the other thing is, there's a bunch of research around this. If you come up with 100 things that you want to change, it turns out a third of them will make the product better, a third of them will make no difference, and a third will make it worse. And you can't guess which third is which in advance. Some of the biggest improvements came from things that we almost didn't try because almost everyone was convinced it wasn't going to work. One person thought it was worth trying, and it turned out it was a huge improvement. And the one that everyone thought was going to work turned out to make it worse. So you have to run all 100 tests. You have to try things. If more than half of the things you're trying aren't really speculative, you will miss the things that really improve the product. So it's a really interesting idea, and obviously there's massive use for A/B test flags and feature flags in driving that forward. So small changes give you less risk, faster problem detection, faster repair, and less work in progress, because you're always rebasing back to the mainstream and you don't have work sitting out there on a branch waiting to be put back in. Much less time merging changes, because the system hasn't changed very much while you did your work. Much happier developers, because they change the world tomorrow. Every time I change something, the world is better tomorrow and I can go home and see this thing be better. And just much faster flow. So how do we get there? The key thing, I think, is to measure time to value for every team everywhere in your system. Maybe you have Jira or something where you can measure the tickets from commit to deploy.
Whatever you have, try to figure out for every team, what are you actually doing and how long does it take? And then you can kind of incent people, maybe with a little bit of social gamification or something. You put up a leaderboard and try to make everyone get slightly faster over time. Learn to do lots of small things quickly. So don't get bogged down speeding up absolutely everything that's going on. Because what we need to do is have a fast path for doing small things, and maybe there's a more complex process for doing bigger, more critical things. Don't try and use one process to do everything or one architecture to do everything, because you'll get locked into a sort of heavyweight process that isn't appropriate for everything. The other thing is to measure cost per deploy and drive that down. So you want to figure out what it actually costs you to deploy, in terms of dollars and people and compute time or whatever. How many meetings, how many tickets were filed per deploy? You should have at least one ticket to track that it happened, with lots of automatic appending going on to it, but you don't need to have a ticket for lots of permissions and things all the way through. And then zero meetings per deploy, because if one developer is deploying a hundred times a day, they don't have time to have meetings with people about all of those things they're doing. That's part of that process. So a few good books here. Hypothesis-driven development: there's lots of good discussion of this in the Lean Enterprise book. Unlearn is actually a new book by one of the authors there, about how to unlearn all your old habits. Barry O'Reilly brought this out just a few months ago. The theoretical basis for doing all of this comes from Don Reinertsen. There's a lot in here about queuing theory and the idea that you should make your units of work roughly the same size because you'll get better flow.
If you have a big project mixed in with small ones it will crowd everything out. There's those kinds of issues he talks about. Then there's Accelerate by Nicole Forsgren and Jez Humble. Jez is on most of these books. Lots of the same names coming up. It has the survey data across lots of organizations showing that low latency time to value works. You get better outcomes, you have better growth in the company, all kinds of things get better. So let's dig into this book. In some of the data in this book, the fast companies are 440 times faster than the slow companies in that time to value. What that means is that some people are doing things in months and other people are taking hours. And this isn't big companies versus small companies. The fast companies include large, small, public sector, commercial; all types of organizations are fast. And in those same market segments, there are organizations that are slow. So it's much more about the way you think about organizing yourself than the market you're in or whether you're a startup or not. However, since this book came out, they did an update. This book was based on the 2017 data. The 2018 data came out last year, obviously. And now there's another group of companies doing stuff in minutes instead of hours. And this is basically cluster analysis, so there is a significant number of elite organizations doing stuff in minutes that other people are doing in months. Seven times lower chance of failing in production and 2,555 times faster lead time from commit to deploy. So what we want to do is learn to do simple things quickly, which unblocks innovation across the whole organization, and avoid the sort of complex one size fits all processes, because by the time you finish defining what that process is, you could have done a bunch of stuff, and it's usually obsolete by the time you've over defined it.
So one of the traps I see is central architecture teams over defining what the architecture and the process should be, when you should just get out of the way and concentrate on a few other things like security, availability and scalability, and not try to over define the way the system should be built. So the best IT architecture today is minimalist, messy and actually inconsistent, because otherwise it wouldn't be learning, it wouldn't be evolving. It has guard rails for security, scalability and availability, it's designed to evolve really rapidly and explore new technologies, and it supports low latency continuous delivery. So that's the best IT architecture, and some people have got really good at this. Other people are working towards it. All right, so that's time to value. I spent quite a bit of time on that. I'll get through the rest a little bit quicker, I think. Let's talk about the next step. So once you've learned to do simple things quickly and repeatably, the next thing is typically a sort of cloud native architecture for greenfield, large scale deployed systems. So let's say you want to build like a Netflix style backend or a mobile backend or a distributed network system that never ran in your data center. Typically, this is the cloud first, cloud native architecture. The cloud gives you the ability to distribute globally so you can put machines wherever they need to be. You get cost optimized, high utilization. You autoscale everything so that you basically turn things off when there's nothing to do, and use a cloud native architecture. So some principles for cloud native: you pay as you go, and you pay a month after you've used the resources, rather than in a data center where you pay upfront. It's self service. There's no waiting, it's API driven. It's globally distributed by default. You're using cross-zone and cross-region availability models just as the way you naturally set the thing up. Extremely high utilization.
Your utilization should be many times higher, not just a little bit higher. In your data center, it's pretty hard to get much more than 10% busy on average across the systems, because you need the capacity for peak, and maybe at the weekends or midnight or whatever, there's nothing going on in them. The way you should work in these architectures is to just scale everything down, scale it to zero if there's nothing happening, and use immutable code deployment. So you're deploying alongside the previous system. You're doing blue-green pushes basically. Let's look at how we should do this. And there's some choices really now: should you use containers or serverless or both to get this done? So I'm going to start with a little analogy here. By the way, I'd like to say I do not do my own slides, you can probably tell. I failed art in high school, and so I reached out to this company and got a nice relationship going where I send the money and they send me slides that look like this. The company is called Silver Fox Productions, happy to give referrals to them. I said what I wanted was a picture of a Lego space ship and they did this whole deck for me. So the user need is we want to build a toy space ship, that's sort of the Millennium Falcon or something, I don't know, if you squint at it. It's not close enough to be infringing anything. So the problem we're trying to solve is to make a model space ship quickly and cheaply. And the way we would do it traditionally is we design the prototype, then we carve it from modeling clay to get the thing we wanted, the right shape. You then make molds from the carved model. This is the way that most plastic toys are made. You produce these injection molded parts, you assemble all those parts with some robot-looking thing, and then you sell this finished toy. And the problem is that this takes potentially months to do. Add a bit of 3D printing in the middle, and this is kind of the way you do it nowadays.
But still it's something where ... And this is sort of like the way we do traditional development, it's a step by step approach. The other approach we have is very rapid development, where you have a big bag of blocks, Lego bricks, you have some instructions, and then in a few hours, and this is my favorite animation, I don't even know how you do that in PowerPoint, but anyway, you have this finished toy which looks vaguely like the thing that you wanted, but it's made out of Lego bricks. And you just put it together. Now if you look at it, it lacks fine detail, it's a bit lumpy. You can see what it is, but it wasn't exactly what you wanted. It's really easy to modify and extend though, so you can add little pieces to it. And then you can optimize it. So let's say I want the point ... Lego bricks aren't very pointy. I want a pointy bit here. So I can just build a custom Lego brick. If you look at what Lego does, they keep throwing in custom bricks that look like nose cones and stuff, or wheels, because it's really hard to make a wheel out of Lego bricks. They're not round enough. So you form these new things. You kind of get the analogy here, but you're now building a more specialized component that you can use and reuse in other places. All right, so what are we talking about here? Traditional and rapid. Instead of full custom design, we go to building bricks. Instead of months of work, you can do this in hours. Instead of custom components that you have to debug, that are hard to build and take a while to get working, you have standard, reliable, scalable components. Everything basically just works. You get too many detailed choices if you're custom building, and if you're working with Lego bricks, the choices are fixed. You just go faster because there's much less argument and discussion. There's only one way to build it. And so that cuts your decision cycles down, and the constraints in an architecture are one of the ways you move much faster.
So finally getting to the point here: containers and serverless. So if you're building something with containers, again, custom code and services, you can spend a long time just trying to figure out which container framework you want to use. With serverless, you just get on with it and start building it, with much more standardized choices. What I'm seeing people do is build really rapidly with serverless, and then optimize. So you rapidly throw something together, then you say, "You know what? This bit here, I need better startup latency, or it needs to be a bit more efficient, or I'm going to do some long-running jobs. So we'll do that bit with containers." That's the combination. So I'd say serverless is great for rapid prototyping, and then you optimize it later for maybe large-scale production if you need to. So that's the at-scale piece. We are going to wrap up talking about the most strategic, critical workloads as we get into data center replacement. So we're starting to see data-center-to-cloud migrations for the most business-critical, safety-critical workloads. There are multiple airlines going all in on AWS. The entire airline doesn't want to have any data centers. There are several banks doing this, there's a bunch of industrial automation companies doing this. So we're working on things that are safety- and business-critical workloads. And we're figuring out the patterns and solutions across all these different industries. So here's a question, do you have a backup data center? Is everyone here? Hopefully you are all still awake out there. Who has a backup data center? Yeah. A few of you. Okay. How often do you fail over to it? People start looking worried at this point and say, "Not often enough." And then how often do you fail over the whole data center at once? It turns out if you're in finance the answer had better be once a year, because the auditors come and make you do it.
And everyone gets bent out of shape for a few weeks preparing for that. So that's normal. And that's availability theater: you invested in a backup data center but you didn't really get any real value out of it. So this is a nice fairy tale we have: once upon a time, in theory, if everything works perfectly, we have a plan to survive the disasters we thought of in advance. Not a great ... It'll keep you up at night basically, and it's not a good fairy tale for a bedtime story. So here's a few things. There was the SaaS vendor that forgot to renew their domain name. They were down for a couple of days. The only thing they had left was Twitter, and the CEO was apologizing on Twitter for about two days because all the email was down. They used one domain for all their internal email and the product and everything. And think about how disastrous that would be. Sort of like, "Okay, well how would that apply to wherever you work, if whatever-your-work's-name.com disappeared, or whatever it is?" There's a few things you can do about that, like make sure your internal email isn't on the same domain as your product, for example. Or have multiple DNS suppliers or whatever. I would practice flipping around with DNS or something, or make really, really sure you remember to renew your domains. So this one, everyone's seen this. Only occasionally do people forget to do the domain name. If everyone in this room hasn't suffered from a security certificate expiring and taking something out I'd be really surprised, because it's happened. This is universal. We have really detailed systems at AWS for registering them and tracking them and making sure they all get renewed on time and all this stuff. We built something like that at Netflix as well. It's a lot of work and it's really annoying when they go wrong.
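As a rough illustration of the kind of certificate tracking described here, a minimal Python sketch might flag certificates that are close to expiring. This is not how AWS or Netflix built their systems. The inventory, the hostnames, and the 30-day threshold are all hypothetical, and the expiry strings are assumed to be in the `notAfter` format that Python's `ssl` module reports.

```python
import ssl
from datetime import datetime, timezone

# Hypothetical inventory: hostname -> certificate "notAfter" string,
# in the format reported by ssl.SSLSocket.getpeercert().
INVENTORY = {
    "www.example.com": "Jan  5 09:34:43 2030 GMT",
    "mail.example.com": "Mar 10 12:00:00 2020 GMT",
}


def days_until_expiry(not_after, now):
    """Return whole days from `now` to expiry (negative if already expired)."""
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).days


def expiring_soon(inventory, now, threshold_days=30):
    """Return (hostname, days_left) pairs at or under the threshold, most urgent first."""
    flagged = [
        (host, days_until_expiry(not_after, now))
        for host, not_after in inventory.items()
    ]
    return sorted(
        (item for item in flagged if item[1] <= threshold_days),
        key=lambda item: item[1],
    )
```

In practice the inventory would probably be filled in by connecting to each endpoint and reading `getpeercert()["notAfter"]`, with the check run on a schedule so that a renewal is never a surprise.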
Recently I think Ericsson shipped something with a security certificate that timed out to the UK cell phone system, and I think one of the cell phone vendors in the UK went offline for multiple hours, totally offline, a countrywide outage. And there's like several hundred million dollars' worth of lawsuit going on right now from a certificate timeout. So that's probably the worst case I can think of right now. It turns out computers don't work underwater, and when the water goes away and they're full of mud and bits of seaweed, they also don't work very well. A friend of mine spent several months rebuilding a data center that was in a basement in Jersey City, and it was not a happy place. So these things can happen. Hopefully not to you tomorrow, but it could happen. How do we deal with this? And really the key thing here is to have fast detection and response, because you can't really figure out what will happen. I mean, you can take the obvious things and work through them. Once you've worked out how to fix all the obvious things, the outage you get will be something you've never seen before. Chris Pinkham was the engineering manager for the first version of EC2 and I was on a panel with him years ago when he came up with this phrase. I just wrote it down. I've been quoting him ever since. Here's the problem. There's a really good paper by Peter Bailis and Kyle Kingsbury, "The Network is Reliable," which of course it isn't. That's the first fallacy of distributed computing. And it's a long, long list of the ways that things don't work. This book's really interesting. It says even if you do everything right and nobody makes any bad decisions at any point in the chain of events, you can still have a catastrophic outage. And there's a bunch of examples in this of planes crashing and people dying in hospital. I tell people not to watch it, not to read the book on a plane.
Chapter two of this is about a plane crash, end to end, and I did a talk at GOTO Chicago last year, you can go find that talk if you want to see me doing a complete end-to-end discussion of a plane crashing. Actually it's fairly traumatic. But anyway, I won't go into that any more. But there's a lot you can learn from the airline industry about how they share. Every time there's a failure on a plane, it's shared with every other plane, from every manufacturer, every operator. There's a central registry where all failures are reported and shared. That's a really powerful concept, and we need to start thinking about doing that kind of thing in IT as well. Release It!: The first edition of this book was really critical. It was one of the books we read at Netflix a decade ago. We sort of dived in, we built a whole bunch of things. So if you have a circuit breaker, that's a design pattern that came out of the first edition of this book. Michael updated the book a couple of years ago. There are lots more good ideas in there. Resilience: In the past we had disaster recovery and now we have chaos engineering, and in the future we're just going to have some kind of resilient critical systems. That's what we want to get to. So chaos engineering, the history of this probably goes back to Jesse, who was one of the founders of Chef later on. But while he was at Amazon, he was called the master of disaster. He used to go around pulling plugs in data centers and making sure that the Amazon website didn't go down. This was just before AWS. The other thing is he's a part-time fireman, that's his spare-time job. So he kind of deals with this emergency services idea. And he once simulated what would happen if a data center caught fire. He didn't just pull the plug, he said, "I'm going to shut it down in a sequence that simulates if there was a fire in this rack here, how would it spread and in what order would various things fail." He simulated that once.
And then if you think about a data center flooding, you discover whether you have cabling underfloor or above rack. Let's just bring the water up and, okay, which bits of the data center are going to fail first? Okay, I have my power distribution under the floor, that's probably not going to work out so well. But if it's above the rack, things are actually going to work for a little while before it fries itself. 2010, Greg was our ... I think the first implementation of Chaos Monkey was the first thing we ever shipped to the cloud, before any applications went in there. The Chaos Monkey got there first. And if you're trying to do chaos engineering, always deliver the chaos processes before you put the applications in, because otherwise no one will ever let you do it afterwards. They'll be like, "You'll just break everything." If you have to survive the monkeys to get into production, that's the right way around to do it. We open sourced it in 2012. Gremlin, who are out here, and I think Kolton's on later, that was founded in 2016. The Chaos Engineering book came out in 2017, and I think the first version of this slide I wrote when I was at the chaos conference last year. We're starting to see more startups. This is becoming an interesting area. So the way I think about chaos architecture is four layers, two teams and an attitude. This is the book that describes all the different things. The layers are the people at the top. So you want to make sure that they don't mess things up or break things that are working. The application has to be able to survive seeing something it wasn't expecting. You've got to be able to switch between things. That switching is probably the least well-tested part of your system. If you think about a disaster recovery failover to a backup data center, that is a switch which you didn't test. Every time you try to do it in anger, it will probably break horribly.
Because it's the least well-tested part of your system, and it's the thing you're relying on to make you highly available. That just sounds wrong to me. On infrastructure, you've got to have more than one way to do everything. So the chaos engineering team will use a bunch of tools to work through these layers, like game days, which are the way you exercise people. If you're trying to get into this, if you do nothing else, nothing on the tooling side except have a game day, then at least people know what to do if something breaks. That's the first place I'd start. And then you can start seeing, "Okay, here's the tooling that we can make slightly better. Here's the way we can improve it." And then on the security side, it's actually very similar. The red team idea for security is that there's a team trying to break into your systems and there's a team trying to defend. And that approach actually means that you're testing everything as you go forward. The problem with both of these is you're only as strong as your weakest link. You've got to have a dedicated team trying to find that weak link in order to make sure that you're really going to survive. The other thing about failures is they're a systems problem. A lack of safety margin is the problem. And if you see people saying the root cause was component error or human error, that means the system wasn't designed to make that individual error go away or cope with it. If you have enough safety margin, human error won't take out the system, and a component failure won't take out the system. So this is the core idea in the new world of thinking about safety and resilience. What we want to have is experienced staff that know how to be on call, they know how to fix things. They know how the tooling works. They know how to get into all the different dashboards that they need to log into.
You've got robust applications that are at least battle-tested enough that they're not going to keel over completely at the first sign of something strange, a dependable switching fabric that you exercise regularly, maybe every few weeks or so, on a redundant service foundation with whatever multiple ways to get things done that you need: networks and storage and regions and things. Cloud gives the automation that leads to chaos engineering. That was why we did it, why it came up at Netflix. Now we could actually automate our disaster recovery plans and build the whole thing by using APIs and a whole bunch of things. So the thing is that if you're trying to fail over between two data centers, the problem is they aren't the same, they drift. And when you fail over, you find that there's an out-of-date version here, or the configuration wasn't exactly the same, or a limit there wasn't the same, and it doesn't work. Whereas when you're API driven, you can run scripts that make sure everything is the same in both places, and the services are the same in both places. So what I'm seeing is that failure mitigation is going to move from this scary annual experience to automated, continuously tested resilience. And we'll be running that all the time. As we do that, we're going to be automating what we do, we're going to need a lot more ability to control things in production, and there are just a lot more versions. So that is, I think, why things like LaunchDarkly and feature flags are really important for how you tie all this together and how you get on top of it. Because there's just a lot more features, a lot more stuff to do, and the state of the system is much more complex, but the outcome is much better. That's what we're all going to get to. All right. So I'm done. I've got a minute or so for questions, if anyone wants to ask one. Audience Member: [inaudible] Adrian C: Yeah, there's versions. I usually put my slides on GitHub. adrianco/slides on GitHub.
I have this idea that maybe one day Microsoft, since they own GitHub now, will make me do pull requests on my PowerPoint decks. But they haven't built that yet. Yeah, I'll put them up there. There's also similar versions of things around. Any other questions? Yeah. Audience Member: [inaudible] Adrian C: Yeah. So the question is about if you're building real-time or critical systems, like Boeing for example. Probably the best example right now is Tesla. They do updates roughly once a month, something like that. They have a fleet of test cars they're deploying to all the time and they have feature flags in their code to turn these things on and off in updates. So I think that you still want to do lots of small incremental changes, because each change is going to be much more separable and your ability to debug it is much faster. I went to see SpaceX one time, I got a little tour, and they had this huge bench with the guts of a rocket laid out on it. All of the electronics of the rocket, and they have all the different versions of a rocket on different benches. And then they boot the software on that and test on that. So they're doing continuous delivery into a hardware simulator. So I think that's the approach I'd take for something like an aircraft critical system. But you are not delivering once every six months or something. You're building the test-driven development sort of framework so that you're running continuously with very rapid updates, because that's just going to let you go faster at building things. I think that that's one of the interesting things. If you look at the way Tesla as a company is constructed versus other car manufacturers, the others are on like a four-year release cycle for new product and Tesla's on sort of a monthly one, where they continuously modify the cars and make them slightly better. And that's because they go direct to consumer. They don't go through resellers, and actually that's ...
the whole structure of the company is quite interesting. The way that they're able to operate is actually completely different to the way the other car manufacturers operate. It's a good example of a sort of industrial, continuous delivery operation, although there's a whole bunch of other issues. But it's fascinating to watch. Yeah. Audience Member: As a developer, [inaudible] Adrian: So the question was how do you push this up through management? Hopefully the management is trying to push down to make you do something more strategic and get things to go faster. I'm seeing a lot of enterprises say, "We need to go faster. The business needs stuff delivered and they've been blocked by IT or whatever. They've been blocked by the speed of innovation in the organization." So there's a lot of top-down incentive to move faster. And then usually the engineers coming up from below kind of go, "We'd like to do things in ways that could go fast." And there's a layer of management in the middle that's getting in the way. And it's quite often project managers. If you think about moving from project to product, when you do that, you don't need project managers anymore, so they see their jobs going away. So hopefully not too many project managers in the audience here, but if you're a project manager, you should probably be trying to find a different job because we're going to decimate them. I mean, you need about 10%. After you've gone through the transition, maybe 10% of what you needed before. Because if you're not running projects that take a year to run, you're doing continuous delivery, you need a whole different skill set around it. So that Project to Product book talks about that in quite a lot of detail and there's some examples of companies that have gone through that transition. But there's a corollary here: they see their jobs going away and they will resist, so you have to kind of do a pincer movement on them.
By the way, if anyone wants me to come by and chat, I spend all my time talking to customers and talking to management, trying to persuade people to speed things up. I'm happy to come and give this deck to groups internally. I did a pre-run of this deck at a bank last week in London, so that's the kind of thing I do. I'm happy to be here and I'll be hanging around the rest of the day to carry on the conversation. So thank you.
