
How to Build Self Healing Systems and a Kill Switch. (Error Monitoring at Its Best)

Francesco Crippa, Rollbar

On-call schedules, constant reorganizations, fast turnover, impossible deadlines, new platforms and new frameworks every day… Building modern platforms is becoming unsustainable for the new generation of engineers. In these challenging scenarios, autonomous systems offer new hope: because they do not require human intervention, they are more manageable and sustainable.

Cloud computing, standardized APIs, common practices, and shared tools are opening the door to building self-healing systems – systems that can react quickly and automatically to an increase in load, recover from unexpected errors before users even notice them, and run rollbacks and migrations without any human interaction. While all of this is technically possible, specific practices (feature flagging and error monitoring) and strong discipline are needed to build such systems. Above all, a new engineering culture needs to be defined to make the dream a reality.


Francesco Crippa

Francesco combines his passion for engineering and philosophy to build incredible engineers who can grow and scale as individuals and teams. Prior to Rollbar, where he serves as VP of Engineering, he led big and small engineering organizations at large companies like Cisco and Zillow to redefine and transform their platforms, always providing best-in-class developer experience and ergonomics. Francesco has presented on engineering topics at conferences around the world for the past 20 years, from opening keynotes to technical sessions and workshops, and as one of the organizers (events like Dreamforce, Cisco Live, JavaOne, RedHat Summit, Enterprise Connect, Linux Tag, Linux Summit, Linux World, PyCon, FudCon and many others).

(upbeat music) - Welcome, everyone. Today we'll talk about self-healing systems and how we can use a feature flag to build systems that fix themselves when something goes wrong. Spoiler alert: this is not a talk about AI or machine learning, or how we can write code to fix other code, or how we can write software to write other software. This is very much about the bread and butter of how a feature flag can be used with automation to trigger automatic behavior when something is not behaving as we expect. So, to jump into that: I'm Francesco. I run the engineering team here at Rollbar, and what we do is basically provide error monitoring as a service. And we'll see that this is one of the main building blocks for this automation. But today we're gonna really dig deep into how we can build a kill switch using a LaunchDarkly feature flag. And my goal in the 30 minutes that we have is really to write some code to show you that it's very easy to build a self-healing system. So let's jump into that.

First of all, I would like to bring a little story from my past. So, you're seeing a screen here because I love fairy lights and I love string lights, but I have a weird memory from my childhood. I remember when I was a kid and the day came to put up the decorations for the holidays. I was very happy, very excited. But the memory goes pretty much like this. We take the big box of fairy lights out of the basement. And the first thing to do with the fairy lights is to test each string to see if it actually works before you put them all around your house. And they always do. They always work, they are always fine. Then you start decorating your house. You spend hours doing so. And when you plug in the cable, they never work. There was always one bulb that was off. And because that bulb was off, the full string was down. And I remember this as an extremely painful memory. Every time I connected a fairy light, I expected it to be broken. All the time.

And I'm using this example because the issue here, I think, was very much in the design of the string lights. Now, like many practices, engineering tends to optimize for cost at the very beginning of something and for maintenance later in the life cycle of a product. So back in the eighties, when I was a kid, string lights were cabled in series. That means that when one bulb is broken, the full string doesn't work. And that's what creates the issue. If you attach five strings to each other, the chances that one single bulb is broken are really high. And the only signal you get is that something's broken, so the lights are off, but you don't know which bulb it is. And this, to me, changed a lot when the design of the fairy lights changed. Right now there is no way you can buy a string light where the bulbs are cabled in series. They're all cabled in parallel. So when a bulb is broken, you simply see that it is the only bulb that doesn't light up, and you can easily replace it. But that was not the design at the beginning. And that's a common pattern that we see in engineering.

Another story I would like to bring, again from my youth: I was driving my father's car, and the water temperature in the car went all the way up. I immediately stopped the car, called my dad at home and said, "Hey, the car is broken." And the question was, "What happened?" And the answer was, "I don't know what happened. I have no idea why the cooling liquid in the car is so hot." I just knew that it had happened and that the car was basically asking me to stop immediately because it was not safe to continue driving. I'm using this example, again, because this is a clear pattern we see in engineering but not necessarily that often in software engineering. While I had no idea what had happened to the engine of the car, I had a clear idea of what I had to do immediately after: I had to stop the car and ask for help.

And you probably see where I'm going with these examples. This picture is basically a good representation of what we're talking about today. Electrical circuit breakers are not designed as a system that cuts the electricity in case something goes wrong. They are designed so that they work only if the correct inputs are applied. When something is not as designed, they just automatically switch off. So they are stable only when the right conditions are met; they don't have to react during the event when things go wrong.

I'm using these three examples from my personal experience, from my life, because I think they tell a very good story about where we are as an industry with software engineering. Software engineering is a practice that is relatively new, and a lot of the common sense that we see in our everyday life is not quite applied yet. So today my goal, again, is to try to build in a few minutes the ability to have some of those elements. If something's broken, I know exactly what it is because it's evident. It doesn't require further investigation. It's there, I can see it and I can immediately act. If I don't see it, I still have a clear idea of what the action is and I can automatically trigger that action. And again, the goal is building a circuit breaker like in real life.

So, well, this is great, but what do we have in software engineering? Right now, this is what I think most software engineers experience every day. We live in an industry that is absolutely obsessed with the idea of root cause analysis, the idea of observability, and the idea of aggregating all of this into one single entry point. There are great tools around, but generally all the tooling we have tends to have the same sort of issue. When we look at what's going on, the only thing we know is that something is wrong. We don't know exactly what is broken and what we need to do in order to fix it. And so we believe that developers need a better way to see what they're doing.

Now, we live in a world where traditional monitoring is not really providing what the current complexity of systems requires. The most visible aspect of this is that there is basically one single workflow for error response that is common across the whole industry, which is automatically triggering a page. Someone wakes up in the middle of the night and we ask that person to figure out what's going on. That's pretty much the level of automation we have today. And we don't go any further than that. We wake up someone because we need someone to work on something, but that is pretty much a manual process. And we believe this can change. And we believe that feature flags are one of the main enablers of these new workflows.

Another thing that has changed a little bit in the industry is that the job of a software engineer is not the same as it was 10 years ago. We don't build software that is done at some stage. While SaaS, cloud deployments, and building systems as a service were not the mainstream way to build software for a long time, today it's basically the only way to build software. It's the vast majority of the industry. So nothing is ever done. Nothing is completed. And things keep changing every day. We push code to production every single day, nonstop, and it's never done. We don't work in waterfall anymore. We don't work with stable constraints and stable requirements anymore. Everything keeps changing. And again, the industry adapted with something that is a little concerning, which is, again, software engineers on call. If something doesn't go as we planned, well, they're going to try to fix it somehow in the middle of the night.

So, one data point... This is research data from Forrester, which blows my mind every time. Basically 37% of the industry, so almost one company in every two, has full continuous delivery and continuous integration capabilities. They invested in having full automation from dev to production. You press a button and you run all your tests, you do your quality assurance and you go to production, fully automated. So, billions of dollars of investment in the past 5 to 10 years resulted in a lot of automation being in place. But weirdly enough, only 1% of the industry is actually pressing that button every day. Everyone else is just lacking the confidence to do so. The tooling is in place, the architecture and infrastructure are in place, but the confidence to press the button is not there.

So this brings us to the first building block that we are going to see today: the error monitoring pattern. Error monitoring is a new practice that relies on something that has been kind of forgotten in our industry: stack traces. Stack traces are by far the piece of information that, more than anything else, tells you what's broken in your system, because they are mechanically produced by your runtime. You don't miss a stack trace. If something breaks, you do have a stack trace. They are extremely accurate because they are exactly the story of what's happening. They're very technical because you see the whole stack and what went wrong, and they're connected to the code. So you know exactly which line of code was the one that had the bad impact on the system. So, if they're so valuable, why is the industry not using them every day? Well, because unfortunately they're also very hard to consume.

Stack traces were designed almost 50 years ago, in a different time with different requirements, and basically they can be a pain in the back to really analyze and use properly. So, we live in a time where most stack traces are just considered strings that developers might use to debug some issues, but no more than that. Now, we've built a very different way to process stack traces. And this is where error monitoring as a category started.

So, to give an example, imagine that you have a system and this system is not perfect. It's throwing errors. And all these errors are somehow similar to each other and somehow different. In this example on this slide, some errors have the same shape but different colors, and other errors have the same color but different shapes. In real life, it's a little more challenging than that. You can have a lot of null pointer exceptions in your system, but from different lines of code. And you can have the same line of code throwing different stack traces, for example because the inputs of that line are different. If you pass an integer you get one error, if you pass a string you get a different error. Because of that, again, most of the tooling out there is just using stack traces to show a log of things that happened.

In error monitoring what we do is actually apply a lot of magic (indistinct) to extract a clearer signal out of the stack traces. In the example on this slide, it was just a matter of removing the colors; it's not that easy in real life. We need to do somewhat more complicated computations and transformations. This is where we apply machine learning. This is where we apply a lot of statistical analysis. And this is also where we apply a lot of very basic tracking. For instance, say an error happens at line three of a file and I decide to add an empty line before it. Now this error is going to happen at line four. It's a very simple change, but it will generate a completely different stack trace. This is just one example of why it's hard to rely on stack traces if they're not aggregated properly: they are very hard to track over time. But that's what we do in error monitoring. We do that as a service.

And the outcome is actually something very different, because you don't necessarily need details for each instance of the errors that are happening. You probably want this automatic categorization and classification of errors to make sure that the most important errors are the ones you are focused on. For instance, if a stack trace has never happened in the history of your system, ever, but starts happening for 50% of your users the moment you deploy a new version to production, this is a clear signal for the person that wrote that line of code that something is probably broken. At the same time, if you have an error that has been happening every single day for the past 10 years for all users and you never fixed it, this is probably not an error that you care about that much. So it's better not to have that noise around. The idea in error monitoring is really to distill what matters and to present it in a way that is accessible programmatically. And in doing so, this is the result.
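The line-three-versus-line-four problem can be made concrete with a small sketch. This is not Rollbar's grouping algorithm (the talk mentions machine learning and statistical analysis on top of tracking); it is a minimal illustration that assumes standard CPython traceback text and groups errors by exception type plus the file and function of each frame, ignoring line numbers and messages. The names `fingerprint` and `group` are made up for this example.

```python
import hashlib
import re
from collections import Counter
from typing import Iterable

# Matches frame lines of a standard CPython traceback, e.g.
#   File "app.py", line 3, in get_quotes
FRAME_RE = re.compile(r'File "(?P<file>[^"]+)", line \d+, in (?P<func>\S+)')


def fingerprint(traceback_text: str) -> str:
    """Return a stable ID for a traceback, ignoring line numbers and messages."""
    frames = [(m["file"], m["func"]) for m in FRAME_RE.finditer(traceback_text)]
    lines = traceback_text.strip().splitlines()
    last_line = lines[-1] if lines else ""
    # "ValueError: invalid literal ..." -> "ValueError", so that different
    # inputs to the same code path land in the same group.
    exc_type = last_line.split(":", 1)[0].strip()
    key = repr((exc_type, frames))
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:12]


def group(tracebacks: Iterable[str]) -> Counter:
    """Count occurrences per fingerprint."""
    return Counter(fingerprint(t) for t in tracebacks)


BEFORE = '''Traceback (most recent call last):
  File "app.py", line 3, in get_quotes
    count = int(raw)
ValueError: invalid literal for int() with base 10: 'abc'
'''
# Same bug after an empty line was added above it, hit with a different input.
AFTER = BEFORE.replace("line 3", "line 4").replace("abc", "xyz")

print(fingerprint(BEFORE) == fingerprint(AFTER))  # True
```

A real service has to handle far more than this (renamed functions, minified code, several languages), which is why the aggregation is worth consuming as a service.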

So this is what a stack trace looks like in an error monitoring solution. In this case, you see that something blew up, but it's line 192 that is actually the reason. And you see exactly which variables are responsible for that. You can also inspect the variables, so you see which values in production, in staging, in dev, triggered that error. And again, this is something that I'm doing without even reproducing the issue, just clicking around when I get the notification that something is broken. And because it's programmatically available, I can start building a deep fabric of connections with other tools. For instance, with LaunchDarkly but also (indistinct) maybe. If there is a bug, I want to open a ticket or ask someone to fix it, for instance. And because these are mainstream, modern APIs, it's actually very easy to start connecting things and building much better automation out of that.

And this brings us to the real point of today. How do we build a kill switch in our application? And how do we make sure that when something's broken, we just take that feature out until someone, tomorrow and calmly, not in the middle of the night, finds out why things are broken? To do so, clearly the building block that we need is a feature flag. And since this entire conference is about feature flags, I'm not going to get too deep into what they are and how to use them.

There are a lot of talks that can tell you better than me how to master this technology. But I would like to give you a quick demo of how this might look in real life. So, I prepared here a very simple application. It's a widget that motivates our team by telling them how many days are left in the quarter. So, if you're a salesperson, if you're in a team, you know that that's important information. I also prepared some groundwork: I have a project in Rollbar that is set up and a nice feature flag in LaunchDarkly. And we can run our application here and see how this looks. The best part of this app is that it motivates people using a very nice font telling you, "Don't panic," there is still time till the end of the quarter. So we have (mumbles) widget. We can embed it around, we can put it in Salesforce, but as you see in this code, we also want to add a new feature that creates extra motivation by providing some quotes that can help in reaching the goals. So as you put the number in that URL, you start consuming an external service that gives you some inspirational quotes that might help you. And if (mumbles) more quotes, well, you get more motivation, and that's totally fantastic. But at the same time, we are increasing the complexity of the application. Now this simple widget is relying on an external application.

Now let me give you a gist of how error monitoring can be useful. I have a script here that is throwing a bunch of weird errors. It's using the app in a weird way. And you see that Rollbar automatically catches that something is not quite working correctly and reports one new error that hadn't been discovered yet. And you can see the line of code that is generating the error. And you can see that, in this case, there is some sort of conversion of a string into an integer. And of course, I can inspect this variable and say, "Yeah, quote count is actually a string, but the code doesn't account for that." So I need some more careful coding and a standardization of my inputs. But what I also see here is that there are like 42 different errors of the same type with different strings. And still, for an error monitoring tool, this means there is only one error that needs to be solved.

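The string-to-integer error in the demo is the kind of bug that input standardization removes. The demo's actual code is not shown in the transcript, so the following is only a sketch of one way to do it; the function name `parse_quote_count` and the `default` and `maximum` values are assumptions made for illustration.

```python
def parse_quote_count(raw, default=1, maximum=10):
    """Turn an untrusted URL parameter into a safe number of quotes."""
    try:
        count = int(str(raw).strip())
    except ValueError:
        # Not a whole number ("abc", "3.5", empty, missing): use the default
        # instead of letting the exception take the page down.
        return default
    # Clamp to a sane range so "-5" or "999999" cannot cause trouble later.
    return max(0, min(count, maximum))


print(parse_quote_count("3"))    # 3
print(parse_quote_count("abc"))  # 1
```
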
So this is the power of error monitoring, but in this case, let's assume we fixed the issue. We don't care too much about input standardization for now. So let's mark it as resolved and pretend the code has been fixed. Then let's see if we can protect our new feature, this quotes and motivation expansion, with a feature flag. So I switch on this feature flag and I cut and paste my feature. And let me wrap this part of the code with this new feature flag.

Now, you're probably very familiar with this. I'm gonna add the feature flag. I'm going to get the LaunchDarkly client and I'm going to put these few lines of code under the feature flag. If you're using common best practices for releasing software incrementally, you probably don't need to do any extra instrumentation in the code. Very likely, the feature flag you were using to hide the feature in the first place can become the flag that you use to trigger the kill switch with automation. But in this case, let's put in some data. We don't care too much about the user because it's just a demo and, as you can imagine, this is not your application. And there we go. Now we have our feature flag, and let's wrap our quotes with this flag. Now, if the external service that is providing these quotes is down, you can imagine that the entire page will crash. And this is what we want to avoid in the first place. So our application is still working. Let's get a bump of motivation through even more quotes and let's see how we can actually instrument the automation.
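The wrapping step can be sketched without the real SDK. In the demo the check is a call on the LaunchDarkly client; here a plain in-memory dictionary stands in for it so the example runs on its own. The flag key `show-quotes`, the `FLAGS` store, and the function names are all hypothetical.

```python
from typing import Callable, Dict, List

# In-memory stand-in for a feature-flag service (hypothetical; the demo uses
# a LaunchDarkly client for this lookup).
FLAGS: Dict[str, bool] = {"show-quotes": True}


def flag_enabled(key: str, default: bool = False) -> bool:
    """Return the flag value, falling back to `default` for unknown flags."""
    return FLAGS.get(key, default)


def render_widget(days_left: int, fetch_quotes: Callable[[], List[str]]) -> str:
    """Render the countdown; only touch the quote service when the flag is on."""
    lines = [f"Don't panic: {days_left} days left in the quarter."]
    if flag_enabled("show-quotes"):
        # Risky part: depends on an external service, so it sits behind the flag.
        lines.extend(fetch_quotes())
    return "\n".join(lines)


def broken_service() -> List[str]:
    """Simulates the external quote service being down."""
    raise ConnectionError("quote service is down")


FLAGS["show-quotes"] = False              # what the kill switch will do
print(render_widget(12, broken_service))  # the core widget still renders
```

The design point is that the flag guards the call itself: with the flag off, the failing dependency is never reached, so the rest of the page does not need its own error handling for it.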

That is the interesting part. Well, first of all, we need to identify what to do when things break. And I did some pre-work here. This file you see over here is just... There are multiple lines, but there's one single HTTP call. We just set the headers and the payload. This is an API in LaunchDarkly to change the value of a flag programmatically. So if you call this method... What it does is basically switch off that flag. Super simple, and there are a million ways to do it. It doesn't need to be in code. It can be a lambda function, it can be somewhere else, but it's just an example for now.

Now back to Rollbar. Let me go into settings here and configure the automation to automatically switch that off. There are different ways to do it, but I'm going to use a webhook for simplicity here. And since the production environment where this application is running is actually my laptop, I need to expose my application using ngrok. So I'm going to use ngrok to get a public URL. There we go. Which I cut and paste into Rollbar. And the URL is actions/killquotes. And here we go, I enabled the webhook and some magic happens here. Now, Rollbar automatically provides a template of a rule engine that will trigger the notifications. It's very configurable, of course, and we can do a bunch of things. What's the error rate we want to use? Only new errors? Reactivated errors and regressions? Only for production, only for staging? A bunch of things that we're going to totally skip for lack of time today. But let's say, if something is error or critical in production, just click the button and take down that feature. That is the meaning.

And with this in place, let's see what happens if I simulate, for instance, that this external service that provides quotes goes down. So, the system is up. I have a script here that creates the error. This service is down. It could be that I consumed my quota or whatever it is, but you see, if I load the application, the application doesn't break. The application just got rid of that extra feature. And if I go back to LaunchDarkly and refresh the page, look at that, the feature flag is off automatically and we didn't write a single line of code. So this is a super simple example of how a system that relies on error monitoring and feature flags can be used to achieve very high-value automation with basically a point-and-click kind of interface. It's a simple example, but it's very powerful, because what we're building with that is really and truly a self-healing system. We don't need to wake up in the middle of the night and try to disconnect or disable this external service because it's not available and it's killing our lead page and (indistinct) forth. For instance, we can just wait for the service to come up again and click the button to re-enable it. That's pretty much what we need to do.
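The "one single HTTP call" and the webhook that triggers it could look roughly like the sketch below. It assumes that LaunchDarkly's REST API accepts a JSON Patch on `/api/v2/flags/{projectKey}/{flagKey}` with the path `/environments/{environmentKey}/on`; check the current API reference before relying on that. The project, flag, and environment keys and the token are placeholders, and `handle_webhook` is a made-up helper, not part of either product. The rule that decides when to fire (error or critical, production only) is assumed to live in Rollbar, so the handler does not inspect the payload.

```python
import json
import urllib.request
from typing import Callable

LD_FLAGS_API = "https://app.launchdarkly.com/api/v2/flags"  # assumed base URL


def build_kill_request(api_token: str, project_key: str, flag_key: str,
                       environment_key: str) -> urllib.request.Request:
    """Build (but do not send) a PATCH that turns a flag off in one environment."""
    patch = [{
        "op": "replace",
        "path": f"/environments/{environment_key}/on",
        "value": False,
    }]
    return urllib.request.Request(
        url=f"{LD_FLAGS_API}/{project_key}/{flag_key}",
        data=json.dumps(patch).encode("utf-8"),
        headers={"Authorization": api_token, "Content-Type": "application/json"},
        method="PATCH",
    )


def handle_webhook(path: str, kill_request: urllib.request.Request,
                   send: Callable = urllib.request.urlopen) -> int:
    """Return an HTTP status for an incoming webhook, firing the kill switch."""
    if path != "/actions/killquotes":
        return 404
    send(kill_request)  # the real network call unless `send` is replaced
    return 200


# Dry run: collect the request in a list instead of sending it.
request = build_kill_request("api-token", "demo-project", "show-quotes", "production")
sent = []
status = handle_webhook("/actions/killquotes", request, send=sent.append)
print(status, request.get_method(), request.full_url)
```

In a real deployment `handle_webhook` would sit behind whatever web framework serves the application, and the endpoint should verify that the caller really is the monitoring service before flipping a production flag.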

So, some takeaways. I know there's not a lot of time and we covered a lot of things, but the main takeaway that I would like you to remember from today is that a self-healing system, as a pattern, is very much available and is very much something that today's software engineers need. It's something that is easy to do. It doesn't require weird investments in AI and machine learning and complicated things. It's really just using best practices and connecting the dots in a powerful way. I also believe that our industry, the tech industry, desperately needs a new way to consider errors. Right now we are putting all the systemic issues we have in our systems on the shoulders of software engineers. We keep asking people, when something breaks, to figure out what it is and fix it as quickly as they can. And maybe we should start thinking differently about how we react to errors. We can easily take things that are not critical down and fix them calmly when we have time, without any business impact on what we are delivering. I'll leave you with this quote from William Gibson that probably all of you have heard many, many times, "The future is already here, it's just not evenly distributed," because while feature flags are becoming the most common pattern you see in software engineering, I also believe that today not many people have tried to use feature flags as a building block for more complex and more interesting architectures. And this is very much the beginning of this discovery. So, this is everything I have for you today. I hope to hear from you soon. Have a great day. (upbeat music)